
Part 4: GGUF Conversion and Serving with llama.cpp

Ragtoolina Team

Why GGUF?

Most ML tutorials end at model.save(). You've got a directory of safetensors files, a config.json, a tokenizer.json, and you load it with SentenceTransformer("./my-model") in Python. Done.

Except we're not running Python. Ragtoolina is a native macOS app written in Swift. The embedding model runs through llama-server — the HTTP server from the llama.cpp project. llama.cpp reads one format: GGUF (GPT-Generated Unified Format).

GGUF has three properties that make it perfect for our use case:

  1. Single file. One .gguf file contains the model weights, tokenizer, and architecture metadata. No directory of 15 files. Easy to ship in an .app bundle.
  2. Quantization built in. The format natively supports quantized weights (Q8_0, Q5_K_M, Q4_K_M, etc.). A 286MB FP16 model becomes 153MB at Q8_0 with negligible quality loss.
  3. Metal acceleration. llama.cpp has a mature Metal backend. On Apple Silicon, the model runs on the GPU with zero configuration — just pass -ngl 99 to offload all layers.
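The single-file property is easy to verify: every GGUF file begins with a fixed header — the magic bytes `GGUF`, a format version, then tensor and metadata key/value counts. A minimal sketch (`read_gguf_header` is our illustrative name, not a llama.cpp API):

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, uint32 version, uint64 counts."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}
```

Everything else — weights, tokenizer vocabulary, architecture hyperparameters — follows as metadata key/value pairs and tensor blobs in that same file.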

The Conversion Pipeline

The pipeline has two stages: convert HuggingFace format to GGUF FP16, then quantize to the target precision.

Prerequisites: Building llama.cpp

ModernBERT support in llama.cpp was merged on December 22, 2025 (PR #15641). You need a build from after that date. We vendor llama.cpp in our repo:

git clone https://github.com/ggml-org/llama.cpp vendor/llama.cpp
cd vendor/llama.cpp
cmake -B build
cmake --build build --config Release

This builds two tools we need:

  • convert_hf_to_gguf.py — Python script that reads HuggingFace model files and writes GGUF
  • llama-quantize — Binary that reads a GGUF file and writes a quantized GGUF file

Stage 1: HuggingFace → GGUF FP16

python3 vendor/llama.cpp/convert_hf_to_gguf.py \
  models/ragtoolina-embed-v1-hf \
  --outfile models/ragtoolina-embed-v1-f16.gguf \
  --outtype f16

This reads the safetensors weights, maps the ModernBERT layer names to llama.cpp's internal tensor naming convention, and writes a single GGUF file in FP16 precision.

Output: ragtoolina-embed-v1-f16.gguf — 286 MB

This is the lossless reference. Every subsequent quantized variant is derived from this file.

Stage 2: Quantize

# Q8_0: 8-bit quantization (production)
vendor/llama.cpp/build/bin/llama-quantize \
  models/ragtoolina-embed-v1-f16.gguf \
  models/ragtoolina-embed-v1-q8_0.gguf \
  Q8_0

# Q5_K_M: 5-bit mixed quantization (lightweight)
vendor/llama.cpp/build/bin/llama-quantize \
  models/ragtoolina-embed-v1-f16.gguf \
  models/ragtoolina-embed-v1-q5km.gguf \
  Q5_K_M

The Three Variants

| Variant | Size | Precision | Use Case |
|---------|------|-----------|----------|
| FP16 | 286 MB | 16-bit float | Reference, benchmarking |
| Q8_0 | 153 MB | 8-bit integer | Production default |
| Q5_K_M | 113 MB | 5-bit mixed | Disk-constrained deployments |

Why Q8_0 for production? Embedding models are more tolerant of quantization than generative models. A generative model needs precise logits to produce coherent text. An embedding model only needs to preserve the relative ordering of similarity scores — which vector is closest to the query. Q8_0 preserves this ordering almost perfectly while halving the file size.
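A toy simulation makes the point concrete. This is pure Python, not llama.cpp's actual Q8_0 kernel (which quantizes in 32-value blocks, each with its own scale) — but the principle is the same: snap every weight to one of 255 levels and check whether similarity rankings survive.

```python
import math, random

def quantize_q8(vec):
    # Symmetric 8-bit quantization: map to [-127, 127] integer levels,
    # then dequantize back to floats.
    scale = max(abs(x) for x in vec) / 127.0
    return [round(x / scale) * scale for x in vec]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

random.seed(42)
query = [random.gauss(0, 1) for _ in range(768)]
# Documents at increasing "distance" from the query: doc i = query + growing noise.
docs = [[q + 0.4 * (i + 1) * random.gauss(0, 1) for q in query] for i in range(10)]

rank_fp = sorted(range(10), key=lambda i: -cosine(query, docs[i]))
query_q = quantize_q8(query)
docs_q = [quantize_q8(d) for d in docs]
rank_q8 = sorted(range(10), key=lambda i: -cosine(query_q, docs_q[i]))
```

With 768-dimensional vectors, the per-component rounding error averages out in the dot product, so the full ranking comes back identical — which is exactly what retrieval needs.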

Why offer Q5_K_M? It's 26% smaller than Q8_0 (113 MB vs 153 MB). In our benchmarks, it drops MRR@5 by about 2.7% (0.568 → 0.553). For some users, the size savings are worth it.

Why not Q4? Below Q5, we start seeing meaningful degradation in ranking quality. The trade-off isn't worth the 20-30 MB savings. Our spec draws the line at Q5_K_M.

Smoke Testing

Before shipping, we run three automated smoke tests:

#!/bin/bash
# Start llama-server
llama-server \
  --model models/ragtoolina-embed-v1-q8_0.gguf \
  --port 8384 --embeddings --pooling cls -ngl 99 \
  --ctx-size 8192 --ubatch-size 8192 --batch-size 8192 &

sleep 3

# Test 1: Health check
curl -s http://localhost:8384/health | jq .status
# Expected: "ok"

# Test 2: Single embedding returns 768 dimensions
DIMS=$(curl -s http://localhost:8384/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input": "test query", "model": "test"}' \
  | jq '.data[0].embedding | length')
echo "Dimensions: $DIMS"
# Expected: 768

# Test 3: Batch of 12 returns exactly 12 embeddings
COUNT=$(curl -s http://localhost:8384/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input": ["a","b","c","d","e","f","g","h","i","j","k","l"], "model": "test"}' \
  | jq '.data | length')
echo "Batch count: $COUNT"
# Expected: 12

These three tests catch the most common failure modes:

  1. Model doesn't load (wrong GGUF version, missing tensors, architecture mismatch)
  2. Wrong embedding dimensions (pooling misconfiguration, truncated output)
  3. Batch processing broken (context size too small, ubatch misconfigured)

Serving Configuration

The full llama-server command in production:

llama-server \
  --model ragtoolina-embed-v1-q8_0.gguf \
  --port 8384 \
  --embeddings \
  --pooling cls \
  -ngl 99 \
  --ctx-size 8192 \
  --ubatch-size 8192 \
  --batch-size 8192

Let's break down each flag:

--embeddings

Enables the /v1/embeddings endpoint. Without this, llama-server only exposes /v1/completions for text generation. The embeddings endpoint accepts an array of strings and returns an array of float vectors — OpenAI-compatible format.

--pooling cls

This is the critical flag. ModernBERT uses CLS token pooling — the first token's hidden state becomes the embedding. The default pooling in llama-server is mean pooling (average all token hidden states), which is correct for nomic-embed but wrong for our model.

If you set this wrong, your embeddings will still be 768-dimensional vectors. They'll just be bad. No error, no warning — just degraded search quality. We caught this during initial benchmarking when the vanilla gte-modernbert scored lower than expected with mean pooling.

-ngl 99

Offload 99 layers to the GPU. Since our model only has ~24 layers, this effectively means "put everything on GPU." On Apple Silicon, this uses Metal. The model runs entirely on the GPU's unified memory.

--ctx-size 8192

Maximum sequence length. Matches the model's trained context window. In practice, our code chunks rarely exceed 500 tokens, so this is generous headroom.

--ubatch-size 8192 and --batch-size 8192

These control internal batching. batch-size is the maximum number of tokens processed per batch. ubatch-size is the micro-batch size for the attention computation. Setting both to 8192 allows the server to process our 12-chunk batches (each chunk ~100-500 tokens) in a single pass.
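The arithmetic behind that choice can be sketched with a hypothetical batch planner (`plan_batches` is our illustration, not app code): greedily pack chunks until the next one would blow the token budget. At 12 chunks of at most ~500 tokens each (6,000 tokens), everything lands in a single pass under the 8,192 budget.

```python
def plan_batches(chunk_token_counts, batch_budget=8192):
    """Greedily group chunk indices so each batch stays within the token budget."""
    batches, current, used = [], [], 0
    for i, n in enumerate(chunk_token_counts):
        if current and used + n > batch_budget:
            batches.append(current)  # flush: next chunk would exceed the budget
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches
```

For example, twelve 500-token chunks plan into one batch; twenty would need two.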

Integration with the Swift App

The Ragtoolina Swift app manages llama-server as a child process:

// LlamaServerManager.swift
let process = Process()
process.executableURL = Bundle.main.url(forResource: "llama-server", withExtension: nil)
process.arguments = [
    "--model", modelPath,
    "--port", "8384",
    "--embeddings",
    "--pooling", "cls",
    "-ngl", "99",
    "--ctx-size", "8192",
    "--ubatch-size", "8192",
    "--batch-size", "8192"
]
try process.run()  // launch the bundled llama-server as a child process

Embedding requests go through LocalLlamaEmbedding.swift, which sends POST requests to http://localhost:8384/v1/embeddings in batches of 12 chunks:

let body = ["input": chunks, "model": "ragtoolina-embed-v1"]
// POST to http://localhost:8384/v1/embeddings

No query/document prefixes. The old nomic integration required prepending "search_query: " or "search_document: " to every input. With gte-modernbert, the text goes in as-is. One fewer string concatenation per chunk, one fewer potential bug.
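In Python terms (for illustration only — the real client is Swift), building the request bodies is just slicing into batches of 12 and serializing, with the chunk text passed through verbatim:

```python
import json

def embedding_request_bodies(chunks, batch_size=12, model="ragtoolina-embed-v1"):
    """Yield OpenAI-compatible /v1/embeddings bodies, 12 chunks per request.

    No "search_query: "/"search_document: " prefixes -- gte-modernbert
    takes the text as-is."""
    for i in range(0, len(chunks), batch_size):
        yield json.dumps({"input": chunks[i:i + batch_size], "model": model})
```

Each yielded string is POSTed to `http://localhost:8384/v1/embeddings`; the server returns one embedding per input, in order.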

Performance

Measured on MacBook Pro M4 Pro, batch size 12:

| Metric | Value |
|--------|-------|
| Per-chunk latency | 2.2ms |
| Batch (12 chunks) | 26ms |
| Batch min | 24.3ms |
| Batch max | 29.0ms |
| Cold start (model load) | ~2 seconds |

For comparison, nomic-embed-text-v1.5 at Q8_0: 1.9ms per chunk. Our model is 16% slower — a consequence of the slightly larger parameter count (149M vs 137M) and the ModernBERT architecture having more compute per layer (GeGLU has 3 weight matrices vs GELU's 2). At 2.2ms per chunk, indexing a 10,000-file project takes ~22 seconds for the embedding step alone.
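As a back-of-envelope check (the roughly-one-chunk-per-file ratio is our assumption, not a measured figure):

```python
# Amortized per-chunk latency from the measured 12-chunk batch.
per_chunk_ms = 26 / 12          # ~2.17 ms per chunk
chunks = 10_000                 # assume ~1 chunk per file for a 10,000-file project
total_s = chunks * per_chunk_ms / 1000   # ~21.7 s for the embedding step
```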

The target in our spec was ≤50ms per chunk. We're at 2.2ms — well within bounds.

A Note on the CLS vs Mean Pooling Decision

This is worth emphasizing because it's the most common mistake when deploying embedding models via llama.cpp.

Different embedding models use different pooling strategies:

| Model | Pooling | llama-server flag |
|-------|---------|-------------------|
| nomic-embed-text-v1.5 | Mean | --pooling mean (default) |
| gte-modernbert-base | CLS | --pooling cls |
| bge-base-en-v1.5 | CLS | --pooling cls |
| jina-embeddings-v2 | Mean | --pooling mean |

There is no runtime error if you use the wrong pooling. The server happily produces 768-dimensional vectors either way. But the vectors are meaningfully different:

  • CLS pooling: Takes the hidden state of the [CLS] token (position 0). The model was trained to put a sentence-level summary in this position.
  • Mean pooling: Averages the hidden states of all tokens. More robust for models that weren't trained with a dedicated CLS objective.

If you use mean pooling on a CLS-trained model, you're averaging the summary token with all the content tokens, diluting the learned representation. Your MRR drops by 10-20% and you don't know why.

Always check the model card for the correct pooling strategy.
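The two strategies are simple enough to spell out on a toy example — same hidden states, visibly different output vectors, which is why the wrong flag degrades quality without ever raising an error:

```python
def cls_pool(hidden_states):
    """CLS pooling: the embedding is the first token's hidden state."""
    return hidden_states[0]

def mean_pool(hidden_states):
    """Mean pooling: average every token's hidden state, dimension-wise."""
    dim = len(hidden_states[0])
    n = len(hidden_states)
    return [sum(tok[d] for tok in hidden_states) / n for d in range(dim)]

# Toy sequence: 4 tokens with 3-dimensional hidden states.
H = [
    [1.0, 0.0, 0.0],   # [CLS] -- trained to hold the sentence-level summary
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
]
```

Here `cls_pool(H)` returns the summary token untouched, while `mean_pool(H)` blends it with every content token — a different vector entirely.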


Next: Part 5 — Benchmarking and Results