Part 5: Benchmarking and Results
Why Build a Custom Benchmark?
Standard embedding benchmarks like MTEB and CoIR are useful for model selection — we used CoIR scores to choose gte-modernbert-base in Part 1. But they measure generic code retrieval performance across thousands of diverse queries. They don't tell you how the model performs on your specific retrieval task.
Our task is specific: developers search a single codebase using short natural language queries (or class/function names), and expect the right file to appear in the top 5 results. The corpus is typically 50–500 code chunks from one project. The queries range from exact symbol lookups ("LlamaServerManager") to fuzzy semantic queries ("how to start the embedding server").
No public benchmark tests this. So we built our own.
The Benchmark Design
Corpus: 97 Code Chunks
We indexed the Ragtoolina Swift project — 97 code chunks extracted by our production TreeSitter → ChunkingStrategy pipeline. These are the same chunks a real user would search through.
The chunks come from files like:
- LlamaServerManager.swift — manages the llama.cpp server process
- VectorStore.swift — Qdrant vector database client
- TreeSitterParser.swift — source code AST parsing
- EmbeddingService.swift — embedding request orchestration
- MCPServerManager.swift — Model Context Protocol server
- Plus 90+ more files covering UI, models, utilities, API clients
Queries: 29 Hand-Crafted Test Cases
We wrote 29 queries across 5 categories, each with one or more expected file matches:
Exact name (8 queries): Direct symbol lookups.
{"query": "LlamaServerManager", "expected": ["LlamaServerManager.swift"]}
{"query": "ChunkingStrategy", "expected": ["ChunkingStrategy.swift"]}
{"query": "VectorStore", "expected": ["VectorStore.swift"]}
{"query": "RAGChunk", "expected": ["RAGChunk.swift"]}
Semantic (10 queries): Natural language descriptions.
{"query": "how to start the embedding server", "expected": ["LlamaServerManager.swift"]}
{"query": "split code into chunks for embedding", "expected": ["ChunkingStrategy.swift"]}
{"query": "parse source code into AST symbols", "expected": ["TreeSitterParser.swift"]}
{"query": "manage Qdrant vector database connection", "expected": ["VectorStore.swift"]}
Cross-language (3 queries): Language-agnostic concepts that map to specific implementations.
{"query": "embedding dimension configuration", "expected": ["LocalLlamaEmbedding.swift", "Config.swift"]}
{"query": "vector similarity search", "expected": ["VectorStore.swift"]}
{"query": "batch embedding processing", "expected": ["LocalLlamaEmbedding.swift"]}
Code pattern (5 queries): Structural patterns.
{"query": "async function that calls OpenAI compatible API", "expected": ["LocalLlamaEmbedding.swift"]}
{"query": "protocol with default implementation", "expected": ["VectorStoreProtocol.swift"]}
{"query": "Swift struct with Codable", "expected": ["RAGChunk.swift", "Config.swift"]}
{"query": "NestJS injectable service", "expected": ["embeddings.service.ts"]}
Error handling (3 queries): Resilience and error recovery.
{"query": "handle server crash and restart", "expected": ["LlamaServerManager.swift"]}
{"query": "timeout configuration for HTTP requests", "expected": ["LocalLlamaEmbedding.swift"]}
{"query": "retry logic for failed operations", "expected": ["EmbeddingService.swift"]}
Metrics
MRR@5 (Mean Reciprocal Rank at 5): Each query scores the reciprocal of the rank at which the expected file first appears: 1.0 if it's the 1st result, 0.5 if 2nd, 0.333 if 3rd, and 0.0 if it's not in the top 5. MRR@5 is the average of these per-query scores.
MRR measures how high the correct file ranks. A model with MRR=0.8 typically puts the right answer at position 1 or 2. A model with MRR=0.4 buries it at position 3–5 or misses it entirely.
Recall@5: What fraction of queries have the expected file anywhere in the top 5 results. Binary per query — either it's there or it isn't.
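Both metrics fall out of a ranked result list in a few lines. Here's a minimal sketch (function names and data shapes are illustrative, not our actual benchmark script):

```python
def mrr_at_k(ranked_files, expected, k=5):
    """Reciprocal rank of the first expected file within the top k, else 0."""
    for rank, f in enumerate(ranked_files[:k], start=1):
        if f in expected:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_files, expected, k=5):
    """Binary per query: 1.0 if any expected file appears in the top k."""
    return 1.0 if any(f in expected for f in ranked_files[:k]) else 0.0

def aggregate(per_query_rankings, per_query_expected, k=5):
    """Average both metrics across all queries to get MRR@5 and Recall@5."""
    n = len(per_query_rankings)
    mrr = sum(mrr_at_k(r, e, k) for r, e in zip(per_query_rankings, per_query_expected)) / n
    recall = sum(recall_at_k(r, e, k) for r, e in zip(per_query_rankings, per_query_expected)) / n
    return mrr, recall
```

Note that a query with multiple expected files (like the cross-language ones) scores on whichever expected file ranks first.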
The Harness
The benchmark script:
- Starts llama-server with the model under test
- Embeds all 97 corpus chunks
- For each of 29 queries, embeds the query and computes cosine similarity against all chunks
- Ranks by similarity, checks if expected files appear in top 5
- Records MRR, Recall, cosine scores, and per-query results
- Repeats embedding 20 times to measure latency statistics
Everything runs through the same llama-server endpoint used in production — we're benchmarking the real inference path, not a Python-only simulation.
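The core ranking step (embed, compare, rank) reduces to a cosine-similarity pass over the corpus. A simplified sketch, with plain lists standing in for the vectors returned by llama-server (names are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_chunks(query_vec, chunk_vecs, chunk_files, top_k=5):
    """Score every corpus chunk against the query, return (file, score) for the top k."""
    scored = sorted(
        zip(chunk_files, (cosine(query_vec, v) for v in chunk_vecs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:top_k]
```

With 97 chunks this brute-force scan is instant; no approximate-nearest-neighbor index is needed at benchmark scale.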
The Results
Head-to-Head Comparison
| Model | MRR@5 | Recall@5 | Avg Cosine | Speed |
|-------|-------|----------|------------|-------|
| nomic-embed-text-v1.5 Q8 | 0.442 | 69.0% | 0.624 | 1.9ms |
| gte-modernbert-base Q8 (vanilla) | 0.558 | 72.4% | 0.607 | 2.2ms |
| ragtoolina-embed-v1 Q8 | 0.568 | 69.0% | 0.465 | 2.2ms |
| ragtoolina-embed-v1 Q5_K_M | 0.553 | 69.0% | 0.453 | 2.3ms |
The headline number: ragtoolina-embed-v1 achieves 0.568 MRR@5, a +28.5% improvement over nomic-embed-text-v1.5's 0.442.
But let's look at the numbers more carefully.
Breaking Down the Gains
The improvement comes in two stages:
- Base model switch (nomic → vanilla gte-modernbert): MRR 0.442 → 0.558 (+26.2%)
- Fine-tuning (vanilla → ragtoolina-embed-v1): MRR 0.558 → 0.568 (+1.8%)
The base model choice contributed ~93% of the total MRR improvement. Fine-tuning added the remaining ~7%.
This might sound like fine-tuning barely mattered. But look at it from the other direction: fine-tuning moved MRR from 0.558 to 0.568 with only 477K training pairs and ~2 hours of compute. That's cheap incremental gain. There is a tradeoff, though: the fine-tuned model drops Recall@5 from 72.4% to 69.0%, a slight regression, while improving MRR. It ranks its top results more confidently, even if it occasionally drops a borderline file from the top 5.
Category-by-Category Analysis
| Category | ragtoolina MRR | nomic MRR | ragtoolina Recall | nomic Recall |
|----------|---------------|-----------|-------------------|--------------|
| Exact name | 0.875 | 0.667 | 87.5% | 87.5% |
| Semantic | 0.507 | 0.378 | 60.0% | 60.0% |
| Cross-language | 0.778 | 0.519 | 100% | 66.7% |
| Code pattern | 0.154 | 0.154 | 40.0% | 60.0% |
| Error handling | 0.435 | 0.380 | 66.7% | 100% |
Exact Name: The Star
Both models find the right file 87.5% of the time (7 out of 8 queries hit). But ragtoolina-embed-v1 ranks it #1 far more consistently — MRR 0.875 vs 0.667.
Concrete example — query "LlamaServerManager":
ragtoolina-embed-v1:
- LlamaServerManager.swift — 0.765 (correct, rank 1)
- MCPServerManager.swift — 0.557
- LSPServerManager.swift — 0.513
nomic-embed-text-v1.5:
- LlamaServerManager.swift — 0.854 (correct, rank 1)
- MCPServerManager.swift — 0.839
- LSPServerManager.swift — 0.812
Both get it right, but look at the score gaps. ragtoolina-embed-v1 has a 0.208 gap between rank 1 and rank 2. nomic has a 0.015 gap. Our model is far more decisive — it knows which file is the right answer with much higher confidence.
Cross-Language: Perfect Recall
The cross-language category tests whether the model can find Swift code from language-agnostic queries. ragtoolina-embed-v1 achieves 100% Recall@5 here (all 3 queries find their expected files). nomic only hits 66.7%.
Query "batch embedding processing" → expected LocalLlamaEmbedding.swift:
- ragtoolina: rank 1, score 0.414
- nomic: this file was also found but with less decisive ranking
The fine-tuning on synthetic Swift pairs likely helped here — the model learned what Swift embedding code looks like.
Code Pattern: The Weak Spot
Both models score poorly on code pattern queries (40% Recall, 0.154 MRR for ragtoolina). These queries ask about structural patterns like "Swift struct with Codable" or "NestJS injectable service."
Why? Because these queries describe how code is written, not what it does. Our training data is entirely (description, code) pairs — the descriptions are semantic ("what does this code do?"), not structural ("what pattern does this code use?").
Query "Swift struct with Codable" — expected RAGChunk.swift:
ragtoolina top 5:
- Project.swift — 0.519 (has Codable structs, but not the expected file)
- TeamProject.swift — 0.490
- SearchResult.swift — 0.443
- TeamIndex.swift — 0.439
- StorageMode.swift — 0.435
The model finds files with Codable structs — just not the one we wanted. All top 5 results are valid Codable structs. The benchmark marks this as a miss, but the model isn't entirely wrong — it just doesn't know which specific Codable struct we're looking for.
This is the strongest argument for hybrid search (vector + BM25). A BM25 search for "RAGChunk Codable" would find the exact file. Combining it with vector search using Reciprocal Rank Fusion would fix this category.
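Reciprocal Rank Fusion itself is simple: each document earns 1/(k + rank) from every ranked list it appears in, with k=60 as the commonly used constant. A sketch (illustrative, not our production code):

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse multiple ranked lists of doc IDs via Reciprocal Rank Fusion.

    Each document scores 1/(k + rank) per list it appears in; documents are
    returned sorted by their combined score, best first.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

In the "Swift struct with Codable" case, a BM25 list that puts RAGChunk.swift first would pull it above the semantically similar but wrong Codable files, because the fused score rewards appearing in both lists.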
Error Handling: A Regression
nomic achieves 100% Recall on error handling queries; ragtoolina drops to 66.7%. The missed query is "retry logic for failed operations" — expected EmbeddingService.swift, but our model ranks it at position 18.
This is a genuine regression from fine-tuning. The model's code-specific training may have narrowed its understanding of generic concepts like "retry logic." The base model generalized better here.
The Cosine Score Paradox
| Model | Avg Expected Cosine |
|-------|---------------------|
| nomic-embed-text-v1.5 | 0.624 |
| gte-modernbert-base (vanilla) | 0.607 |
| ragtoolina-embed-v1 | 0.465 |
The fine-tuned model has lower cosine similarity scores. This looks bad if you're comparing numbers, but it's expected and harmless.
Why this happens: CLS pooling produces embeddings in a different distribution than mean pooling. Contrastive fine-tuning further compresses the distribution — it pushes matching pairs closer and non-matching pairs further apart, but the absolute scale of similarities can shift.
Why it doesn't matter: Retrieval is based on ranking, not absolute scores. If the correct file has cosine 0.465 and the next-best has 0.380, the ranking is correct. You'd only have a problem if you hardcoded a similarity threshold (e.g., "only show results above 0.5"), which you shouldn't do for exactly this reason.
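A toy illustration of why only the ranking matters. Two models can produce very different score scales yet identical retrieval order, and a hardcoded threshold silently breaks on the lower-scale model (scores here are illustrative, loosely based on the numbers above):

```python
# Hypothetical scores for the same three files from two models.
model_a = {"LlamaServerManager.swift": 0.854,  # high-scale distribution
           "MCPServerManager.swift": 0.839,
           "LSPServerManager.swift": 0.812}
model_b = {"LlamaServerManager.swift": 0.465,  # low-scale distribution
           "MCPServerManager.swift": 0.380,
           "LSPServerManager.swift": 0.310}

rank_a = sorted(model_a, key=model_a.get, reverse=True)
rank_b = sorted(model_b, key=model_b.get, reverse=True)
# Identical retrieval order despite very different absolute scores.

# A hardcoded cutoff like "only show results above 0.5" filters out
# every result from model_b, including the correct one:
survivors_b = [f for f in model_b if model_b[f] >= 0.5]
```

Model A's top result clears the 0.5 cutoff; model B's equally correct top result does not, which is exactly why thresholds on raw cosine scores are fragile across models.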
Notable Failures
"MCPServer" → Miss
Expected: MCPServer.swift. Got: MCPServerManager.swift (rank 1, score 0.717).
This is a benchmark design issue, not a model issue. The indexed corpus contains MCPServerManager.swift but not MCPServer.swift (it wasn't in the indexed directory). Both nomic and ragtoolina miss this query — because the expected file doesn't exist in the corpus.
"overlap windows for code splitting" → Rank 29
Expected: ChunkingStrategy.swift (which implements overlapping window splitting). Got: WindowManager.swift (rank 1, score 0.514).
The model latched onto the word "windows" and found the UI window manager instead of the code chunking strategy. This is a classic semantic ambiguity — "windows" means something very different in a UI context vs a text processing context. BM25 wouldn't help here either; only a richer query would.
"health check endpoint" → Rank 27
Expected: LlamaServerManager.swift (which implements health checking for the llama server). Got: UpdateEndpoints.swift (rank 1, score 0.436).
The model found "endpoints" but missed the connection between "health check" and the llama server management code. The health check logic in LlamaServerManager.swift uses terms like checkServerStatus and isReady — not the phrase "health check endpoint." This is a vocabulary mismatch that better training data could address.
Quantization Impact
| Variant | MRR@5 | Recall@5 | Speed | Size |
|---------|-------|----------|-------|------|
| ragtoolina Q8_0 | 0.568 | 69.0% | 2.2ms | 153 MB |
| ragtoolina Q5_K_M | 0.553 | 69.0% | 2.3ms | 113 MB |
Q5_K_M drops MRR by 2.6% (0.568 → 0.553) while saving 26% disk space (153 → 113 MB). Recall is identical. The ranking order shifts very slightly — some borderline files swap positions — but the overall retrieval quality holds.
For a macOS app where disk space matters but isn't critical (153 MB is fine for most users), Q8_0 is the obvious default. Q5_K_M is there for users who want the smallest possible bundle.
Comparison Summary
MRR@5 improvement chain:
nomic-embed-text-v1.5 ████████████████████░░░░░░░░░ 0.442 (baseline)
↓ +26% (model switch)
gte-modernbert-base ██████████████████████████░░░ 0.558 (vanilla)
↓ +2% (fine-tune)
ragtoolina-embed-v1 ██████████████████████████▌░░ 0.568 (production)
The model switch from nomic to gte-modernbert provided the bulk of the improvement. Fine-tuning added incremental gains on top. Together, they deliver a +28.5% MRR improvement over the original baseline.
What the Numbers Don't Show
Benchmarks measure what you test. 29 queries over 97 chunks is enough to reveal patterns but not enough to catch every edge case. Things our benchmark doesn't measure:
- Very long files — our chunks cap at ~2000 characters. A 500-line file would be split into multiple chunks, and the benchmark only tests whether one of those chunks ranks highly.
- Multi-file queries — "show me all the MCP handlers" should return 5+ files. Our benchmark only checks if one expected file appears.
- Negative queries — "there is no authentication module" should return nothing relevant. We don't test for this.
- Non-English queries — many developers search in their native language. We only test English.
These gaps are acceptable for our current needs. As the user base grows and we collect real query patterns, we'll expand the benchmark.
Key Takeaways
1. The base model is 93% of the story. Switching from nomic to gte-modernbert improved MRR by 26%. Fine-tuning added 2%. Choose your base model carefully.
2. Code pattern queries are hard for pure vector search. Structural queries like "struct with Codable" need hybrid retrieval (vector + BM25).
3. Cosine similarity scores aren't comparable across models. Different architectures and pooling strategies produce different score distributions. Compare rankings, not numbers.
4. Q8_0 quantization is nearly lossless for embeddings. The ranking quality difference between FP16 and Q8_0 is negligible. Q5_K_M costs ~3% MRR for 26% smaller files.
5. Build domain-specific benchmarks. MTEB and CoIR are useful for shortlisting. Your actual retrieval task is what matters.
This post is Part 5 of the ragtoolina-embed-v1 series. See also: Part 1 — Picking the Base Model | Part 2 — Building the Training Dataset | Part 3 — Training on a Rented GPU | Part 4 — GGUF Conversion and Serving