Part 1: Picking the Right Base Model for Code Embeddings
Why We Needed a New Model
Ragtoolina is a macOS menu bar app that indexes your entire codebase locally. No cloud, no API keys. The pipeline works like this:
Source files → TreeSitter parser → Semantic symbols → Chunks → Embedding → Qdrant
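The hand-off into the last stages can be sketched as follows (illustrative Python only; the real app is Swift, and `Chunk` / `build_chunks` are hypothetical names, not the actual API):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    path: str     # source file the chunk came from
    symbol: str   # parsed symbol name (function, class, ...)
    text: str     # the text that actually gets embedded

def build_chunks(symbols):
    """Turn parsed symbols into embeddable chunks with metadata.
    `symbols` is a list of (path, name, source) tuples."""
    return [Chunk(path=p, symbol=n, text=f"{p} {n}\n{src}")
            for p, n, src in symbols]

# Downstream, each chunk.text goes to the embedding server and the
# resulting vector is upserted into Qdrant with {path, symbol} as payload.
chunks = build_chunks([
    ("LlamaServerManager.swift", "startServer", "func startServer() { ... }"),
])
```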
We ship a GGUF embedding model inside the .app bundle and run it through llama-server with Metal GPU acceleration. Up until now, that model was nomic-embed-text-v1.5 — a fine general-purpose embedding model. It handled most queries decently, but it wasn't trained with code retrieval in mind.
The pain point was subtle but consistent. A query like "how to start the embedding server" would return EmbeddingService.swift and LocalLlamaEmbedding.swift — related files, but not the file that actually starts the server (LlamaServerManager.swift). The model understood the words but missed the code-specific semantics.
We decided to evaluate alternatives, with the option to fine-tune whichever base model performed best.
The Constraints
Before looking at models, we locked in the non-negotiable requirements:
| Constraint | Why |
|-----------|-----|
| GGUF format | We serve through llama-server (llama.cpp). No Python, no ONNX, no CoreML. |
| 768 dimensions | Our existing Qdrant collections use 768d. Changing dimensions means re-indexing every user's projects. |
| ≤ 300 MB (quantized) | Ships inside a macOS .app bundle. Users download it once. |
| 8192 context tokens | Code chunks are 30–500 tokens, but some symbol metadata pushes the total higher. We need headroom. |
| No mandatory prefixes (nice-to-have) | nomic requires search_query: and search_document: on every request. It works but adds complexity. |
| Apache 2.0 or equivalent | We're shipping it commercially. No GPL, no restrictive academic licenses. |
The 768-dimension constraint was the tightest filter. Many newer models use 1024d or 1536d — great for benchmarks, unusable for us without a full Qdrant migration.
The Candidates
We shortlisted five models, all encoder-only transformers used as bi-encoders for retrieval, all 768-dimensional:
1. nomic-embed-text-v1.5 (incumbent)
- Parameters: 137M
- Architecture: nomic-bert (classic BERT with rotary embeddings)
- CoIR (code retrieval NDCG@10): 71.2
- MTEB average: 62.28
- Prefixes: Required (search_query: / search_document:)
- GGUF: Native support, battle-tested
- License: Apache 2.0
The known quantity. Works reliably, with well-documented GGUF support, but its code retrieval score is second-lowest of the shortlist.
2. Alibaba-NLP/gte-modernbert-base
- Parameters: 149M
- Architecture: ModernBERT (RoPE + GeGLU + Flash Attention)
- CoIR: 79.31
- MTEB average: 64.38
- Prefixes: None needed (CLS pooling)
- GGUF: Supported since llama.cpp PR #15641 (Dec 22, 2025)
- License: Apache 2.0
The highest CoIR score by a significant margin. ModernBERT is a 2024 architecture that replaces BERT's absolute positional encodings with RoPE (Rotary Position Embeddings), uses GeGLU activations instead of GELU, and supports Flash Attention. It's a more modern foundation that extrapolates better to longer sequences.
3. snowflake-arctic-embed-m
- Parameters: 109M
- Architecture: BERT-base
- CoIR: 74.8
- MTEB average: 62.74
- Prefixes: Query prefix required
- GGUF: Supported
- License: Apache 2.0
Small and fast, with a good CoIR score for its size, but the classic BERT architecture limits our fine-tuning headroom, and its 512-token context falls well short of our 8192-token requirement.
4. BAAI/bge-base-en-v1.5
- Parameters: 109M
- Architecture: BERT-base
- CoIR: 68.5
- MTEB average: 63.55
- Prefixes: Query prefix required
- GGUF: Supported
- License: MIT
A solid workhorse from BAAI. Strong on general retrieval tasks (high MTEB average), but the lowest CoIR score. Code isn't its strength.
5. jina-embeddings-v2-base-code
- Parameters: 137M
- Architecture: JinaBERT (ALiBi attention)
- CoIR: 76.4
- MTEB average: 60.12
- Prefixes: None needed
- GGUF: Uncertain (ALiBi attention has had llama.cpp compatibility issues)
- License: Apache 2.0
Purpose-built for code. Trained on code-specific data including StackOverflow and GitHub. The CoIR score is competitive, but the GGUF conversion was uncertain at the time — ALiBi attention isn't as well-supported as RoPE in the llama.cpp ecosystem.
The Decision Matrix
| Model | CoIR | MTEB | Dims | Context | Prefixes | GGUF | License |
|-------|------|------|------|---------|----------|------|---------|
| nomic-embed-text-v1.5 | 71.2 | 62.28 | 768 | 8192 | Required | Stable | Apache |
| gte-modernbert-base | 79.31 | 64.38 | 768 | 8192 | None | Stable | Apache |
| snowflake-arctic-embed-m | 74.8 | 62.74 | 768 | 512 | Required | Stable | Apache |
| bge-base-en-v1.5 | 68.5 | 63.55 | 768 | 512 | Required | Stable | MIT |
| jina-v2-base-code | 76.4 | 60.12 | 768 | 8192 | None | Risky | Apache |
Why gte-modernbert-base Won
1. Best-in-class Code Retrieval
CoIR 79.31 vs the next best (jina at 76.4). That's a meaningful gap on a benchmark where models are tightly clustered. CoIR measures NDCG@10 across code retrieval tasks — exactly what we're building.
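For reference, binary-relevance NDCG@k reduces to a few lines (a sketch, not CoIR's actual evaluation code):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: DCG of the returned ranking divided
    by the DCG of an ideal ranking with all relevant docs ranked first."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

score = ndcg_at_k(["a", "b", "c"], {"b"})    # relevant doc at rank 2
perfect = ndcg_at_k(["b", "a", "c"], {"b"})  # relevant doc at rank 1
```

Rank position matters: the same single hit scores 1.0 at rank 1 but only about 0.63 at rank 2, which is why NDCG@10 is a good proxy for "did the right file come back near the top".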
2. Zero Prefixes
This sounds minor but matters in practice. With nomic, every embedding request requires prepending search_query: or search_document: to the input text. That's extra string concatenation on every chunk during indexing and every query at search time. It's also a source of bugs — forget the prefix and your search quality degrades silently.
gte-modernbert uses CLS token pooling and was trained so that queries and documents share the same input format; no prefix tokens are needed. One fewer moving part.
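Concretely, the difference shows up when building the request body for llama-server's OpenAI-compatible /v1/embeddings endpoint (a sketch; `embed_request` is a hypothetical helper, not our actual client code):

```python
def embed_request(text, model="nomic", kind="query"):
    """Build the JSON body for an OpenAI-style /v1/embeddings call.
    nomic needs a task prefix on every input; gte-modernbert takes raw text."""
    if model == "nomic":
        prefix = "search_query: " if kind == "query" else "search_document: "
        text = prefix + text
    return {"input": text}

nomic_body = embed_request("how to start the embedding server", model="nomic")
gte_body = embed_request("how to start the embedding server", model="gte")
```

With nomic, every call site has to get `kind` right; with gte-modernbert the branch disappears entirely.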
3. ModernBERT Architecture
RoPE positional encodings mean the model genuinely understands sequence positions up to 8192 tokens, rather than interpolating from a shorter training length. GeGLU activations have been shown to improve representation quality in both language models and embedding models. These aren't just architectural niceties — they give us better headroom for fine-tuning.
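Both pieces are small enough to sketch. A minimal NumPy illustration of RoPE's pairwise rotation and the GeGLU gate (illustrative only, not ModernBERT's actual implementation):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding: rotate each consecutive feature pair by a
    position-dependent angle. Relative offsets live in the rotation itself,
    which is why RoPE extrapolates to longer sequences better than learned
    absolute position tables."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # (d/2,)
    angles = np.asarray(positions)[:, None] * inv_freq  # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu(x, W_gate, W_value):
    """GeGLU feed-forward gate: GELU(x @ W_gate) * (x @ W_value).
    The learned gate modulates which features pass through, versus a
    plain GELU(x @ W) feed-forward layer."""
    return gelu(x @ W_gate) * (x @ W_value)

q = np.ones((4, 8))
q_rot = rope(q, positions=range(4))  # position 0 gets angle 0: identity

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 8))
ff = geglu(h, rng.standard_normal((8, 16)), rng.standard_normal((8, 16)))
```

Note that the rotation is norm-preserving, so RoPE changes only how positions interact in attention, not the magnitude of the features.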
4. Identical Infrastructure Footprint
Same dimensions (768), same context window (8192), same GGUF support, same approximate file size after quantization. Switching from nomic to gte-modernbert requires changing a filename and a --pooling flag. No Qdrant schema migration, no client-side changes.
5. Confirmed GGUF Support
The ModernBERT GGUF conversion was merged into llama.cpp on December 22, 2025 (PR #15641). We verified it ourselves before committing to the model — built llama.cpp from source, converted the model, ran smoke tests (health check, embedding dimension check, batch test). Everything passed.
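The dimension and batch checks reduce to validating the response shape. A sketch against a canned response (`check_embedding_response` is hypothetical; the field names follow the OpenAI-compatible format llama-server emits):

```python
def check_embedding_response(resp, expected_dim=768):
    """Validate an OpenAI-style /v1/embeddings response: every vector
    present, correct dimension, all values numeric. Returns batch size."""
    vecs = [item["embedding"] for item in resp["data"]]
    assert vecs, "empty batch"
    assert all(len(v) == expected_dim for v in vecs), "dimension mismatch"
    assert all(isinstance(x, float) for v in vecs for x in v), "non-float value"
    return len(vecs)

# Canned two-item batch; a live smoke test would POST real inputs first.
fake = {"data": [{"embedding": [0.0] * 768}, {"embedding": [0.1] * 768}]}
n = check_embedding_response(fake)
```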
The Fallback Plan
Our fallback was CodeRankEmbed — a 137M parameter model using the nomic-bert architecture (guaranteed GGUF compatibility) but pre-trained specifically on code retrieval. If gte-modernbert had GGUF issues, we'd fall back to CodeRankEmbed with confidence in the conversion pipeline.
We didn't need the fallback. The conversion worked first try.
Validating the Choice
Before any fine-tuning, we ran our custom benchmark (29 queries, 97 code chunks, 5 categories) on both the vanilla gte-modernbert-base and nomic-embed-text-v1.5, both quantized to Q8_0:
| Model | MRR@5 | Recall@5 |
|-------|-------|----------|
| nomic-embed-text-v1.5 Q8 | 0.442 | 69.0% |
| gte-modernbert-base Q8 (vanilla, no fine-tune) | 0.558 | 72.4% |
A +26% relative improvement in MRR from the base model swap alone, with zero fine-tuning. This confirmed two things:
- The model choice matters more than the fine-tune. Getting the right foundation gives you most of the gains.
- There's still room to improve with targeted training data (spoiler: the fine-tune adds another +2% MRR).
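For reference, the two metrics in the table reduce to a few lines, assuming one relevant chunk per query (a sketch, not our actual harness):

```python
def mrr_and_recall_at_k(results, k=5):
    """results: list of (ranked_ids, relevant_id) pairs, one per query.
    MRR@k averages 1/rank of the relevant hit (0 if outside the top k);
    Recall@k is the fraction of queries whose relevant chunk appears at all."""
    rr, hits = [], 0
    for ranked, relevant in results:
        top = ranked[:k]
        if relevant in top:
            rr.append(1.0 / (top.index(relevant) + 1))
            hits += 1
        else:
            rr.append(0.0)
    return sum(rr) / len(rr), hits / len(results)

mrr, recall = mrr_and_recall_at_k([
    (["a", "b", "c"], "b"),  # hit at rank 2: reciprocal rank 0.5
    (["x", "y", "z"], "q"),  # miss: reciprocal rank 0
])
```

MRR rewards putting the right chunk first, which is why it separated the two models more sharply than Recall@5 did.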
Key Takeaway
If you're building a code retrieval system in early 2026, gte-modernbert-base is the model to beat in the 768d / ≤150M parameter class. The ModernBERT architecture, high CoIR score, no-prefix design, and confirmed GGUF support make it an obvious choice for local-first applications.
The real lesson, though, is to benchmark against your own data before deciding. CoIR scores are a useful filter, but your specific retrieval task — the kinds of queries your users write, the structure of the code they search — is what ultimately matters. We benchmarked on 29 queries from our actual product, and the results matched the CoIR rankings almost perfectly.