Fine-Tuning a Code Embedding Model That Runs Entirely on Your Mac
The Problem
Ragtoolina is a macOS menu bar app that semantically indexes your codebase locally — zero cloud, zero API keys. It parses your project with TreeSitter, chunks code into symbols, embeds them with a local model via llama-server (llama.cpp with Metal GPU), and stores vectors in a local Qdrant instance. Claude, Cursor, or any MCP client can then search your code semantically.
We started with nomic-embed-text-v1.5, a solid general-purpose embedding model. It worked, but "general-purpose" is the key phrase — it was never optimized for code retrieval. When a developer asks "how to start the embedding server," the model needs to understand that LlamaServerManager.swift is the answer, not UpdateService.swift.
We wanted to see how much a targeted fine-tune could improve code search quality while staying within our constraints: GGUF format, 768 dimensions, ≤300 MB, and fast enough to index thousands of files without the user noticing.
Picking the Base Model
We evaluated five candidates:
| Model | Params | CoIR (code NDCG@10) | Dims | Notes |
|-------|--------|---------------------|------|-------|
| gte-modernbert-base | 149M | 79.31 | 768 | ModernBERT arch, no prefixes needed |
| nomic-embed-text-v1.5 | 137M | 71.2 | 768 | Our incumbent |
| snowflake-arctic-embed-m | 109M | 74.8 | 768 | Good but smaller |
| bge-base-en-v1.5 | 109M | 68.5 | 768 | Older BERT arch |
| jina-embeddings-v2-base-code | 137M | 76.4 | 768 | Code-specific but needs prefixes |
Alibaba-NLP/gte-modernbert-base won on three fronts:
- Highest CoIR score — already the best at code retrieval before any fine-tuning
- ModernBERT architecture — RoPE positional encodings and GeGLU activations, a newer and more efficient foundation than classic BERT
- No query/document prefixes required — nomic needs `search_query:`/`search_document:` prefixes on every call; gte-modernbert uses CLS pooling and works without them, simplifying our integration
The critical requirement was GGUF support. ModernBERT conversion was merged into llama.cpp in PR #15641 on Dec 22, 2025 — just in time for our project.
The Training Data
CodeSearchNet (~475K pairs)
The backbone of our training data is CodeSearchNet — 2M+ (docstring, function_body) pairs across six languages: Python, JavaScript, Go, Java, Ruby, and PHP. We kept ~475K pairs after filtering:
- Both docstring and code present
- Docstring: 10–500 characters (no empty stubs, no novels)
- Code: 50–2000 characters (no one-liners, no megafunctions)
One gotcha: the HuggingFace datasets library dropped the CodeSearchNet loading script in v4.x. Our prepare_data.py downloads the raw .zip archives via hf_hub_download and parses the .jsonl.gz shards directly.
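The workaround can be sketched roughly as follows — the dataset repo id, archive paths, and record keys (`docstring`, `code`) are assumptions about the raw CodeSearchNet layout, and the function names are ours, not the actual `prepare_data.py`:

```python
import gzip
import json
import zipfile


def keep_pair(docstring: str, code: str) -> bool:
    # The filtering criteria listed above: both fields present,
    # docstring 10-500 chars, code 50-2000 chars.
    return (
        bool(docstring)
        and bool(code)
        and 10 <= len(docstring) <= 500
        and 50 <= len(code) <= 2000
    )


def load_language(lang: str) -> list[dict]:
    # Imported lazily so keep_pair stays usable without the hub dependency.
    from huggingface_hub import hf_hub_download

    # Download the raw archive directly instead of relying on the
    # (removed) datasets loading script.
    archive = hf_hub_download(
        repo_id="code_search_net",
        filename=f"data/{lang}.zip",
        repo_type="dataset",
    )
    pairs = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.endswith(".jsonl.gz"):
                continue
            with zf.open(name) as member, gzip.open(member, "rt") as lines:
                for line in lines:
                    rec = json.loads(line)
                    doc, code = rec.get("docstring", ""), rec.get("code", "")
                    if keep_pair(doc, code):
                        pairs.append({"query": doc, "code": code})
    return pairs
```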
Synthetic Pairs (~2,400 pairs)
CodeSearchNet covers Python, JavaScript, Go, Java, Ruby, and PHP. It does not cover TypeScript or Swift — the two languages Ragtoolina users primarily work with. So we generated synthetic training pairs from our own production codebase.
The approach is entirely template-based (no LLM API needed):
- Walk the source directories of the Ragtoolina Swift app and TypeScript backend
- Split files into chunks by regex-matching function/class/struct/protocol boundaries
- Generate queries from four sources:
  - Filenames: `EmbeddingService.swift` → "how to embedding service"
  - Function names: `func parseSymbols()` → "parse symbols"
  - Type names: `struct RAGChunk` → "r a g chunk implementation"
  - Doc comments: `/// Splits source code into semantic chunks` → used directly as a query
- Deduplicate by query text
This yielded 2,385 unique (query, code) pairs. It's a small fraction of the total training data, but it teaches the model what Swift and TypeScript code looks like — something 475K Python/JavaScript pairs alone can't do.
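The identifier-to-query step is a small heuristic; a sketch (the function name is ours, not the project's actual code) that reproduces the examples above, including runs of capitals like `RAG` decomposing into single letters:

```python
import re


def identifier_to_query(name: str) -> str:
    # Split camelCase/PascalCase into lowercase words; a run of capitals
    # ("RAG") falls apart into single letters, as in the examples above.
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]", name)
    return " ".join(part.lower() for part in parts)
```

Here `identifier_to_query("parseSymbols")` yields `"parse symbols"`, and `identifier_to_query("RAGChunk")` yields `"r a g chunk"`.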
Training Setup
Framework
- sentence-transformers v3.0+ with `SentenceTransformerTrainer`
- Loss: `MultipleNegativesRankingLoss` (MNR) — the standard approach for bi-encoder contrastive learning. For each (query, positive_code) pair in a batch, all other code snippets become negatives. No manually curated hard negatives needed.
- Evaluation: `InformationRetrievalEvaluator` tracking MRR@5, MRR@10, NDCG@5, NDCG@10, and Accuracy@1/5, evaluated every 200 steps. Best checkpoint selected by `cosine_ndcg@10`.
Hyperparameters
```yaml
base_model: Alibaba-NLP/gte-modernbert-base
epochs: 3
batch_size: 16  # overridden to 64 on A100
learning_rate: 2e-5
warmup_ratio: 0.05
max_grad_norm: 1.0
bf16: true
```
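Wired together with sentence-transformers, the setup above amounts to something like the following sketch (file paths, JSON keys, and the `load_pairs` helper are assumptions; the evaluator callback is omitted for brevity):

```python
import json


def load_pairs(path: str) -> dict:
    # MNR loss expects (anchor, positive) columns; every other positive
    # in a batch acts as an in-batch negative.
    with open(path) as f:
        pairs = json.load(f)
    return {
        "anchor": [p["query"] for p in pairs],
        "positive": [p["code"] for p in pairs],
    }


def train(data_path: str, output_dir: str = "models/ragtoolina-embed-v1-hf") -> None:
    # Heavy deps imported here so load_pairs works without the GPU stack.
    from datasets import Dataset
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import MultipleNegativesRankingLoss

    model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=64,  # 16 locally, 64 on the A100
        learning_rate=2e-5,
        warmup_ratio=0.05,
        max_grad_norm=1.0,
        bf16=True,
    )
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=Dataset.from_dict(load_pairs(data_path)),
        loss=MultipleNegativesRankingLoss(model),
    )
    trainer.train()
    model.save_pretrained(output_dir)


# train("data/train_pairs.json")  # run on the GPU box
```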
Running on a Rented GPU
The model has 149M parameters and the dataset is ~477K pairs. Training on a MacBook M4 Pro (MPS backend) would take 4–6 hours. We rented an NVIDIA A100 on Lambda Labs instead.
The workflow was a single shell script (training/cloud/run_on_lambda.sh):
1. SSH into the Lambda instance
2. Upload train.py, config.yaml, and pre-prepared JSON data files
3. Install dependencies (torch, sentence-transformers, etc.)
4. Patch config: batch_size 16→64, device mps→cuda
5. Launch training with nohup
6. When done, SCP the model back
Total training time: ~1–2 hours on a single A100. The entire Lambda session (including setup, upload, training, and download) cost about the price of a coffee.
One practical tip: we pre-prepared the data locally (prepare_data.py downloads and filters CodeSearchNet, generate_synthetic.py creates the Swift/TS pairs) and uploaded the resulting JSON files to the GPU instance. This avoids installing HuggingFace datasets and waiting for CodeSearchNet to download on the remote machine.
GGUF Conversion
After training, the model comes back as HuggingFace safetensors. We need GGUF for llama.cpp.
```sh
# Step 1: safetensors → GGUF FP16 (286 MB)
python3 vendor/llama.cpp/convert_hf_to_gguf.py \
    models/ragtoolina-embed-v1-hf \
    --outfile models/ragtoolina-embed-v1-f16.gguf \
    --outtype f16

# Step 2: FP16 → Q8_0 (153 MB) — production
llama-quantize f16.gguf q8_0.gguf Q8_0

# Step 3: FP16 → Q5_K_M (113 MB) — lightweight
llama-quantize f16.gguf q5km.gguf Q5_K_M
```
Three quantization levels:
| Variant | Size | Use case |
|---------|------|----------|
| FP16 | 286 MB | Reference, lossless |
| Q8_0 | 153 MB | Production — ships in the .app bundle |
| Q5_K_M | 113 MB | Lightweight option, minimal quality loss |
For embedding models, Q8_0 preserves virtually all ranking quality. Below Q5 you start to see degradation. Q5_K_M is our safety net for users with limited disk space.
Serving with llama.cpp
The model runs inside the Ragtoolina .app bundle via llama-server:
```sh
llama-server \
    --model ragtoolina-embed-v1-q8_0.gguf \
    --port 8384 \
    --embeddings \
    --pooling cls \
    -ngl 99 \
    --ctx-size 8192 \
    --ubatch-size 8192 \
    --batch-size 8192
```
Key flags:
- `--embeddings` — expose a `/v1/embeddings` endpoint (OpenAI-compatible)
- `--pooling cls` — ModernBERT uses CLS token pooling (not mean pooling like nomic)
- `-ngl 99` — offload all layers to the Metal GPU
- Context/batch at 8192 — matches the model's max sequence length
The Swift app spawns llama-server as a child process, sends batches of 12 code chunks at a time, and gets back 768-dimensional vectors at 2.2ms per chunk on Apple Silicon.
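From any client the round trip is plain HTTP. A minimal Python sketch against the OpenAI-compatible endpoint (the `model` field is included for compatibility; whether llama-server consults it is an assumption on our part):

```python
import json
import urllib.request


def build_payload(chunks: list[str]) -> dict:
    # Standard OpenAI-style embeddings request body.
    return {"input": chunks, "model": "ragtoolina-embed-v1"}


def embed(chunks: list[str], base_url: str = "http://127.0.0.1:8384") -> list[list[float]]:
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(build_payload(chunks)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Response shape: {"data": [{"embedding": [...], "index": 0}, ...]}
    return [d["embedding"] for d in sorted(body["data"], key=lambda d: d["index"])]
```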
Benchmarks
We built a custom benchmark: 29 hand-crafted queries across 5 categories, searched against 97 real code chunks from the Ragtoolina Swift project. Categories: exact name lookups ("LlamaServerManager"), semantic queries ("split code into chunks for embedding"), cross-language queries, code pattern matching, and error handling.
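For reference, the two headline metrics are simple to compute per query; a sketch of how we interpret them (helper names are ours, not the benchmark script's — averaging each over the 29 queries gives the reported numbers):

```python
def mrr_at_k(ranked_ids: list[str], relevant_id: str, k: int = 5) -> float:
    # Reciprocal rank of the first relevant hit within the top k, else 0.
    for rank, chunk_id in enumerate(ranked_ids[:k], start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: list[str], relevant_id: str, k: int = 5) -> float:
    # 1 if the relevant chunk appears anywhere in the top k, else 0.
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0
```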
Overall Results
| Model | MRR@5 | Recall@5 | Size | Speed |
|-------|-------|----------|------|-------|
| nomic-embed-text-v1.5 Q8 | 0.442 | 69.0% | 153 MB | 1.9ms |
| gte-modernbert-base Q8 (vanilla) | 0.558 | 72.4% | 153 MB | 2.2ms |
| ragtoolina-embed-v1 Q8 | 0.568 | 69.0% | 153 MB | 2.2ms |
| ragtoolina-embed-v1 Q5_K_M | 0.553 | 69.0% | 113 MB | 2.3ms |
+28.5% MRR improvement over nomic-embed-text-v1.5. The fine-tuned model not only finds the right files more often — it ranks them higher.
Category Breakdown
| Category | Recall@5 | MRR@5 |
|----------|----------|-------|
| Exact name lookup | 87.5% | 0.875 |
| Semantic search | 60.0% | 0.507 |
| Cross-language | 100% | 0.778 |
| Code patterns | 40.0% | 0.154 |
| Error handling | 66.7% | 0.435 |
Exact name matching is where the model shines — searching "LlamaServerManager" returns LlamaServerManager.swift as the #1 result with 0.765 cosine similarity. Both models get 87.5% recall here, but our fine-tuned model achieves 0.875 MRR vs nomic's 0.667, meaning it ranks the right file at position #1 far more consistently.
Cross-language is a perfect 100% recall. Queries like "vector similarity search" find the right Swift files even though the concept is language-agnostic.
Code patterns (40% recall) are the weak spot — queries like "Swift struct with Codable" or "NestJS injectable service" don't match well because the model was trained on semantic queries, not structural code patterns. We plan to address this with hybrid BM25 search.
A Note on Cosine Similarity Scores
The fine-tuned model reports lower average cosine similarity (0.465) compared to nomic (0.624). This looks alarming if you're comparing numbers, but it's expected and benign. CLS pooling + contrastive fine-tuning compresses the embedding distribution differently than mean pooling. What matters is the ranking order, not the absolute scores — and MRR proves our rankings are better.
What We Learned
1. The base model matters more than the fine-tune. The vanilla gte-modernbert-base (no fine-tuning) already beats nomic by +26% MRR. Our fine-tune added another +2% on top. Picking the right architecture and pre-training gives you 90% of the gains.
2. Synthetic data is cheap but valuable. 2,400 template-generated Swift/TypeScript pairs out of 477K total — that's 0.5% of the training data. Yet it's what teaches the model languages that CodeSearchNet doesn't cover. The template approach (regex-based chunk splitting + heuristic query generation) took an afternoon to build and costs nothing in API calls.
3. MNR loss is all you need for this scale. No hard negative mining, no curriculum learning, no multi-stage training. A single pass of MultipleNegativesRankingLoss over 477K pairs for 3 epochs was enough. The in-batch negatives from a batch size of 64 on A100 provide sufficient contrast.
4. Q8_0 quantization is practically lossless for embeddings. Comparing F16 → Q8_0, we see negligible MRR change. Q5_K_M drops ~2.7% MRR. For a ~26% size reduction (153 MB → 113 MB), that's acceptable as a lightweight option but not worth it as the default.
5. Renting a GPU is the right call for one-shot training. M4 Pro MPS works but would take 4–6 hours. A Lambda A100 costs a few dollars for 1–2 hours of training and lets you iterate faster with larger batch sizes (64 vs 16). The scripts to SSH in, upload data, train, and download results are ~60 lines of bash.
The Full Pipeline
```
CodeSearchNet (6 langs) ─┐
                         ├─ 477K (query, code) pairs
Synthetic (Swift + TS) ──┘
              │
              ▼
   sentence-transformers
   MNR loss, 3 epochs
   A100 GPU (~1–2 hours)
              │
              ▼
  HuggingFace safetensors
              │
              ▼
  llama.cpp convert_hf_to_gguf.py
              │
        ┌─────┴─────┐
        ▼           ▼
      Q8_0       Q5_K_M
     153 MB      113 MB
        │
        ▼
  llama-server (Metal GPU)
  --embeddings --pooling cls
        │
        ▼
  Ragtoolina.app
  2.2ms per chunk, 768d vectors
  stored in local Qdrant
```
What's Next
- Hybrid search: Combine vector similarity with BM25 full-text search using Qdrant's native RRF (Reciprocal Rank Fusion). This should fix the code_pattern category where pure semantic search struggles.
- More synthetic data: Use an LLM (Claude) to generate higher-quality (query, code) pairs instead of templates. The infrastructure is already there — our `generate_synthetic.py` has a placeholder for API-based generation.
- Incremental indexing: Symbol-level content hashing so re-indexing after a `git pull` only processes changed files.
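The RRF fusion planned for hybrid search is only a few lines on its own. Qdrant runs this server-side; the sketch below just shows the mechanics of the standard formula:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: score(d) = sum of 1 / (k + rank_i(d)) over
    # each ranking that contains d; k=60 is the conventional constant.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

A chunk that BM25 ranks #1 for a structural query like "Swift struct with Codable" gets pulled up in the fused list even when the vector side ranks it poorly, which is exactly the failure mode in the code_pattern category.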
ragtoolina-embed-v1 is a fine-tuned Alibaba-NLP/gte-modernbert-base (149M params, 768d, Apache 2.0) trained on CodeSearchNet + synthetic Swift/TypeScript pairs, served as a 153 MB GGUF via llama.cpp with Metal GPU acceleration.