Part 2: Building the Training Dataset — CodeSearchNet + Synthetic Pairs
The Data Strategy
Fine-tuning an embedding model for code search is fundamentally a contrastive learning problem. You need pairs: a natural language query and the code snippet it should match. At training time, the model learns to push matching pairs closer together in embedding space and non-matching pairs further apart.
We had two data sources:
- CodeSearchNet — a large open-source dataset of (docstring, function) pairs across six languages
- Synthetic pairs — template-generated (query, code) pairs from our own Swift and TypeScript codebase
Neither source alone would have been enough. Here's why.
Source 1: CodeSearchNet
CodeSearchNet is the standard dataset for training code retrieval models. Originally published by GitHub in 2019, it contains ~2 million (docstring, function_body) pairs across six languages: Python, JavaScript, Go, Java, Ruby, and PHP.
The logic is simple: a function's docstring is a natural language description of what the function does. Pair them up, and you have training data for a model that maps natural language queries to code.
Downloading in 2026
Here's the first gotcha we hit. The HuggingFace datasets library (v4.x+) dropped support for the CodeSearchNet loading script. If you try:
from datasets import load_dataset
ds = load_dataset("code_search_net") # Fails in datasets >= 4.x
You'll get an error about a missing loading script. The fix is to download the raw archives directly:
from huggingface_hub import hf_hub_download

zip_path = hf_hub_download(
    "code-search-net/code_search_net",
    f"data/{language}.zip",
    repo_type="dataset",
)
Each language archive contains gzipped JSONL shards organized by split (train/, test/, valid/). We only use the training split. Our prepare_data.py opens the zip, finds all train/*.jsonl.gz files, and streams the records:
import gzip
import json
import zipfile

items = []
with zipfile.ZipFile(zip_path) as zf:
    train_files = sorted(
        n for n in zf.namelist()
        if "/train/" in n and n.endswith(".jsonl.gz")
    )
    for tf in train_files:
        with zf.open(tf) as f:
            with gzip.open(f) as gz:
                for line in gz:
                    items.append(json.loads(line))
Filtering
Raw CodeSearchNet is noisy. Many records have empty docstrings, one-line functions, or docstrings that are just TODO or @deprecated. We applied three filters:
if not docstring or not code:
    continue
if len(docstring) < 10 or len(code) < 50:
    continue
if len(docstring) > 500 or len(code) > 2000:
    continue
Why these thresholds?
- Docstring 10–500 chars: Below 10, it's usually a stub ("TODO", "fixme"). Above 500, it's often a full API reference copy-pasted into the docstring — too verbose to resemble a real search query.
- Code 50–2000 chars: Below 50, it's a trivial getter/setter that doesn't teach the model anything useful. Above 2000, it's a mega-function that would exceed our chunk size in production (Ragtoolina chunks code at 30–500 tokens, roughly 120–2000 characters).
After filtering, we shuffled with seed=42 and took 500,000 pairs. The final split: 475,000 for training, 25,000 for evaluation (5% held out).
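The shuffle-and-split step is simple enough to sketch. This is a minimal stand-in (function and argument names are hypothetical, not from our actual prepare_data.py), showing the deterministic seed=42 shuffle and the 95/5 split:

```python
import random

def shuffle_and_split(pairs, take=500_000, eval_frac=0.05, seed=42):
    # Deterministic shuffle so the train/eval split is reproducible.
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    pairs = pairs[:take]
    # Hold out the last 5% for evaluation.
    n_eval = int(len(pairs) * eval_frac)
    return pairs[:-n_eval], pairs[-n_eval:]
```

With the full 500K filtered pairs this yields 475,000 training and 25,000 evaluation pairs.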
Language Distribution
The filtered data is not evenly distributed across languages. Python and JavaScript yield the most pairs (docstrings are culturally expected there). Go and Java are well-represented. Ruby and PHP have fewer pairs but still contribute meaningfully.
We didn't balance the languages artificially. The natural distribution suits our use case: most Ragtoolina users work in Python, JavaScript, or TypeScript, and that's where the model needs the strongest performance.
Source 2: Synthetic Pairs (Swift + TypeScript)
Here's the problem: CodeSearchNet covers Python, JavaScript, Go, Java, Ruby, and PHP. It does not cover:
- TypeScript — the most popular typed JS variant, used heavily in modern web development
- Swift — the language our own macOS app is written in
Our users absolutely search Swift and TypeScript codebases. We needed training data for these languages, but there's no public equivalent of CodeSearchNet for them.
The Template-Based Approach
Instead of using an LLM to generate queries (which would require API costs and introduce hallucination risks), we built a template-based generator that extracts queries from the code itself.
The pipeline has three stages:
Stage 1: Read and Chunk
Walk the project directories and split files into meaningful chunks:
swift_dirs = ["/Users/nikitastogniy/Work/RAG/RAGtool/RAGtool"]
ts_dirs = ["/Users/nikitastogniy/Work/RAG/rag-back/src"]
Files are split by regex-matching language-specific boundaries:
- Swift: func, class, struct, enum, protocol keywords
- TypeScript: function, class, interface, type, const ... = async patterns
Each chunk is capped at 1500 characters. If a file doesn't match any splitting patterns (e.g., a config file), it's split into 1000-character windows.
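A simplified sketch of the Swift path (the boundary regex is condensed from the description above; the real splitter also handles the TypeScript patterns):

```python
import re

# Simplified Swift declaration boundary; hypothetical, not the exact production regex.
SWIFT_BOUNDARY = re.compile(r'^\s*(?:func|class|struct|enum|protocol)\b', re.MULTILINE)

def chunk_file(content: str, max_len: int = 1500, window: int = 1000) -> list[str]:
    starts = [m.start() for m in SWIFT_BOUNDARY.finditer(content)]
    if not starts:
        # No recognizable boundaries (e.g. a config file): fixed-size windows.
        return [content[i:i + window] for i in range(0, len(content), window)]
    if starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(content)]
    # One chunk per declaration, capped at max_len characters.
    return [content[a:b][:max_len] for a, b in zip(starts, ends)]
```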
Stage 2: Generate Queries
For each chunk, we generate queries from four sources:
Filename-based queries:
EmbeddingService.swift → "how to embedding service"
LlamaServerManager.swift → "how to llama server manager"
The camelCase_to_words function splits CamelCase names and lowercases them. The "how to" prefix makes it resemble a real developer query.
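The splitting itself is a one-line regex. A minimal sketch (the helper name matches the one mentioned above; the exact implementation in our script may differ):

```python
import re

def camelCase_to_words(name: str) -> str:
    # Insert a space before every capital letter (except the first), then lowercase.
    return re.sub(r'(?<!^)(?=[A-Z])', ' ', name).lower()

def filename_query(filename: str) -> str:
    stem = filename.rsplit('.', 1)[0]   # drop the extension
    return "how to " + camelCase_to_words(stem)
```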
Function name queries:
func_matches = re.findall(r'func\s+(\w+)', content) # Swift
func parseSymbols() → "parse symbols"
func startServer() → "start server"
We take up to 3 function names per chunk and skip names shorter than 3 characters (common iterator variables like i, j, k).
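Putting the pieces together for Swift, a sketch of the function-name generator (names hypothetical):

```python
import re

def function_queries(content: str, limit: int = 3) -> list[str]:
    queries = []
    for name in re.findall(r'func\s+(\w+)', content):  # Swift declarations
        if len(name) < 3:
            continue  # skip very short names
        # Split camelCase and lowercase to form a query.
        queries.append(re.sub(r'(?<!^)(?=[A-Z])', ' ', name).lower())
        if len(queries) == limit:
            break
    return queries
```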
Type name queries:
class MCPServerManager → "m c p server manager implementation"
struct RAGChunk → "r a g chunk implementation"
The word "implementation" is appended because developers often search for "X implementation" when they want the concrete code.
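The type-name generator works the same way, with the suffix appended (a sketch for the Swift keywords; TypeScript would match class/interface/type instead):

```python
import re

def type_queries(content: str) -> list[str]:
    names = re.findall(r'(?:class|struct|enum|protocol)\s+(\w+)', content)
    return [
        # camelCase split + lowercase, then the "implementation" suffix.
        re.sub(r'(?<!^)(?=[A-Z])', ' ', n).lower() + " implementation"
        for n in names
    ]
```

Note that consecutive capitals get split letter-by-letter, which is where queries like "m c p server manager implementation" come from.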
Comment-based queries:
doc_patterns = [
    r'///\s*(.+)',                   # Swift doc comments
    r'/\*\*\s*\n?\s*\*?\s*(.+)',     # JSDoc
    r'//\s*(.+)',                    # Inline comments
]
Doc comments are the closest thing to natural language descriptions that already exist in the code. A comment like /// Splits source code into semantic chunks is almost exactly the kind of query a developer would type.
We filter comments to 10–200 characters (too short = noise, too long = paragraph-level documentation that doesn't resemble a search query).
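A trimmed sketch of extraction plus the length filter (the inline `//` pattern is omitted here to keep the example short, since it also matches `///` lines; in the real pipeline the Stage 3 dedup cleans up such overlap):

```python
import re

DOC_PATTERNS = [
    r'///\s*(.+)',                   # Swift doc comments
    r'/\*\*\s*\n?\s*\*?\s*(.+)',     # JSDoc
]

def comment_queries(content: str) -> list[str]:
    queries = []
    for pattern in DOC_PATTERNS:
        for match in re.findall(pattern, content):
            text = match.strip()
            if 10 <= len(text) <= 200:  # drop noise and paragraph-length docs
                queries.append(text)
    return queries
```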
Stage 3: Deduplicate
Many chunks share function names or similar comments. We deduplicate by query text:
seen = set()
unique_pairs = []
for p in pairs:
    if p["query"] not in seen:
        seen.add(p["query"])
        unique_pairs.append(p)
Results
From the Ragtoolina Swift app (~97 source files) and TypeScript backend (~40 source files):
| Language   | Pairs  |
|------------|--------|
| Swift      | ~1,500 |
| TypeScript | ~900   |
| Total      | 2,385  |
That's 2,385 unique (query, code) pairs. A tiny fraction of the 475K CodeSearchNet pairs — just 0.5% of the total training data.
Why It Matters Despite Being Small
You might wonder: does 0.5% of training data even matter? Yes, for two reasons:
1. Language coverage. Without these pairs, the model has never seen Swift or TypeScript during fine-tuning. It would still work (the base model was pre-trained on diverse text), but the fine-tuning stage would only reinforce patterns from Python/JavaScript/Go/Java/Ruby/PHP. The synthetic pairs ensure Swift and TypeScript get representation in the contrastive learning objective.
2. Domain-specific vocabulary. Our synthetic pairs contain terms like MCPServer, TreeSitterParser, RAGChunk, Qdrant, llama-server — the exact vocabulary that Ragtoolina users search for. CodeSearchNet has never seen these terms. Even 2,400 pairs with this vocabulary give the model a signal about what "code search in the Ragtoolina context" looks like.
Combining the Sources
The training script merges both sources simply:
train_pairs = load_pairs("training/data/train.json")  # CodeSearchNet
synthetic_path = config.get("synthetic_pairs")
if os.path.exists(synthetic_path):
    synthetic = load_pairs(synthetic_path)
    train_pairs.extend(synthetic)
    print(f"Added {len(synthetic)} synthetic pairs")
No weighting, no oversampling. The synthetic pairs are appended to the end of the training data. With MultipleNegativesRankingLoss and random batching, they get mixed naturally into the training loop.
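To make "in-batch negatives" concrete, here is a toy, dependency-free version of what MultipleNegativesRankingLoss computes (the real implementation lives in sentence-transformers and runs on GPU tensors; this sketch assumes unit-norm embeddings):

```python
import math

def mnr_loss(query_embs, code_embs, scale=20.0):
    # Each query's positive is the code at the same index; every other
    # code in the batch acts as a free negative. The loss is cross-entropy
    # over the scaled similarity scores.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total = 0.0
    for i, q in enumerate(query_embs):
        scores = [dot(q, c) * scale for c in code_embs]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]  # -log softmax(positive)
    return total / len(query_embs)
```

Because negatives come from whatever else is in the batch, random shuffling alone is enough to mix the synthetic pairs into the objective.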
The final numbers:
| Source                 | Pairs   | % of Total |
|------------------------|---------|------------|
| CodeSearchNet (train)  | 475,000 | 99.5%      |
| Synthetic (Swift + TS) | 2,385   | 0.5%       |
| Total training         | 477,385 | 100%       |
| CodeSearchNet (eval)   | 25,000  | —          |
What We'd Do Differently
LLM-Generated Queries
The template approach is fast and free, but the queries it produces are sometimes awkward: "how to m c p server manager" is not how anyone would actually search. An LLM could generate more natural queries:
- Template: "how to llama server manager"
- LLM: "start and manage the local llama.cpp embedding server"
Our generate_synthetic.py has a placeholder for this. We'll likely use Claude in a future iteration, generating 5 queries per code chunk and keeping the best ones.
Hard Negatives
All our training uses in-batch negatives (from MNR loss). We didn't mine hard negatives — pairs where the code is related but not the right answer. For example, "start the embedding server" → LocalLlamaEmbedding.swift (wrong, the answer is LlamaServerManager.swift) would be a hard negative.
Hard negative mining would likely improve the model's ability to distinguish between related files in the same module. But it adds complexity to the data pipeline, and our results were already sufficient without it.
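If we do add hard negatives, the mining step is conceptually simple: embed the corpus, rank candidates for each training query, and keep the high-scoring hits that are not the gold answer. A dependency-free sketch (names hypothetical, dot product standing in for cosine similarity on unit-norm embeddings):

```python
def mine_hard_negatives(query_emb, gold_idx, code_embs, k=2):
    # Rank all candidates by similarity to the query; the best-scoring
    # candidates that are NOT the gold answer are the hard negatives.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(
        range(len(code_embs)),
        key=lambda i: -dot(query_emb, code_embs[i]),
    )
    return [i for i in ranked if i != gold_idx][:k]
```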
More Languages
TypeScript and Swift aren't the only languages missing from CodeSearchNet. Rust, Kotlin, Dart, C#, and many others are absent. If our user base grows into those ecosystems, we'd need to generate synthetic pairs for them too.
Data Preparation Pipeline Summary
CodeSearchNet (HuggingFace)
│
├─ Download 6 language zips via hf_hub_download
├─ Parse .jsonl.gz shards from each zip
├─ Filter: 10 < docstring < 500, 50 < code < 2000
├─ Shuffle (seed=42), take 500K
└─ Split: 95% train (475K), 5% eval (25K)
Ragtoolina Source Code
│
├─ Walk Swift + TypeScript directories
├─ Split files into chunks by function/class boundaries
├─ Generate queries: filename, function name, type name, comments
├─ Deduplicate by query text
└─ Output: 2,385 synthetic pairs
Merge:
train.json (475K) + synthetic_pairs.json (2.4K) → 477,385 training pairs
eval.json (25K) → held out for InformationRetrievalEvaluator