Part 3: Training on a Rented GPU
Why Not Train Locally?
Our development machine is a MacBook Pro with an M4 Pro chip. Apple Silicon can train PyTorch models via the MPS (Metal Performance Shaders) backend, and it works — we tested it. But "works" and "practical" are different things.
The math:
- 477K training pairs × 3 epochs = ~1.4M training examples seen
- Batch size 16 on MPS (limited by unified memory bandwidth)
- Estimated time: 4–6 hours on M4 Pro
On an A100:
- Batch size 64 (80GB VRAM gives plenty of headroom for a 149M parameter model)
- Estimated time: 1–2 hours
The batch size difference matters beyond just speed. With MultipleNegativesRankingLoss, each training step uses the other items in the batch as negative examples. Batch size 64 gives the model 63 negatives per step vs 15 with batch size 16. More negatives = sharper contrastive signal = better embeddings.
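For concreteness, the optimizer step counts implied by these batch sizes fall straight out of the numbers above (pure arithmetic, nothing model-specific):

```python
import math

pairs, epochs = 477_385, 3  # training pairs and epoch count from the text

def total_steps(batch_size):
    # Steps per epoch (last partial batch rounds up), times number of epochs
    return math.ceil(pairs / batch_size) * epochs

mps_steps = total_steps(16)   # batch 16 on MPS  -> 89,511 steps
a100_steps = total_steps(64)  # batch 64 on A100 -> 22,380 steps
```

The A100 run does roughly a quarter as many optimizer steps, each over four times as many pairs, which is where most of the wall-clock difference comes from.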
Renting an A100 on Lambda Labs costs a few dollars per hour. Total cost for our training run: less than the price of a coffee.
The Training Script
The core training code is straightforward with sentence-transformers v3.0+:
```python
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Load base model
model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

# MNR loss: in-batch negatives contrastive learning
loss = losses.MultipleNegativesRankingLoss(model)

# Training arguments
training_args = SentenceTransformerTrainingArguments(
    output_dir="models/ragtoolina-embed-v1",
    num_train_epochs=3,
    per_device_train_batch_size=64,  # 16 on MPS, 64 on A100
    learning_rate=2e-5,
    warmup_ratio=0.05,
    max_grad_norm=1.0,
    bf16=True,
    eval_strategy="steps",
    eval_steps=200,
    save_steps=200,
    save_total_limit=5,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="code-retrieval-eval_cosine_ndcg@10",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()
```
Let's unpack the key decisions.
MultipleNegativesRankingLoss
MNR loss is the standard for training bi-encoder embedding models when you only have positive pairs (query, relevant_code). You don't need to curate negative examples — the loss function takes every other positive in the batch and treats it as a negative.
For a batch of 64 pairs:
- Pair 0: query_0 should match code_0
- Pair 0: query_0 should NOT match code_1, code_2, ..., code_63
- Same for all other pairs
The loss pushes cosine(query_i, code_i) up and cosine(query_i, code_j) down for all j ≠ i. With batch size 64, that's 63 negatives per step — enough for the model to learn meaningful distinctions.
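The mechanics can be sketched in a few lines of plain Python (real training uses torch tensors and the model's pooled outputs; the `scale` temperature of 20 is the sentence-transformers default, everything else here is illustrative):

```python
import math

def cosine(a, b):
    # Cosine similarity of two embedding vectors given as plain lists
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mnr_loss(query_embs, code_embs, scale=20.0):
    """Cross-entropy over in-batch similarities: for each query i,
    code i is the 'correct class' and every other code is a negative."""
    n = len(query_embs)
    total = 0.0
    for i in range(n):
        # Scaled similarities of query i against every code in the batch
        scores = [scale * cosine(query_embs[i], c) for c in code_embs]
        log_softmax_i = scores[i] - math.log(sum(math.exp(s) for s in scores))
        total += -log_softmax_i
    return total / n
```

When query i's embedding already sits closest to code i, the per-row softmax is nearly one-hot and the loss is near zero; when it sits closer to some other code, that row's loss grows, which is exactly the gradient signal described above.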
Learning Rate: 2e-5
This is a standard rate for BERT-style fine-tuning, which typically uses learning rates in the 2e-5 to 5e-5 range. 2e-5 worked well here because:
- We're not training from scratch — the base model already has strong code representations
- We want the model to adapt quickly to our data distribution
- With warmup_ratio=0.05, the first 5% of steps ramp up gradually, preventing early instability
If the learning rate were too high, we'd see the evaluation metric oscillating or degrading. With our eval_steps=200, we'd catch that early.
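The schedule can be sketched as follows, assuming the transformers default "linear" scheduler (linear warmup, then linear decay to zero; the actual script may configure something else):

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.05):
    # warmup_ratio=0.05 -> first 5% of steps ramp linearly from 0 to base_lr
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # After warmup: linear decay from base_lr down to 0 at the final step
    remaining = total_steps - warmup_steps
    return base_lr * max(0.0, (total_steps - step) / remaining)
```

The gentle ramp means the first few hundred updates, applied while the optimizer's moment estimates are still garbage, can't fling the pretrained weights far from their starting point.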
bf16
bfloat16 mixed precision. Halves memory usage compared to fp32, which is critical for fitting batch_size=64 on a single GPU. The A100 has native bf16 support with no speed penalty.
We use bf16 even on MPS (M4 Pro supports it natively). If your GPU doesn't support bf16, switch to fp16 — for embedding models, the difference is negligible.
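A rough sketch of the parameter-level arithmetic (weights only; activations, gradients, and optimizer state add more on top, but the halving ratio is the same for weights and activations):

```python
params = 149_000_000  # 149M parameters, the model size from the text

def param_megabytes(bytes_per_param):
    # Memory for the weights alone, in MB
    return params * bytes_per_param / 1e6

fp32_mb = param_megabytes(4)  # ~596 MB, consistent with the ~600MB safetensors download
bf16_mb = param_megabytes(2)  # ~298 MB
```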
Evaluation During Training
Every 200 steps, the trainer runs an InformationRetrievalEvaluator on 1000 held-out pairs:
```python
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="code-retrieval-eval",
    mrr_at_k=[5, 10],
    ndcg_at_k=[5, 10],
    accuracy_at_k=[1, 5],
)
```
This evaluator builds a mini-retrieval system: embed all queries and all corpus documents, compute cosine similarity, rank, and measure MRR/NDCG/Accuracy. The best checkpoint is selected by cosine_ndcg@10.
We cap the eval set at 1000 pairs (from 25K available) to keep evaluation fast. On an A100, evaluating 1000 pairs takes ~10 seconds. With 25K pairs, it would be ~4 minutes every 200 steps — too slow.
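The evaluator's three inputs are plain dicts keyed by string IDs, with `relevant_docs` mapping each query ID to a set of relevant document IDs. A minimal sketch of assembling them from held-out (query, code) pairs (the function name and ID scheme are ours, not from the actual script):

```python
def build_ir_eval_inputs(eval_pairs):
    """eval_pairs: list of (query, code) tuples from the held-out set."""
    queries, corpus, relevant_docs = {}, {}, {}
    for i, (query, code) in enumerate(eval_pairs):
        qid, did = f"q{i}", f"d{i}"
        queries[qid] = query
        corpus[did] = code
        relevant_docs[qid] = {did}  # each query has exactly one relevant doc
    return queries, corpus, relevant_docs
```

With one relevant document per query, accuracy@1 reduces to "did the right code snippet rank first", which makes the metrics easy to interpret.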
Checkpoint Management
```python
save_steps=200,
save_total_limit=5,
load_best_model_at_end=True,
```
A checkpoint is saved every 200 steps, keeping only the 5 most recent. When training finishes, the trainer automatically loads whichever checkpoint had the best cosine_ndcg@10 score. This protects against overfitting — if epoch 3 degrades quality, we automatically fall back to the best checkpoint from epoch 2.
The Cloud Workflow
Training locally requires no infrastructure setup. Training on a rented GPU requires some orchestration. We kept it simple: a single bash script that SSHes into a Lambda Labs instance, uploads data, runs training, and downloads the result.
Step 0: Prepare Data Locally
Before touching any GPU, we run data preparation on the local machine:
```bash
python training/prepare_data.py       # Download & filter CodeSearchNet
python training/generate_synthetic.py # Generate Swift/TS pairs
```
This produces three JSON files:
- training/data/train.json (475K pairs, ~180MB)
- training/data/eval.json (25K pairs, ~10MB)
- training/data/synthetic_pairs.json (2.4K pairs, ~3MB)
Why prepare locally? Two reasons:
- Avoid installing HuggingFace datasets on the GPU instance. The CodeSearchNet download requires hf_hub_download, which pulls ~2GB of zip archives. Doing this on a rented GPU wastes billable time.
- Reproducibility. The data files are deterministic (seed=42). We can inspect them, version them, and re-use them across training runs.
Step 1: Launch and Upload
```bash
#!/bin/bash
LAMBDA_HOST="${1:?Usage: $0 <lambda-host-or-ip>}"
SSH_CMD="ssh -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no ubuntu@${LAMBDA_HOST}"
SCP_CMD="scp -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no"

# Create remote structure
${SSH_CMD} "mkdir -p ~/ragtoolina-train/training/data ~/ragtoolina-train/models"

# Upload scripts and data
${SCP_CMD} training/train.py "ubuntu@${LAMBDA_HOST}:~/ragtoolina-train/training/"
${SCP_CMD} training/config.yaml "ubuntu@${LAMBDA_HOST}:~/ragtoolina-train/training/"
${SCP_CMD} training/data/train.json "ubuntu@${LAMBDA_HOST}:~/ragtoolina-train/training/data/"
${SCP_CMD} training/data/eval.json "ubuntu@${LAMBDA_HOST}:~/ragtoolina-train/training/data/"
${SCP_CMD} training/data/synthetic_pairs.json "ubuntu@${LAMBDA_HOST}:~/ragtoolina-train/training/data/"
```
Step 2: Install Dependencies and Patch Config
```bash
# Install deps
${SSH_CMD} "pip install -q torch transformers sentence-transformers \
    datasets huggingface-hub numpy tqdm pyyaml"

# Patch config for CUDA: batch_size 16→64, device mps→cuda
${SSH_CMD} "cd ~/ragtoolina-train && \
    sed -i 's/batch_size: 16/batch_size: 64/' training/config.yaml && \
    sed -i 's/device: \"mps\"/device: \"cuda\"/' training/config.yaml"
```
The sed patches are crude but effective. The config file defaults to MPS for local development. On the GPU instance, we switch to CUDA and bump the batch size.
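For reference, the two patched keys look like this in the local config (only these lines are implied by the sed commands; the rest of config.yaml is not shown here):

```yaml
# Local-development defaults in training/config.yaml
batch_size: 16
device: "mps"
```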
Step 3: Train
```bash
${SSH_CMD} "cd ~/ragtoolina-train && \
    nohup python training/train.py > training/train.log 2>&1 &"
```
nohup ensures training continues if the SSH connection drops. We monitor via:
```bash
ssh ubuntu@${LAMBDA_HOST} 'tail -f ~/ragtoolina-train/training/train.log'
```
And check GPU utilization:
```bash
ssh ubuntu@${LAMBDA_HOST} 'nvidia-smi'
```
Step 4: Download the Model
When training completes, the final model is saved as HuggingFace format (safetensors + config):
```bash
scp -r ubuntu@${LAMBDA_HOST}:~/ragtoolina-train/models/ragtoolina-embed-v1-final/ \
    models/ragtoolina-embed-v1-hf/
```
The download is ~600MB (fp32 safetensors). We convert to GGUF locally.
Training Log Analysis
The actual training log from our run:
```
=== Training Alibaba-NLP/gte-modernbert-base ===
Output: models/ragtoolina-embed-v1
Epochs: 3, Batch: 16, LR: 2e-05
Added 2385 synthetic pairs
Train: 477385, Eval: 25000
```
(Note: the log shows batch 16 because this particular run was on MPS. The A100 run used batch 64 with the same result.)
The training completed successfully. The load_best_model_at_end=True setting automatically selected the best checkpoint based on evaluation metrics.
Hyperparameter Choices We Considered
Epochs: 3
For contrastive learning with MNR loss, 1–3 epochs is typical. More epochs risk overfitting — the model starts memorizing specific (query, code) associations instead of learning general code retrieval patterns.
We started conservative (config default: 1 epoch) and bumped to 3 after seeing that evaluation metrics were still improving at the end of epoch 1.
Why Not Use InfoNCE or Triplet Loss?
MNR loss is functionally equivalent to InfoNCE loss with in-batch negatives. We could have used losses.CachedMultipleNegativesRankingLoss for even larger effective batch sizes (it computes gradients in chunks), but at batch_size=64, standard MNR was sufficient.
Triplet loss requires explicit (anchor, positive, negative) triplets. We'd need to mine hard negatives, which adds a data preparation step we wanted to avoid.
Why Not Multi-Stage Training?
Some embedding model papers use a two-stage approach: first train with weak pairs (title → passage), then fine-tune with hard pairs (query → passage with curated negatives). This works for large-scale training (millions of pairs), but for our 477K pairs over 3 epochs, a single stage was enough.
Cost Breakdown
| Item | Cost | Time |
|------|------|------|
| Lambda A100 instance | ~$1.50/hr | 2 hours |
| Data upload (SCP) | — | 5 min |
| Model download (SCP) | — | 5 min |
| Total | ~$3 | ~2.5 hours |
Compare to training on M4 Pro MPS:
- Time: 4–6 hours
- Cost: $0 (but your laptop is a space heater for half a day)
- Batch size: 16 (vs 64 on A100, fewer in-batch negatives)
The GPU rental pays for itself in time savings alone. And the larger batch size likely produces a better model.
What We'd Do Differently
Weights & Biases Integration
We didn't set up W&B or TensorBoard for this run. The training log and checkpoint evaluation were enough for a single training run. But if we were iterating on hyperparameters (learning rate sweeps, epoch counts), proper experiment tracking would be essential.
Gradient Accumulation
If we wanted to simulate even larger batch sizes without more GPU memory, we could use gradient accumulation:
```python
training_args = SentenceTransformerTrainingArguments(
    gradient_accumulation_steps=4,  # Effective batch = 64 * 4 = 256
    ...
)
```
One caveat: gradient accumulation does not sharpen the contrastive signal. MNR's negatives come from within a single forward pass, so each step would still see only 63 negatives even at effective batch 256; CachedMultipleNegativesRankingLoss is the tool for genuinely scaling in-batch negatives. Either way, for 477K pairs we didn't see evidence that larger batches would help further.
Multi-GPU Training
Lambda offers multi-GPU instances. SentenceTransformerTrainer supports distributed training out of the box. For our model size (149M params) and data size (477K pairs), a single A100 was more than sufficient. Multi-GPU would matter if we scaled to the full 2M CodeSearchNet pairs.