01 What is Finetuning?
Finetuning is the process of taking a general-purpose pre-trained model and specialising it for a specific domain or task. The analogy that Sharon uses — and it really clicked for me — is turning a primary care physician into a specialist.
What does finetuning actually do for the model?
- Steers the model toward more consistent output
- Reduces hallucinations on domain-specific content
- Customises the model to a specific use-case and voice
- Uses the same training objective as the original pre-training — just with different data
02 Prompting vs Finetuning vs RAG
Before reaching for finetuning, it's worth knowing exactly where it sits relative to prompt engineering and retrieval-augmented generation (RAG). They're not mutually exclusive — in fact, you can and often should combine them.
| | Prompting | Finetuning |
|---|---|---|
| Pros | No data needed to start; smaller upfront cost; no technical knowledge needed; can connect data via RAG | Nearly unlimited data fits; learns new information; corrects incorrect information; lower cost per request (smaller model); can use RAG too |
| Cons | Much less data fits in context; forgets data between sessions; hallucinations; RAG can miss or return wrong data | Requires high-quality labelled data; upfront compute cost; needs some technical knowledge |
| Usage | Generic, side projects, prototypes | Domain-specific, enterprise, production, privacy-sensitive |
The mental model I use: prompting is for exploration, RAG is for connecting live external data, and finetuning is for when you need a model that reliably behaves a certain way — every single time.
03 Why Finetune Your Own LLM?
The case for finetuning goes well beyond accuracy. Once you have a finetuned model, you gain control across four dimensions that matter a lot in production.
Performance
- Reduces hallucinations on domain content
- Increases consistency and reliability
- Reduces unwanted or off-topic output
Privacy
- Deploy on-prem or in your own VPC
- Prevent data leakage to third-party APIs
- No risk of data breaches via external calls
Cost
- Lower cost per request: finetune a smaller model that matches a larger one's task performance
- Greater transparency into what you're running
- Greater control over model behaviour
Reliability
- Control your own uptime SLAs
- Lower latency, with no remote API calls
- Moderation baked in: guardrails and custom responses
Tools
- PyTorch — the standard for custom training loops
- HuggingFace Transformers — open source models, tokenizers, datasets
- Lamini (Llama library) — abstracts away boilerplate for fast iteration
04 Pretraining vs Finetuning Data
Understanding the difference between pretraining and finetuning data is fundamental — they serve completely different purposes and look nothing alike.
Pretraining
A model starts life with zero knowledge. It can't form words and knows nothing about the world. Pretraining teaches it language and knowledge through sheer scale.
- Next-word prediction on a giant corpus of text
- Often scraped from the internet — "unlabelled" data
- Open-source example: "The Pile" — 22 diverse datasets from the internet
- Expensive and time-consuming — self-supervised learning at massive scale
Finetuning after pre-training
After pretraining, a model knows the world — but it isn't a chatbot yet. Finetuning takes that knowledge and shapes the behaviour. Some key points worth remembering:
- Uses much less data than pretraining — quality matters far more than quantity
- Data can be unlabelled (self-supervised) or curated, labelled pairs
- Full finetuning updates the entire model, not just part of it (the PEFT methods in section 11 are the exception)
- Same objective as pretraining: next token prediction — just on different data
05 What Is Finetuning Actually Doing?
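Finetuning drives two types of change inside the model: behaviour change (responding more consistently and staying on task) and knowledge gain (learning new domain concepts and correcting incorrect information). It helps to be clear about which one you're targeting before you write a single line of training code.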
Tasks to finetune
All finetuning is ultimately text-in, text-out. Tasks break into two families:
- Extraction (text in, less text out) — reading, keyword extraction, topic classification, routing, agents that reason, plan, self-critique, or use tools
- Expansion (text in, more text out) — writing, conversation, summarisation, code generation
First time finetuning — the practical path
1. Identify candidate tasks by prompt engineering a base LLM and watching what it does
2. Find tasks the LLM does OK at: you need a performance baseline to improve on
3. Pick one task: scope creep kills finetuning experiments
4. Get ~1000 input/output pairs for that task, with outputs better than the "OK" output the base LLM gives
5. Finetune a small model (400M–1B parameters) and evaluate
06 Instruction Finetuning — GPT-3 → ChatGPT
Instruction finetuning is a specific subset of finetuning that teaches a model to follow instructions and behave like a chatbot. This is the technique that turned the raw completion engine of GPT-3 into ChatGPT — and scaled AI adoption from thousands of researchers to hundreds of millions of people.
Where does instruction data come from?
- Existing datasets — FAQs, customer support threads, Slack messages — anything that's naturally instruction + response shaped
- Convert your own data — take a README, internal docs, or product description and reformat as Q&A pairs using a prompt template
- Use another LLM to generate it: Stanford's Alpaca, for example, generated its instruction/response pairs automatically with an OpenAI model (text-davinci-003)
Prompt templates
How you format your data before training matters a lot. There are two standard templates — one for tasks that include additional input context, one for those that don't.
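The two templates are not reproduced in these notes. As a sketch, the Alpaca-style pair below is one widely used option; that these are the templates meant here is an assumption, and `format_example` is a hypothetical helper name.

```python
# Alpaca-style templates (assumed; use whatever format your base model expects).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def format_example(instruction: str, input_text: str = "") -> str:
    """Pick the template based on whether extra input context is present."""
    if input_text.strip():
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)


prompt = format_example("Summarise the text.", "Finetuning specialises a model.")
```

Whichever template you choose, use the same one at inference time so the model sees prompts in the shape it was trained on.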
Finetuning steps at a glance
Instruction finetuning is an iterative cycle, not a one-shot process: prepare data, train, evaluate, and repeat.
If evaluation reveals problems, you go back to data prep — more examples, better quality, different formatting — and run the cycle again.
07 Data Preparation
Garbage in, garbage out — nowhere is this truer than in finetuning. Data quality trumps data quantity almost every time.
- Errors in your data become errors in your model. Every example should be one you'd be proud of the model replicating.
- Low diversity leads to memorisation, not generalisation. Vary phrasing, topics, lengths.
- Real examples from your actual use-case outperform synthetic data, especially for writing tasks.
- More data helps, but it's the fourth priority. Fix quality and diversity first.
Steps to data preparation
1. Collect instruction-response pairs
2. Concatenate pairs and add a prompt template where applicable
3. Tokenise: pad short sequences, truncate long ones
4. Split into train and test sets
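The four steps can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: the whitespace "tokenizer" stands in for the base model's real tokenizer, the example pairs are made up, and every function name here is hypothetical.

```python
import random

PAD_ID = 0  # assumed padding id for this toy vocabulary


def build_vocab(texts):
    """Toy whitespace vocabulary; a real run would use the base model's tokenizer."""
    vocab = {"<pad>": PAD_ID}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_length):
    """Tokenise, then truncate long sequences and pad short ones to max_length."""
    ids = [vocab[word] for word in text.split()][:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


def train_test_split(examples, test_fraction=0.1, seed=0):
    """Shuffle deterministically, then hold out a fraction for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# 1. Collect pairs (made-up examples for illustration).
pairs = [
    ("What is finetuning?", "Specialising a pre-trained model."),
    ("What is RAG?", "Retrieval-augmented generation."),
    ("What is LoRA?", "Low-rank adaptation."),
    ("What is PEFT?", "Parameter-efficient finetuning."),
]
# 2. Concatenate each pair with a simple template.
texts = [f"### Question: {q} ### Answer: {a}" for q, a in pairs]
# 3. Tokenise with padding and truncation to a fixed length.
vocab = build_vocab(texts)
encoded = [encode(t, vocab, max_length=12) for t in texts]
# 4. Split into train and test sets.
train_set, test_set = train_test_split(encoded, test_fraction=0.25)
```

Fixing the shuffle seed keeps the split reproducible, so evaluation numbers stay comparable between runs.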
Tokenisation
Tokenisation converts human-readable strings into the integer sequences the model actually sees. Always use the tokenizer that was paired with your base model: a mismatched tokenizer maps text to different token IDs than the ones the model was trained on, so the model is effectively reading scrambled input.
Storing your data
- Save as JSONL files — one JSON object per line, easy to stream
- Upload to HuggingFace Datasets for sharing and reuse
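As a minimal sketch of the JSONL format, the helpers below write one JSON object per line and stream them back. The field names `question` and `answer`, the helper names, and the temporary file path are illustrative assumptions.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Stream records back one line at a time, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


records = [
    {"question": "What is finetuning?", "answer": "Specialising a pre-trained model."},
    {"question": "What is RAG?", "answer": "Retrieval-augmented generation."},
]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(path, records)
loaded = list(read_jsonl(path))
```

Because `read_jsonl` is a generator, a large file never has to fit in memory at once. If you use HuggingFace Datasets, `load_dataset("json", data_files=path)` reads the same format.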
08 Training
The training loop itself is standard gradient descent — the same as any neural network. What's different is the data you feed it and the starting point (a pre-trained model rather than random weights).
1. Feed a batch of training data into the model
2. Predict the next token at each position
3. Calculate loss: cross-entropy between prediction and ground truth
4. Backpropagate through the entire model
5. Update weights via the optimizer
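To make the five steps concrete without a deep learning framework, here is a toy sketch in plain Python. It assumes a one-layer bigram model (a table of logits indexed by the previous token), so the backpropagation step reduces to the closed-form softmax gradient, and the optimizer is plain gradient descent rather than AdamW. A real finetuning run would use PyTorch or a similar library; `train_bigram` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(token_ids, vocab_size, lr=0.5, epochs=50):
    """Toy next-token model: W[prev][nxt] is the logit of nxt following prev."""
    W = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(token_ids[:-1], token_ids[1:]))  # (input, target) positions
    history = []
    for _ in range(epochs):
        grads = [[0.0] * vocab_size for _ in range(vocab_size)]
        total_loss = 0.0
        for prev, target in pairs:                    # 1. feed the batch
            probs = softmax(W[prev])                  # 2. predict next token
            total_loss += -math.log(probs[target])    # 3. cross-entropy loss
            for j in range(vocab_size):               # 4. gradient: probs - one_hot
                grads[prev][j] += probs[j] - (1.0 if j == target else 0.0)
        for i in range(vocab_size):                   # 5. gradient descent update
            for j in range(vocab_size):
                W[i][j] -= lr * grads[i][j] / len(pairs)
        history.append(total_loss / len(pairs))
    return W, history


# A repeating sequence 0, 1, 2, 0, 1, 2, ... is perfectly predictable.
W, history = train_bigram([0, 1, 2, 0, 1, 2, 0, 1, 2], vocab_size=3)
```

With all logits starting at zero, the first recorded loss equals ln(3), the cross-entropy of a uniform guess over three tokens, and it falls as the model learns the pattern.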
Key hyperparameters
- Learning rate — how large each gradient step is
- Learning rate scheduler (LRS) — controls how the learning rate decays over time
- Optimizer hyperparameters — e.g. beta values in AdamW
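As an illustration of what a scheduler does, the sketch below implements one common shape: linear warmup followed by linear decay. The notes do not say which schedule is used, so this choice, the function name, and the example values are all assumptions.

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Ramp linearly up to base_lr, then decay linearly towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at every step of a hypothetical 100-step run.
schedule = [
    linear_warmup_decay(s, base_lr=1e-4, warmup_steps=10, total_steps=100)
    for s in range(100)
]
```

The warmup phase avoids large, noisy updates while the optimizer's running statistics are still settling, and the decay phase takes smaller steps as training converges.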
Lamini in 3 lines
If you want to skip the boilerplate and get to results fast, Lamini's library wraps the full training loop into three lines.
09 Evaluation
Evaluating generative models is genuinely hard. Unlike classification, there's no clean accuracy metric. Human expert evaluation remains the most reliable method, which is expensive and slow — hence the need for proxy benchmarks.
Elo-style comparisons are gaining traction — A/B tests or model tournaments where responses are ranked against each other rather than scored absolutely.
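For reference, the standard Elo update can be sketched in a few lines. The K-factor of 32, the starting rating of 1000, and the sequence of judgements are illustrative assumptions, not values from any particular leaderboard.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return new ratings; score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


ratings = {"base": 1000.0, "finetuned": 1000.0}
# Hypothetical pairwise judgements as (winner, loser).
judgements = [("finetuned", "base"), ("finetuned", "base"), ("base", "finetuned")]
for winner, loser in judgements:
    ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser], 1.0)
```

An upset moves ratings further than an expected result, which is why relative rankings settle after enough comparisons even when no response gets an absolute score.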
LLM Benchmarks (by EleutherAI)
Benchmark suites average scores across several evaluation tasks to give a single number that is comparable across models.
Error Analysis
Before finetuning, spend time understanding what the base model gets wrong. Categorise the failures so you can fix them in your data:
- Misspellings — surprisingly common in base models
- Too verbose — model answers with an essay when a sentence would do
- Repetitive — curb with stop tokens; also ensure your training data doesn't repeat heavily
10 Practical Approach to Finetuning
This is the workflow I'm keeping close to hand. It's deliberately conservative — start small, validate fast, scale only when you have evidence it's working.
1. Define your task: be precise about inputs, outputs, and what "good" looks like
2. Collect data related to those inputs and outputs
3. Generate synthetic data with a larger LLM if you don't have enough real examples
4. Finetune a small model (400M–1B parameters) for fast feedback at low cost
5. Vary the data volume to learn how sensitive your task is to data quantity
6. Evaluate rigorously: human review plus relevant benchmarks
7. Collect more data and iterate if performance isn't satisfactory
8. Increase task complexity gradually
9. Scale model size only for genuinely complex tasks: bigger isn't always better
Task complexity vs model size
- Writing tasks are harder — they require more output tokens, and typically need larger models
- Harder tasks + multiple simultaneous tasks require larger models
Model size and GPU memory
Training needs far more memory than inference. A 16GB GPU can only train a ~1B parameter model. Here's a quick reference:
| AWS Instance | GPUs | GPU Memory | Max inference (params) | Max training (params) |
|---|---|---|---|---|
| p3.2xlarge | 1× V100 | 16GB | 7B | 1B |
| p3.8xlarge | 4× V100 | 64GB | 7B | 1B |
| p3.16xlarge | 8× V100 | 128GB | 7B | 1B |
| p3dn.24xlarge | 8× V100 | 256GB | 14B | 2B |
| p4d.24xlarge | 8× A100 | 320GB HBM2 | 18B | 2.5B |
| p4de.24xlarge | 8× A100 | 640GB HBM2e | 32B | 5B |
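A rough back-of-the-envelope calculation shows why training needs so much more memory than inference. The sketch assumes 2 bytes per parameter for inference (fp16 weights) and about 16 bytes per parameter for mixed-precision training with an Adam-style optimizer (weights, gradients, and optimizer states). Both are rules of thumb, and activations and batch size are ignored.

```python
GIB = 1024 ** 3


def inference_gib(n_params, bytes_per_param=2):
    """Weights only, assuming fp16/bf16 storage."""
    return n_params * bytes_per_param / GIB


def training_gib(n_params, bytes_per_param=16):
    """Weights, gradients, and Adam states (assumed 16 bytes per parameter)."""
    return n_params * bytes_per_param / GIB


for size in (1e9, 7e9):
    print(
        f"{size / 1e9:.0f}B params: "
        f"inference ~{inference_gib(size):.1f} GiB, "
        f"training ~{training_gib(size):.1f} GiB"
    )
```

Under these assumptions a 1B model needs roughly 15 GiB to train, which just fits a 16GB GPU, while a 7B model fits the same card for inference only. That lines up with the first row of the table.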
11 LoRA & Parameter-Efficient Finetuning (PEFT)
Full finetuning updates every weight in the model — which is powerful but expensive. Parameter-Efficient Finetuning (PEFT) techniques train only a fraction of the parameters, dramatically reducing memory requirements while keeping most of the accuracy benefit.
Selection-based — freeze most weights, select a sparse subset to update.
Reparametrisation-based (LoRA) — decompose weight updates into low-rank matrices and train those instead.
LoRA — Low-Rank Adaptation
LoRA is the most widely used PEFT technique. Instead of modifying the original weights directly, it trains a pair of small matrices whose product approximates the weight update. At inference, these are merged back into the main weights — so there's no additional latency compared to the full finetuned model.
- For GPT-3, trainable parameters reduced by 10,000× vs full finetuning
- Accuracy is slightly below full finetuning — an acceptable tradeoff in most cases
- Same inference latency — LoRA weights merge with base weights at inference
- Train new low-rank matrices, freeze the main weights
- The new weights are a low-rank decomposition of the change to the original weights
- Train separately, merge at inference time
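The idea can be sketched with tiny matrices in plain Python. The layer size (4096 × 4096), the rank of 8, the `scale` parameter, and the helper names are illustrative assumptions; real implementations work on tensors and apply adapters only to selected layers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix: full update vs LoRA pair."""
    return d_out * d_in, rank * (d_out + d_in)


def merge_lora(W, B, A, scale=1.0):
    """Fold the low-rank update into the frozen weights: W + scale * (B @ A)."""
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


full, lora = lora_param_counts(4096, 4096, rank=8)

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (2x2 toy example)
B = [[1.0], [2.0]]            # trainable, shape (d_out, rank)
A = [[3.0, 4.0]]              # trainable, shape (rank, d_in)
merged = merge_lora(W, B, A)
```

For this single matrix the LoRA pair has 65,536 trainable parameters against 16,777,216 for a full update, a 256-fold reduction. The overall saving for a whole model depends on the rank and on which matrices get adapters. After merging, inference uses one matrix of the original shape, which is why latency is unchanged.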