Finetuning Large Language Models

01 What is Finetuning?

Finetuning is the process of taking a general-purpose pre-trained model and specialising it for a specific domain or task. The analogy that Sharon uses — and it really clicked for me — is turning a primary care physician into a specialist.

🩺 The Physician Analogy

A base model is like a primary care physician. Describe skin irritation, redness, and itching — it'll say "probably acne." A model finetuned on dermatology data gives you: "You have a mix of non-inflammatory comedonal acne and inflammatory papulopustular acne." Same question, radically more specific answer.

What does finetuning actually do for the model?

Steers the model toward more consistent output
Reduces hallucinations on domain-specific content
Customises the model to a specific use-case and voice
Uses the same training objective as the original pre-training — just with different data

02 Prompting vs Finetuning vs RAG

Before reaching for finetuning, it's worth knowing exactly where it sits relative to prompt engineering and retrieval-augmented generation (RAG). They're not mutually exclusive — in fact, you can and often should combine them.

	Prompting	Finetuning
Pros	No data needed to start; smaller upfront cost; no technical knowledge needed; can connect data via RAG	Nearly unlimited data fits; learns new information; corrects incorrect information; lower cost per request (smaller model); can use RAG too
Cons	Much less data fits in context; forgets data between sessions; hallucinations; RAG can miss or return wrong data	Requires high-quality labelled data; upfront compute cost; needs some technical knowledge
Usage	Generic use, side projects, prototypes	Domain-specific, enterprise, production, privacy-sensitive

The mental model I use: prompting is for exploration, RAG is for connecting live external data, and finetuning is for when you need a model that reliably behaves a certain way — every single time.

03 Why Finetune Your Own LLM?

The case for finetuning goes well beyond accuracy. Once you have a finetuned model, you gain control across four dimensions that matter a lot in production.

⚡ Performance

Reduces hallucinations on domain content
Increases consistency and reliability
Reduces unwanted or off-topic output

🔒 Privacy

Deploy on-prem or in your own VPC
Prevent data leakage to third-party APIs
Lower risk of data breaches, since no data leaves through external calls

💰 Cost

Lower cost per request — finetune a smaller model that matches a larger one's task performance
Greater transparency into what you're running
Greater control over model behaviour

🛡️ Reliability

Control your own uptime SLAs
Lower latency — no remote API calls
Moderation baked in — guardrails and custom responses

Tools

PyTorch — the standard for custom training loops
HuggingFace Transformers — open source models, tokenizers, datasets
Lamini (Llama library) — abstracts away boilerplate for fast iteration

python · setup

import os
import lamini

# Credentials are read from environment variables
lamini.api_url = os.getenv("POWERML__PRODUCTION__URL")
lamini.api_key = os.getenv("POWERML__PRODUCTION__KEY")

from llama import BasicModelRunner

prompt = "Tell me how to train my dog to sit"

# Base model: no instruction tuning, so it tends to autocomplete the prompt
non_finetuned = BasicModelRunner("meta-llama/Llama-2-7b-hf")
non_finetuned_output = non_finetuned(prompt)
print(non_finetuned_output)

# Finetuned (chat) model
finetuned_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")
finetuned_output = finetuned_model(prompt)
print(finetuned_output)

# Wrap with [INST] tags to avoid autocomplete behaviour
print(finetuned_model(f"[INST]{prompt}[/INST]"))

04 Pretraining vs Finetuning Data

Understanding the difference between pretraining and finetuning data is fundamental — they serve completely different purposes and look nothing alike.

Pretraining

A model starts life with zero knowledge: it can't form words and knows nothing about the world. Pretraining teaches it language and knowledge through sheer scale.

Next-word prediction on a giant corpus of text
Often scraped from the internet — "unlabelled" data
Open-source example: "The Pile" — 22 diverse datasets from the internet
Expensive and time-consuming — self-supervised learning at massive scale
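To make "next-word prediction" concrete, here's a minimal sketch of how raw text becomes training examples without anyone labelling it. It assumes a toy whitespace tokenizer (real models use subword tokenizers), and make_next_token_pairs is a hypothetical helper name of my own.

```python
def make_next_token_pairs(text):
    """Turn raw text into (context, next_token) training examples.

    Toy assumption: tokens are whitespace-separated words. Real LLMs
    use subword tokenizers, but the shifting idea is the same.
    """
    tokens = text.split()
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[:i]   # everything seen so far
        target = tokens[i]     # the token the model must predict
        pairs.append((context, target))
    return pairs


pairs = make_next_token_pairs("Beginners BBQ Class Taking Place")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

The "label" for each example is just the next token of the same text, which is why this counts as self-supervised learning.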

python · pretraining data sample (C4)

from datasets import load_dataset
import itertools

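# Recent versions of the datasets library host C4 under "allenai/c4";
# older versions also accepted the bare name "c4"
pretrained_dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)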
top_n = itertools.islice(pretrained_dataset, 1)
for i in top_n:
    print(i)

# Output — raw web text, no structure:
# {'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you
#  want to get better at making delicious BBQ? ...',
#  'timestamp': '2019-04-25T12:57:54Z', 'url': '...'}

Finetuning after pre-training

After pretraining, a model knows the world — but it isn't a chatbot yet. Finetuning takes that knowledge and shapes the behaviour. Some key points worth remembering:

Uses much less data than pretraining — quality matters far more than quantity
Can be self-supervised (unlabelled) or curated labelled pairs
Full finetuning updates the entire model, not just part of it (the parameter-efficient methods in section 11 are the exception)
Same objective as pretraining: next token prediction — just on different data

05 What Is Finetuning Actually Doing?

Finetuning drives two types of change inside the model, and it helps to be clear about which one you're targeting before you write a single line of training code.

Behaviour Change

Teach the model to respond more consistently, focus on specific topics (e.g. moderation), or tease out a capability it already has but doesn't show by default — like being better at conversation.

Knowledge Gain

Teach the model new domain-specific facts it wasn't trained on, or correct outdated or incorrect information baked in during pretraining.

Both

Most real-world finetuning does both — domain knowledge + expected output format and tone. A customer support bot, for example, needs domain facts AND a consistent helpful tone.

Tasks to finetune

All finetuning is ultimately text-in, text-out. Tasks break into two families:

Extraction (text in, less text out): "reading" tasks such as keyword extraction, topic classification, summarisation, routing, and agents that reason, plan, self-critique, or use tools
Expansion (text in, more text out): "writing" tasks such as conversation and code generation

Task clarity is the key indicator of success. Before writing any training code, make sure you can clearly articulate what "bad", "OK", and "better" look like for your specific task. Vague success criteria produce vague models.

First time finetuning — the practical path

1. Identify candidate tasks by prompt engineering a base LLM and watching what it does
2. Find tasks the LLM does OK at: you need a performance baseline to improve on
3. Pick one task: scope creep kills finetuning experiments
4. Get ~1000 input/output pairs for that task, each better than the "OK" output the base LLM gives
5. Finetune a small model (400M–1B parameters) and evaluate

06 Instruction Finetuning — GPT-3 → ChatGPT

Instruction finetuning is a specific subset of finetuning that teaches a model to follow instructions and behave like a chatbot. This is the technique that turned the raw completion engine of GPT-3 into ChatGPT — and scaled AI adoption from thousands of researchers to hundreds of millions of people.

Where does instruction data come from?

Existing datasets — FAQs, customer support threads, Slack messages — anything that's naturally instruction + response shaped
Convert your own data — take a README, internal docs, or product description and reformat as Q&A pairs using a prompt template
Use another LLM to generate it: the Alpaca technique prompted an OpenAI model (text-davinci-003) to generate instruction/response pairs automatically

python · alpaca dataset sample

from datasets import load_dataset

instruction_tuned_dataset = load_dataset(
    "tatsu-lab/alpaca", split="train", streaming=True
)

# Peek at the first example
print(next(iter(instruction_tuned_dataset)))

# Structure of one example:
# {
#   'instruction': 'Give three tips for staying healthy.',
#   'input': '',
#   'output': '1. Eat a balanced diet...\n2. Exercise regularly...\n3. Get enough sleep...'
# }

Prompt templates

How you format your data before training matters a lot. The two templates below are the ones used with the Alpaca dataset: one for tasks that include additional input context, one for those that don't.

python · prompt templates

prompt_template_with_input = """Below is an instruction that describes \
a task, paired with an input that provides further context. \
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

prompt_template_without_input = """Below is an instruction that describes \
a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""
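Here's a minimal sketch of how the two templates might be applied to a record shaped like the Alpaca sample above. build_prompt is a hypothetical helper of my own, and the templates are restated inside the block only so that it runs on its own.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)
PROMPT_WITHOUT_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)


def build_prompt(example):
    """Pick the template based on whether the record carries extra input."""
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    return PROMPT_WITHOUT_INPUT.format(instruction=example["instruction"])


example = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet...",
}
prompt = build_prompt(example)
print(prompt)
```

During training, the example's output is appended after the "### Response:" marker; at inference the model is given only the prompt and generates the rest.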

Finetuning steps at a glance

Instruction finetuning is an iterative cycle — not a one-shot process:

Data Prep

Collect pairs, format, tokenise, split

Training

Forward pass, loss, backprop, update weights

Evaluation

Human review, benchmarks, error analysis

If evaluation reveals problems, you go back to data prep — more examples, better quality, different formatting — and run the cycle again.

07 Data Preparation

Garbage in, garbage out — nowhere is this truer than in finetuning. Data quality trumps data quantity almost every time.

✅ Higher Quality

Errors in your data become errors in your model. Every example should be one you'd be proud of the model replicating.

🌈 Diversity

Low diversity leads to memorisation, not generalisation. Vary phrasing, topics, lengths.

📋 Real

Real examples from your actual use-case outperform synthetic data, especially for writing tasks.

📈 More (but less critical)

More data helps, but it's the fourth priority. Fix quality and diversity first.

Steps to data preparation

1. Collect instruction-response pairs
2. Concatenate pairs and add a prompt template where applicable
3. Tokenise: pad short sequences, truncate long ones
4. Split into train and test sets
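Steps 2 and 4 above can be sketched in plain Python. This is only an illustration: the toy pairs, the question/answer template, and the helper names are all hypothetical, and tokenisation (step 3) is left out.

```python
import random

# Hypothetical toy pairs; in practice these come from your own data.
pairs = [
    {"question": f"Question {i}?", "answer": f"Answer {i}."}
    for i in range(10)
]

# Assumed template for illustration only.
TEMPLATE = "### Question:\n{question}\n\n### Answer:\n"


def to_training_text(pair):
    """Concatenate the templated prompt with its response."""
    return TEMPLATE.format(question=pair["question"]) + pair["answer"]


def train_test_split(items, test_fraction=0.1, seed=123):
    """Shuffle a copy with a fixed seed, then hold out a test slice."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


texts = [to_training_text(p) for p in pairs]
train, test = train_test_split(texts, test_fraction=0.1)
print(len(train), "train /", len(test), "test")
```

Fixing the seed keeps the split reproducible, so later runs are evaluated on the same held-out examples.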

Tokenisation

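Tokenisation converts human-readable strings into the integer sequences the model actually sees. Always use the tokenizer that was paired with your base model: each tokenizer maps text to its own set of integer IDs, so a mismatched tokenizer feeds the model IDs that mean something different from what it learned, during training and at inference alike.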

python · tokenization with HuggingFace

from transformers import AutoTokenizer

# AutoTokenizer picks the tokenizer that matches the model name
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/pythia-70m')

# Encode, then decode back to the original string
text = 'Hi!! How are you?'
encoded = tokenizer(text)['input_ids']
decoded = tokenizer.decode(encoded)
print(encoded, decoded)

# A small batch of texts with different lengths
list_texts = ['Hi, how are you?', "I'm good", 'Yes']

# Padding: set the pad token to the eos token
tokenizer.pad_token = tokenizer.eos_token
encoded_padded = tokenizer(list_texts, padding=True)

# Truncation: gets rid of everything on the right by default
encoded_truncated = tokenizer(list_texts, max_length=3, truncation=True)

# Left-side truncation: retains the right side (end of sequence)
tokenizer.truncation_side = "left"
encoded_left = tokenizer(list_texts, max_length=3, truncation=True)
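To see what padding and the two truncation sides do to a batch, here's a plain-Python sketch on made-up token IDs. It only mimics the behaviour described above; it is not how the HuggingFace tokenizer is implemented, and pad_and_truncate is a hypothetical helper.

```python
def pad_and_truncate(sequences, max_length, pad_id=0, truncation_side="right"):
    """Force every sequence of token IDs to exactly max_length."""
    batch = []
    for ids in sequences:
        if len(ids) > max_length:
            if truncation_side == "right":
                ids = ids[:max_length]    # drop tokens from the end
            else:
                ids = ids[-max_length:]   # drop tokens from the start
        # Pad on the right so every row has the same length
        padded = ids + [pad_id] * (max_length - len(ids))
        batch.append(padded)
    return batch


# Made-up token IDs, not real tokenizer output
toy_ids = [[11, 12, 13, 14, 15], [21, 22], [31]]
right = pad_and_truncate(toy_ids, max_length=3)
left = pad_and_truncate(toy_ids, max_length=3, truncation_side="left")
print(right)
print(left)
```

Left-side truncation is the useful one when the end of the sequence matters most, for example when the question sits at the end of a long prompt.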

Storing your data

Save as JSONL files — one JSON object per line, easy to stream
Upload to HuggingFace Datasets for sharing and reuse

python · save and load finetuning dataset

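import jsonlines

# finetuning_dataset_question_answer is assumed to be a list of
# {'question': ..., 'answer': ...} dicts built during data preparation
with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)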

# Or load directly from HuggingFace
from datasets import load_dataset
finetuning_dataset = load_dataset("lamini/lamini_docs")
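If you'd rather not depend on the jsonlines package, the same format can be written and streamed with the standard library alone. A minimal sketch, using a temporary directory and two made-up records; stream_jsonl is a hypothetical helper.

```python
import json
import os
import tempfile

# Made-up records for illustration
records = [
    {"question": "What is finetuning?", "answer": "Specialising a pre-trained model."},
    {"question": "What is JSONL?", "answer": "One JSON object per line."},
]

path = os.path.join(tempfile.mkdtemp(), "toy_finetuning_data.jsonl")

# Write: one JSON object per line
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def stream_jsonl(file_path):
    """Yield one record at a time so large files need not fit in memory."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


loaded = list(stream_jsonl(path))
print(loaded[0]["question"])
```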

08 Training

The training loop itself is standard gradient descent — the same as any neural network. What's different is the data you feed it and the starting point (a pre-trained model rather than random weights).

1. Feed a batch of training data into the model
2. Predict the next token at each position
3. Calculate loss: cross-entropy between prediction and ground truth
4. Backpropagate through the entire model
5. Update weights via the optimizer
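Step 3 is easier to reason about with numbers. Here's a small worked sketch of the cross-entropy loss for a single next-token prediction, in plain Python. The 4-token vocabulary and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target_index])


# Toy vocabulary of 4 tokens; the correct next token is index 2.
uninformed = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], target_index=2)
print(round(uninformed, 4), round(confident, 4))
```

A model that spreads its probability evenly over 4 tokens pays ln(4) per prediction; the more probability it puts on the right token, the closer the loss gets to zero. Training averages this quantity over every position in the batch.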

Key hyperparameters

Learning rate — how large each gradient step is
Learning rate scheduler (LRS) — controls how the learning rate decays over time
Optimizer hyperparameters — e.g. beta values in AdamW
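As an illustration of what a learning rate scheduler does, here's a sketch of one common shape: linear warmup followed by linear decay to zero. The step counts and the peak learning rate are arbitrary values I picked for the example, not recommendations.

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_steps):
    """Learning rate at a given step: ramp up linearly, then decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Arbitrary illustrative settings
schedule = [
    linear_warmup_decay(s, total_steps=10, peak_lr=1e-5, warmup_steps=2)
    for s in range(10)
]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.2e}")
```

Warmup avoids taking large steps while the optimizer statistics are still noisy; the decay lets the model settle as training ends.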

python · training loop

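for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)    # forward pass
        loss = outputs.loss         # cross-entropy on next-token predictions
        loss.backward()             # backpropagate
        optimizer.step()            # update weights
        optimizer.zero_grad()       # clear gradients before the next batch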

Lamini in 3 lines

If you want to skip the boilerplate and get to results fast, Lamini's library wraps the full training loop into three lines:

python · lamini shortcut

from llama import BasicModelRunner

model = BasicModelRunner("EleutherAI/pythia-410m")
model.load_data_from_jsonlines("lamini_docs.jsonl", input_key="question", output_key="answer")
model.train(is_public=True)

09 Evaluation

Evaluating generative models is genuinely hard. Unlike classification, there's no clean accuracy metric. Human expert evaluation remains the most reliable method, but it is expensive and slow, hence the need for proxy benchmarks.

Good test data must be: high quality, accurate, generalised, and not seen in the training data. If your test set leaks into training, you'll get misleadingly good numbers.
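A quick leak check is cheap to run before trusting any number. This sketch flags test questions that also appear in the training set after light normalisation. The example questions and helper names are hypothetical, and exact matching like this will not catch paraphrased duplicates.

```python
def normalise(text):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def find_leaks(train_questions, test_questions):
    """Return test questions that also appear (after normalising) in train."""
    seen = {normalise(q) for q in train_questions}
    return [q for q in test_questions if normalise(q) in seen]


# Hypothetical examples
train_questions = ["What is finetuning?", "How do I install the library?"]
test_questions = ["what is  Finetuning?", "Does the model support streaming?"]

leaks = find_leaks(train_questions, test_questions)
print(leaks)
```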

Elo-style comparisons are gaining traction — A/B tests or model tournaments where responses are ranked against each other rather than scored absolutely.
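For reference, here's a sketch of the standard Elo update applied to pairwise model comparisons. The starting rating of 1000, the K-factor of 32, and the sequence of outcomes are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return new (A, B) ratings; score_a is 1 for a win, 0.5 tie, 0 loss."""
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Hypothetical tournament: A's response is preferred twice, then B's once.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for score_a in [1.0, 1.0, 0.0]:
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], score_a
    )
print(ratings)
```

Because each update moves the two ratings by equal and opposite amounts, only relative rankings are meaningful, which is exactly the "ranked against each other rather than scored absolutely" property.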

LLM Benchmarks (by EleutherAI)

Benchmarks average multiple evaluation methods to give a comparable score across models:

ARC (science QA)

Grade school science multiple-choice questions that test basic reasoning and knowledge recall.

HellaSwag (common sense)

Tests commonsense NLI: given a partial sentence, can the model pick the correct continuation?

MMLU (multitask)

Massive Multitask Language Understanding: elementary maths, US history, CS, law, and more.

TruthfulQA (factuality)

Measures a model's propensity to reproduce falsehoods commonly found on the internet.

Important: finetuning on a domain-specific task will often lower ARC scores, because the model is being optimised for something ARC doesn't test. That's expected, not a failure. Scores on general benchmarks tend to hold up better if you also train on general instruction data such as Alpaca.

Error Analysis

Before finetuning, spend time understanding what the base model gets wrong. Categorise the failures so you can fix them in your data:

Misspellings — surprisingly common in base models
Too verbose — model answers with an essay when a sentence would do
Repetitive — curb with stop tokens; also ensure your training data doesn't repeat heavily
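Two of these categories can be flagged with rough heuristics before a human looks at anything. The sketch below tags responses as too verbose or repetitive; the thresholds are hypothetical and need tuning on your own outputs, and misspellings are left out because they need a dictionary or spellchecker.

```python
from collections import Counter


def tag_failures(response, max_words=60, ngram=3, max_repeats=2):
    """Attach coarse failure tags to one model response.

    Thresholds are hypothetical; tune them on your own outputs.
    """
    words = response.lower().split()
    tags = []
    if len(words) > max_words:
        tags.append("too_verbose")
    # Count how often each run of `ngram` consecutive words occurs
    ngrams = [tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    if ngrams and max(Counter(ngrams).values()) > max_repeats:
        tags.append("repetitive")
    return tags


# Made-up responses for illustration
responses = [
    "Yes, you can.",
    "you can do it " * 5,
]
tally = Counter(tag for r in responses for tag in tag_failures(r))
print(tally)
```

A tally like this tells you which category to attack first in the training data; it doesn't replace reading the outputs.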

10 Practical Approach to Finetuning

This is the workflow I'm keeping close to hand. It's deliberately conservative — start small, validate fast, scale only when you have evidence it's working.

1. Define your task: be precise about inputs, outputs, and what "good" looks like
2. Collect data related to those inputs and outputs
3. Generate synthetic data with a larger LLM if you don't have enough real examples
4. Finetune a small model (400M–1B parameters) for fast feedback at low cost
5. Vary the data volume to learn how sensitive your task is to data quantity
6. Evaluate rigorously: human review + relevant benchmarks
7. Collect more data and iterate if performance isn't satisfactory
8. Increase task complexity gradually
9. Scale model size only for genuinely complex tasks; bigger isn't always better

Task complexity vs model size

Writing tasks are harder — they require more output tokens, and typically need larger models
Harder tasks + multiple simultaneous tasks require larger models

Model size and GPU memory

Training needs far more memory than inference. In the reference table below, a single 16GB GPU can serve a ~7B parameter model for inference but train only a ~1B parameter one:

AWS Instance	GPUs	Total GPU memory	Max inference (params)	Max training (params)
p3.2xlarge	1× V100	16GB	7B	1B
p3.8xlarge	4× V100	64GB	7B	1B
p3.16xlarge	8× V100	128GB	7B	1B
p3dn.24xlarge	8× V100	256GB	14B	2B
p4d.24xlarge	8× A100	320GB HBM2	18B	2.5B
p4de.24xlarge	8× A100	640GB HBM2e	32B	5B
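The gap between inference and training can be roughed out with arithmetic. The sketch below is my own rule-of-thumb estimate, not the source of the table's numbers: it assumes 2 bytes per parameter for 16-bit inference and about 16 bytes per parameter for mixed-precision training with an Adam-style optimizer, and it ignores activations and framework overhead.

```python
GIB = 1024 ** 3


def inference_memory_gib(n_params, bytes_per_param=2):
    """Weights only, assuming 16-bit storage (2 bytes per parameter)."""
    return n_params * bytes_per_param / GIB


def training_memory_gib(n_params, bytes_per_param=16):
    """Weights, gradients, and Adam state under mixed precision.

    Rule-of-thumb assumption: 2 (fp16 weights) + 2 (fp16 gradients)
    + 12 (fp32 master weights, momentum, variance) = 16 bytes/param.
    Activations and framework overhead are NOT included.
    """
    return n_params * bytes_per_param / GIB


for n_params in (1e9, 7e9):
    print(
        f"{n_params / 1e9:.0f}B params: "
        f"inference ~{inference_memory_gib(n_params):.1f} GiB, "
        f"training ~{training_memory_gib(n_params):.1f} GiB"
    )
```

Under these assumptions a 7B model just fits a 16GB card for inference while a 1B model roughly fills it for training, which is in line with the first row of the table.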

11 LoRA & Parameter-Efficient Finetuning (PEFT)

Full finetuning updates every weight in the model — which is powerful but expensive. Parameter-Efficient Finetuning (PEFT) techniques train only a fraction of the parameters, dramatically reducing memory requirements while keeping most of the accuracy benefit.

🧬 Three families of PEFT

Addition-based — add new layers/adapters and train only those.
Selection-based — freeze most weights, select a sparse subset to update.
Reparametrisation-based (LoRA) — decompose weight updates into low-rank matrices and train those instead.

LoRA — Low-Rank Adaptation

LoRA is the most widely used PEFT technique. Instead of modifying the original weights directly, it trains a pair of small matrices whose product approximates the weight update. At inference, these are merged back into the main weights — so there's no additional latency compared to the full finetuned model.

Freeze the main weights and train new low-rank matrices instead
The new weights are rank decomposition matrices of the change in the original weights
For GPT-3, trainable parameters are reduced by 10,000× vs full finetuning
Accuracy is slightly below full finetuning, an acceptable tradeoff in most cases
Same inference latency: the LoRA weights are trained separately, then merged with the base weights at inference time
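The savings and the merge step are both easy to check by hand. This sketch counts trainable parameters for a single weight matrix and then folds a rank-1 update into a tiny frozen weight. The 4096 × 4096 layer size and rank 8 are illustrative choices of mine, not GPT-3's configuration, and the helper names are hypothetical.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full weight update vs. a rank-r factorisation."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, lora


def matmul(x, y):
    """Plain-Python matrix product for small nested lists."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
        for row in x
    ]


def merge(weight, b, a, scale=1.0):
    """Fold the low-rank update into the frozen weight: W + scale * (B @ A)."""
    delta = matmul(b, a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")

# Tiny 2x2 example with rank 1: B is 2x1, A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.5]]
merged = merge(W, B, A)
print(merged)
```

After merging, the model has the same shape and cost as the original, which is why a merged LoRA model adds no inference latency. Keeping B and A separate instead is what makes swapping adapters cheap.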

When to use LoRA: any time you want to finetune a model larger than your GPU can accommodate for full finetuning. It's also great for rapid experimentation — train multiple LoRA adapters for different tasks and swap them at inference time without keeping multiple full model copies.