[Banner image: base model + data, finetuned. Fine-tuning Large Language Models · DeepLearning.ai · Sharon Zhou · Lamini · July 2025]


Finetuning Large Language Models

My notes from the DeepLearning.ai short course by Sharon Zhou (Lamini). Covers what finetuning actually is, when it beats prompting and RAG, how to prepare data, run training, evaluate results, and use LoRA for parameter-efficient finetuning.

Finetuning LLMs PyTorch HuggingFace Lamini · Sharon Zhou · DeepLearning.ai · July 28, 2025

01 What is Finetuning?

Finetuning is the process of taking a general-purpose pre-trained model and specialising it for a specific domain or task. The analogy that Sharon uses — and it really clicked for me — is turning a primary care physician into a specialist.

🩺
The Physician Analogy
A base model is like a primary care physician. Describe skin irritation, redness, and itching — it'll say "probably acne." A model finetuned on dermatology data gives you: "You have a mix of non-inflammatory comedonal acne and inflammatory papulopustular acne." Same question, radically more specific answer.

What does finetuning actually do for the model?

  • Steers the model toward more consistent output
  • Reduces hallucinations on domain-specific content
  • Customises the model to a specific use-case and voice
  • Uses the same training objective as the original pre-training — just with different data

02 Prompting vs Finetuning vs RAG

Before reaching for finetuning, it's worth knowing exactly where it sits relative to prompt engineering and retrieval-augmented generation (RAG). They're not mutually exclusive — in fact, you can and often should combine them.

Prompting
Pros
  • No data needed to start
  • Smaller upfront cost
  • No technical knowledge
  • Can connect data via RAG
Cons
  • Much less data fits in context
  • Forgets data between sessions
  • Hallucinations
  • RAG can miss or return wrong data
Usage
  • Generic, side projects, prototypes

Finetuning
Pros
  • Nearly unlimited data fits
  • Learns new information
  • Corrects incorrect information
  • Lower cost per request (smaller model)
  • Can use RAG too
Cons
  • Requires high-quality labelled data
  • Upfront compute cost
  • Needs some technical knowledge
Usage
  • Domain-specific, enterprise, production, privacy-sensitive

The mental model I use: prompting is for exploration, RAG is for connecting live external data, and finetuning is for when you need a model that reliably behaves a certain way — every single time.

03 Why Finetune Your Own LLM?

The case for finetuning goes well beyond accuracy. Once you have a finetuned model, you gain control across four dimensions that matter a lot in production.

⚡ Performance
  • Reduces hallucinations on domain content
  • Increases consistency and reliability
  • Reduces unwanted or off-topic output
🔒 Privacy
  • Deploy on-prem or in your own VPC
  • Prevent data leakage to third-party APIs
  • Lower risk of data breaches, since no data leaves via external calls
💰 Cost
  • Lower cost per request — finetune a smaller model that matches a larger one's task performance
  • Greater transparency into what you're running
  • Greater control over model behaviour
🛡️ Reliability
  • Control your own uptime SLAs
  • Lower latency — no remote API calls
  • Moderation baked in — guardrails and custom responses

Tools

  • PyTorch — the standard for custom training loops
  • HuggingFace Transformers — open source models, tokenizers, datasets
  • Lamini (imported as the llama package): abstracts away boilerplate for fast iteration
python · setup
import os
import lamini
from llama import BasicModelRunner

lamini.api_url = os.getenv("POWERML__PRODUCTION__URL")
lamini.api_key = os.getenv("POWERML__PRODUCTION__KEY")

# Base model: no instruction tuning
non_finetuned = BasicModelRunner("meta-llama/Llama-2-7b-hf")
non_finetuned_output = non_finetuned("Tell me how to train my dog to sit")

# Finetuned (chat) model
finetuned_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")
finetuned_output = finetuned_model("Tell me how to train my dog to sit")

# Wrap with [INST] tags to avoid autocomplete behaviour
finetuned_model("[INST]Tell me how to train my dog to sit[/INST]")

04 Pretraining vs Finetuning Data

Understanding the difference between pretraining and finetuning data is fundamental — they serve completely different purposes and look nothing alike.

Pretraining

A model starts life with zero knowledge: it can't form words and knows nothing about the world. Pretraining teaches it language and knowledge through sheer scale.

  • Next-word prediction on a giant corpus of text
  • Often scraped from the internet — "unlabelled" data
  • Open-source example: "The Pile" — 22 diverse datasets from the internet
  • Expensive and time-consuming — self-supervised learning at massive scale
python · pretraining data sample (C4)
from datasets import load_dataset
import itertools

# Note: newer versions of the datasets library may need the "allenai/c4" repo id instead of "c4"
pretrained_dataset = load_dataset("c4", "en", split="train", streaming=True)

# Streaming avoids downloading the whole corpus; take just the first example
top_n = itertools.islice(pretrained_dataset, 1)
for example in top_n:
    print(example)

# Output: raw web text, no structure:
# {'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you
#  want to get better at making delicious BBQ? ...',
#  'timestamp': '2019-04-25T12:57:54Z', 'url': '...'}

Finetuning after pre-training

After pretraining, a model knows the world — but it isn't a chatbot yet. Finetuning takes that knowledge and shapes the behaviour. Some key points worth remembering:

  • Uses much less data than pretraining — quality matters far more than quantity
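  • Data can be unlabelled (self-supervised) or curated, labelled pairs
  • Full finetuning updates the entire model, not just part of it (parameter-efficient methods such as LoRA, covered in section 11, are the exception)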
  • Same objective as pretraining: next token prediction — just on different data
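
To make "next token prediction" concrete, here is a minimal sketch of how one tokenised sequence turns into training targets. The token IDs are made up for illustration, and `next_token_pairs` is a hypothetical helper, not something from the course or any library.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs: each prefix predicts the token after it."""
    pairs = []
    for i in range(1, len(token_ids)):
        pairs.append((token_ids[:i], token_ids[i]))
    return pairs


# Hypothetical token IDs standing in for a short tokenised sentence.
tokens = [12, 7, 99, 3]
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(context, "->", target)
```

The same pairing applies whether the text is scraped web data or a curated question/answer example, which is why only the data changes between pretraining and finetuning.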

05 What Is Finetuning Actually Doing?

Finetuning drives two types of change inside the model, and it helps to be clear about which one you're targeting before you write a single line of training code.

Behaviour Change
Teach the model to respond more consistently, focus on specific topics (e.g. moderation), or tease out a capability it already has but doesn't show by default — like being better at conversation.
Knowledge Gain
Teach the model new domain-specific facts it wasn't trained on, or correct outdated or incorrect information baked in during pretraining.
Both
Most real-world finetuning does both — domain knowledge + expected output format and tone. A customer support bot, for example, needs domain facts AND a consistent helpful tone.

Tasks to finetune

All finetuning is ultimately text-in, text-out. Tasks break into two families:

  • Extraction (text in, less text out): reading, keyword extraction, topic classification, summarisation, routing, agents that reason, plan, self-critique, or use tools
  • Expansion (text in, more text out): writing, conversation, code generation
Task clarity is the key indicator of success. Before writing any training code, make sure you can clearly articulate what "bad", "OK", and "better" look like for your specific task. Vague success criteria produce vague models.

First time finetuning — the practical path

  1. Identify candidate tasks by prompt engineering a base LLM and watching what it does
  2. Find tasks the LLM does OK at, since you need a performance baseline to improve on
  3. Pick one task, because scope creep kills finetuning experiments
  4. Get ~1000 input/output pairs for that task, better than the "OK" output the base LLM gives
  5. Finetune a small model (400M–1B parameters) and evaluate

06 Instruction Finetuning — GPT-3 → ChatGPT

Instruction finetuning is a specific subset of finetuning that teaches a model to follow instructions and behave like a chatbot. This is the technique that turned the raw completion engine of GPT-3 into ChatGPT — and scaled AI adoption from thousands of researchers to hundreds of millions of people.

Where does instruction data come from?

  • Existing datasets — FAQs, customer support threads, Slack messages — anything that's naturally instruction + response shaped
  • Convert your own data — take a README, internal docs, or product description and reformat as Q&A pairs using a prompt template
  • Use another LLM to generate it: Stanford's Alpaca dataset was produced by prompting an OpenAI model (text-davinci-003) for instruction/response pairs, and the same idea works for turning raw text into pairs automatically
python · alpaca dataset sample
from datasets import load_dataset

instruction_tuned_dataset = load_dataset(
    "tatsu-lab/alpaca", split="train", streaming=True
)

# Structure of one example:
# {
#   'instruction': 'Give three tips for staying healthy.',
#   'input': '',
#   'output': '1. Eat a balanced diet...\n2. Exercise regularly...\n3. Get enough sleep...'
# }

Prompt templates

How you format your data before training matters a lot. There are two standard templates — one for tasks that include additional input context, one for those that don't.

python · prompt templates
prompt_template_with_input = """Below is an instruction that describes \
a task, paired with an input that provides further context. \
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

prompt_template_without_input = """Below is an instruction that describes \
a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""
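
A short sketch of how the two templates might be applied to a dataset row. It assumes rows shaped like the Alpaca sample (`instruction`, `input`, `output`); `build_prompt` and `build_training_text` are hypothetical helper names, and the templates are restated so the block runs on its own.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)
PROMPT_WITHOUT_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)


def build_prompt(example):
    """Pick the template based on whether the row has a non-empty input."""
    if example.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    return PROMPT_WITHOUT_INPUT.format(instruction=example["instruction"])


def build_training_text(example):
    """Prompt followed by the expected output: the text used for training."""
    return build_prompt(example) + "\n" + example["output"]


row = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet...",
}
prompt = build_prompt(row)
training_text = build_training_text(row)
print(training_text)
```

At inference time only `build_prompt` is needed; the model is expected to continue the text after "### Response:".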

Finetuning steps at a glance

Instruction finetuning is an iterative cycle — not a one-shot process:

Data Prep
Collect pairs, format, tokenise, split
Training
Forward pass, loss, backprop, update weights
Evaluation
Human review, benchmarks, error analysis

If evaluation reveals problems, you go back to data prep — more examples, better quality, different formatting — and run the cycle again.

07 Data Preparation

Garbage in, garbage out — nowhere is this truer than in finetuning. Data quality trumps data quantity almost every time.

✅ Higher Quality
  • Errors in your data become errors in your model. Every example should be one you'd be proud of the model replicating.
🌈 Diversity
  • Low diversity leads to memorisation, not generalisation. Vary phrasing, topics, lengths.
📋 Real
  • Real examples from your actual use-case outperform synthetic data, especially for writing tasks.
📈 More (but less critical)
  • More data helps, but it's the fourth priority. Fix quality and diversity first.

Steps to data preparation

  1. Collect instruction-response pairs
  2. Concatenate pairs and add a prompt template where applicable
  3. Tokenise: pad short sequences, truncate long ones
  4. Split into train and test sets
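
The final step can be sketched with the standard library alone. This is a minimal illustration, assuming the pairs are a list of dicts; `train_test_split` here is a hypothetical helper (not the scikit-learn or HuggingFace function), and the 10% test fraction is an arbitrary choice.

```python
import random


def train_test_split(examples, test_fraction=0.1, seed=42):
    """Shuffle with a fixed seed, then hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder pairs; real data would come from the collection step.
pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(20)]
train, test = train_test_split(pairs, test_fraction=0.1)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so evaluation numbers stay comparable between runs.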

Tokenisation

Tokenisation converts human-readable strings into the integer sequences the model actually sees. Always use the tokenizer that was paired with your base model: a mismatched tokenizer maps text to different token IDs than the ones the model was trained on, so the inputs it receives are effectively scrambled.

python · tokenization with HuggingFace
from transformers import AutoTokenizer

# AutoTokenizer finds the right tokenizer for the given model name
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/pythia-70m')

# Encode → decode round-trip
text = 'Hi!! How are you?'
encoded = tokenizer(text)['input_ids']
decoded = tokenizer.decode(encoded)

# A small batch of texts with different lengths (illustrative)
list_texts = ["Hi, how are you?", "I'm good", "Yes"]

# Padding: set pad token to eos token
tokenizer.pad_token = tokenizer.eos_token
encoded_padded = tokenizer(list_texts, padding=True)

# Truncation: gets rid of everything on the right by default
encoded_truncated = tokenizer(list_texts, max_length=3, truncation=True)

# Left-side truncation: retains the right side (end of sequence)
tokenizer.truncation_side = "left"
encoded_left = tokenizer(list_texts, max_length=3, truncation=True)
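
To see what padding and truncation do to the integer sequences themselves, here is a library-free sketch. The token IDs and the pad ID of 0 are made up, and `pad_and_truncate` is a hypothetical helper; a real tokenizer also returns an attention mask so the model can ignore the padding.

```python
def pad_and_truncate(sequences, max_length, pad_id, truncation_side="right"):
    """Force every sequence to exactly max_length tokens (padding on the right)."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    batch = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > max_length:
            if truncation_side == "right":
                seq = seq[:max_length]   # keep the start, drop the end
            else:
                seq = seq[-max_length:]  # keep the end, drop the start
        batch.append(seq + [pad_id] * (max_length - len(seq)))
    return batch


token_batches = [[5, 6, 7, 8, 9], [5, 6], [5]]
right = pad_and_truncate(token_batches, max_length=3, pad_id=0)
left = pad_and_truncate(token_batches, max_length=3, pad_id=0, truncation_side="left")
print(right)
print(left)
```

Left-side truncation is worth considering when the important part of an example (for instance the answer) sits at the end of the sequence.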

Storing your data

  • Save as JSONL files — one JSON object per line, easy to stream
  • Upload to HuggingFace Datasets for sharing and reuse
python · save and load finetuning dataset
import jsonlines
from datasets import load_dataset

# finetuning_dataset_question_answer: the list of question/answer dicts built during data prep
with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)

# Or load directly from HuggingFace
finetuning_dataset = load_dataset("lamini/lamini_docs")
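
If the `jsonlines` package isn't available, the same format is easy to produce with the standard library, since JSONL is just one `json.dumps` per line. A minimal sketch with placeholder records and a temporary file path; `write_jsonl` and `read_jsonl` are hypothetical helper names.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Stream records back one line at a time, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Placeholder records; real ones would be your question/answer pairs.
records = [
    {"question": "example question 1", "answer": "example answer 1"},
    {"question": "example question 2", "answer": "example answer 2"},
]
path = os.path.join(tempfile.mkdtemp(), "example_pairs.jsonl")
write_jsonl(path, records)
loaded = list(read_jsonl(path))
print(loaded)
```

Because `read_jsonl` is a generator, large files can be processed without loading everything into memory.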

08 Training

The training loop itself is standard gradient descent — the same as any neural network. What's different is the data you feed it and the starting point (a pre-trained model rather than random weights).

  1. Feed a batch of training data into the model
  2. Predict the next token at each position
  3. Calculate loss: cross-entropy between prediction and ground truth
  4. Backpropagate through the entire model
  5. Update weights via the optimizer
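
Step 3 can be worked through by hand. The sketch below computes cross-entropy over a toy vocabulary of four tokens in plain Python; the logits are invented for illustration, and a real framework does the same arithmetic on tensors.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_position, target_ids):
    """Average negative log-probability assigned to the correct next token."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# A model that has learned nothing: equal scores for all 4 tokens.
uniform_loss = cross_entropy([[0.0] * 4, [0.0] * 4], [1, 3])

# A model that is confident and correct at one position.
confident_loss = cross_entropy([[10.0, 0.0, 0.0, 0.0]], [0])

print(uniform_loss, confident_loss)
```

A uniform guess over a vocabulary of size V gives a loss of ln(V), which is a handy sanity check for the first few training steps.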

Key hyperparameters

  • Learning rate — how large each gradient step is
  • Learning rate scheduler (LRS): controls how the learning rate changes over training (often a warm-up followed by decay)
  • Optimizer hyperparameters — e.g. beta values in AdamW
python · training loop
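# Assumes model, optimizer, train_dataloader and num_epochs are already defined
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()     # clear gradients left over from the previous step
        outputs = model(**batch)  # forward pass; batch includes labels, so a loss is returned
        loss = outputs.loss
        loss.backward()           # backpropagate
        optimizer.step()          # update weights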
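
As an illustration of the scheduler hyperparameter, here is one common shape: linear warm-up followed by linear decay. This is a sketch under assumed values (peak rate, step counts), not the schedule used in the course; `linear_warmup_decay` is a hypothetical helper.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Ramp from 0 to peak_lr over warmup_steps, then decay linearly to 0."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Assumed values for illustration: 10 steps, 2 of them warm-up.
schedule = [linear_warmup_decay(s, total_steps=10, warmup_steps=2, peak_lr=1e-3)
            for s in range(11)]
for step, lr in enumerate(schedule):
    print(step, lr)
```

In a real loop the scheduler is stepped once after each `optimizer.step()`, so the learning rate used for each batch follows this curve.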

Lamini in 3 lines

If you want to skip the boilerplate and get to results fast, Lamini's library wraps the full training loop into three lines:

python · lamini shortcut
from llama import BasicModelRunner

model = BasicModelRunner("EleutherAI/pythia-410m")
model.load_data_from_jsonlines("lamini_docs.jsonl", input_key="question", output_key="answer")
model.train(is_public=True)

09 Evaluation

Evaluating generative models is genuinely hard. Unlike classification, there's no clean accuracy metric. Human expert evaluation remains the most reliable method, which is expensive and slow — hence the need for proxy benchmarks.

Good test data must be: high quality, accurate, generalised, and not seen in the training data. If your test set leaks into training, you'll get misleadingly good numbers.
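
A quick way to catch the most obvious kind of leakage is to look for test questions that also appear in the training set. The sketch below only finds exact matches after light normalisation (case and whitespace); paraphrased duplicates would slip through. The `question` key and the helper names are assumptions for illustration.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_leaks(train_examples, test_examples, key="question"):
    """Return the test examples whose key text also appears in training."""
    seen = {normalise(ex[key]) for ex in train_examples}
    return [ex for ex in test_examples if normalise(ex[key]) in seen]


# Placeholder data for illustration.
train_set = [{"question": "How do I reset my password?"},
             {"question": "What is the refund policy?"}]
test_set = [{"question": "how do I reset  my password?"},
            {"question": "Can I change my email address?"}]

leaks = find_leaks(train_set, test_set)
print(leaks)
```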

Elo-style comparisons are gaining traction — A/B tests or model tournaments where responses are ranked against each other rather than scored absolutely.
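
For reference, this is the standard Elo update applied to a single head-to-head comparison between two models. The starting rating of 1000 and K-factor of 32 are conventional but arbitrary choices, not values from the course.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if A's response was preferred, 0.0 if B's, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level; a reviewer prefers model A's response.
model_a, model_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(model_a, model_b)
```

Repeating the update over many pairwise judgements produces a ranking without anyone having to assign absolute scores.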

LLM Benchmarks (EleutherAI evaluation harness)

EleutherAI's evaluation harness runs a suite of standard benchmarks; averaging the scores gives a number that can be compared across models:

ARC Science QA
Grade school science multiple-choice questions — tests basic reasoning and knowledge recall.
HellaSwag Common Sense
Tests commonsense NLI — given a partial sentence, can the model pick the correct continuation?
MMLU Multitask
Massive Multitask Language Understanding — elementary maths, US history, CS, law, and more.
TruthfulQA Factuality
Measures a model's propensity to reproduce falsehoods commonly found on the internet.
Important: finetuning on a domain-specific task will often drop ARC scores, because the model is being optimised for something ARC doesn't test. That's expected, not a failure. ARC scores tend to recover if you also train on general instruction data like Alpaca.

Error Analysis

Before finetuning, spend time understanding what the base model gets wrong. Categorise the failures so you can fix them in your data:

  • Misspellings — surprisingly common in base models
  • Too verbose — model answers with an essay when a sentence would do
  • Repetitive — curb with stop tokens; also ensure your training data doesn't repeat heavily

10 Practical Approach to Finetuning

This is the workflow I'm keeping close to hand. It's deliberately conservative — start small, validate fast, scale only when you have evidence it's working.

  1. Define your task: be precise about inputs, outputs, and what "good" looks like
  2. Collect data related to those inputs and outputs
  3. Generate synthetic data with a larger LLM if you don't have enough real examples
  4. Finetune a small model (400M–1B parameters) for fast feedback at low cost
  5. Vary the data volume to learn how sensitive your task is to data quantity
  6. Evaluate rigorously: human review + relevant benchmarks
  7. Collect more data and iterate if performance isn't satisfactory
  8. Increase task complexity gradually
  9. Scale model size only for genuinely complex tasks, since bigger isn't always better

Task complexity vs model size

  • Writing tasks are harder — they require more output tokens, and typically need larger models
  • Harder tasks + multiple simultaneous tasks require larger models

Model size and GPU memory

Training needs far more memory than inference. A single 16GB GPU can run inference on a ~7B parameter model but can only train a ~1B parameter one. Here's a quick reference:

AWS Instance     GPUs      GPU Memory     Max inference (params)   Max training (params)
p3.2xlarge       1× V100   16GB           7B                       1B
p3.8xlarge       4× V100   64GB           7B                       1B
p3.16xlarge      8× V100   128GB          7B                       1B
p3dn.24xlarge    8× V100   256GB          14B                      2B
p4d.24xlarge     8× A100   320GB HBM2     18B                      2.5B
p4de.24xlarge    8× A100   640GB HBM2e    32B                      5B
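
A rough back-of-envelope calculation helps explain the gap between the inference and training columns. The byte counts below are a common rule of thumb and an assumption on my part, not figures from the course: 2 bytes per parameter for fp16 inference, and about 16 bytes per parameter for mixed-precision training with an Adam-style optimizer (fp16 weights and gradients plus fp32 master weights, momentum and variance). Activations and batch size are ignored, so real usage is higher.

```python
BYTES_PER_PARAM_INFERENCE = 2   # assumption: fp16 weights only
BYTES_PER_PARAM_TRAINING = 16   # assumption: fp16 weights + grads, fp32 Adam states


def inference_gb(n_params):
    """Approximate memory (GB) to hold the weights for inference."""
    return n_params * BYTES_PER_PARAM_INFERENCE / 1e9


def training_gb(n_params):
    """Approximate memory (GB) for weights, gradients and optimizer state."""
    return n_params * BYTES_PER_PARAM_TRAINING / 1e9


print("7B inference:", inference_gb(7e9), "GB")
print("1B training: ", training_gb(1e9), "GB")
```

Under these assumptions both cases land at roughly the 16GB mark, which lines up with the first row of the table.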

11 LoRA & Parameter-Efficient Finetuning (PEFT)

Full finetuning updates every weight in the model — which is powerful but expensive. Parameter-Efficient Finetuning (PEFT) techniques train only a fraction of the parameters, dramatically reducing memory requirements while keeping most of the accuracy benefit.

🧬
Three families of PEFT
Addition-based — add new layers/adapters and train only those.
Selection-based — freeze most weights, select a sparse subset to update.
Reparametrisation-based (LoRA) — decompose weight updates into low-rank matrices and train those instead.

LoRA — Low-Rank Adaptation

LoRA is the most widely used PEFT technique. Instead of modifying the original weights directly, it trains a pair of small matrices whose product approximates the weight update. At inference, these are merged back into the main weights — so there's no additional latency compared to the full finetuned model.

  • For GPT-3, trainable parameters reduced by 10,000× vs full finetuning
  • Accuracy is slightly below full finetuning — an acceptable tradeoff in most cases
  • Same inference latency — LoRA weights merge with base weights at inference
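  • Freeze the main weights and train new low-rank matrices alongside them
  • The new matrices are a rank decomposition of the change to the original weights (the update, not the weights themselves)
  • Adapters are trained separately and can be merged into the base weights for inference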
When to use LoRA: any time you want to finetune a model larger than your GPU can accommodate for full finetuning. It's also great for rapid experimentation — train multiple LoRA adapters for different tasks and swap them at inference time without keeping multiple full model copies.
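
The core idea fits in a few lines of plain Python. The sketch below uses a tiny 3×3 weight and rank-1 factors with made-up values to show two things: applying the adapter alongside the frozen weight gives the same output as merging it in, and the trainable parameter count shrinks from d_out × d_in to rank × (d_out + d_in). The alpha/rank scaling factor used in practice is left out for brevity, and the 4096 and rank-8 sizes are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Frozen base weight W and trainable low-rank factors B, A (rank 1).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [2.0], [0.0]]
A = [[0.5, 0.0, -1.0]]
x = [[1.0], [2.0], [3.0]]  # input as a column vector

# During training: base path plus adapter path, W itself never changes.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

# For deployment: fold the update into the weight once.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
print(4096 * 4096, "vs", lora_trainable_params(4096, 4096, rank=8))
```

Because the merged weight has the same shape as the original, the deployed model runs exactly as many operations as before, which is where the "no additional latency" property comes from.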