01 The Problem with Supervised Fine-tuning
For years, the default recipe for teaching an LLM a new skill was Supervised Fine-Tuning (SFT): assemble thousands of labelled prompt–response pairs, run forward and backward passes to minimise prediction error, repeat. It works well for classification, named entity recognition, and straightforward code generation.
But SFT has two structural weaknesses that become painful at scale. First, it requires thousands of high-quality labelled examples, which are difficult and expensive to collect, especially for reasoning-heavy tasks where even defining the "correct" answer is non-trivial. Second, it encourages overfitting: the model memorises patterns from the training distribution rather than developing generalisable reasoning strategies.
For simple tasks you can show the model a math problem and its final answer, and it will learn to generalise. For complex multi-step problems, you can include <think></think> reasoning traces alongside <answer></answer> tags to teach both output format and step-by-step reasoning simultaneously. But even then, generating high-quality reasoning traces at scale remains a significant bottleneck.
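As a rough illustration, a single training record with a reasoning trace might be assembled as follows. The helper `build_sft_example` and the `prompt`/`completion` keys are hypothetical names chosen for this sketch, not a specific library's schema; only the `<think>` and `<answer>` tags come from the text above.

```python
def build_sft_example(question: str, reasoning: str, answer: str) -> dict:
    """Return one prompt/completion pair with a tagged reasoning trace.

    Hypothetical helper: the dict keys are an assumed schema.
    """
    completion = (
        f"<think>{reasoning.strip()}</think>\n"
        f"<answer>{answer.strip()}</answer>"
    )
    return {"prompt": question.strip(), "completion": completion}


example = build_sft_example(
    question="What is 12 * 7?",
    reasoning="12 * 7 = 10 * 7 + 2 * 7 = 70 + 14 = 84.",
    answer="84",
)
print(example["completion"])
```

Writing the `reasoning` field by hand for thousands of such records is the bottleneck described above.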
02 Reinforcement Learning — The Core Idea
In Reinforcement Learning, an agent learns by interacting with an environment and optimising for a reward signal — rather than mimicking fixed labelled examples. The classic intuition: training a puppy. The puppy (agent) performs a trick (action), you give or withhold a treat (reward), and the puppy updates its behaviour based on what earned the treat.
Applied to LLMs, the loop looks like this: a prompt from the environment is fed to the model, which generates a token sequence as its action. That response is evaluated by a scoring function that produces a reward signal. The model weights are updated to maximise future rewards. The process repeats — on new examples or the same ones.
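The loop can be sketched as a small skeleton. The function `rl_loop` and its three callables (`generate`, `score`, `update`) are hypothetical placeholders: in a real system `generate` would sample from the model and `update` would change its weights, whereas here they are stubs so the sketch runs on its own.

```python
from typing import Callable, Iterable, List


def rl_loop(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    update: Callable[[str, str, float], None],
) -> List[float]:
    """Run one pass of the prompt -> response -> reward -> update cycle."""
    rewards = []
    for prompt in prompts:
        response = generate(prompt)        # action: the generated text
        reward = score(prompt, response)   # scoring function yields the reward
        update(prompt, response, reward)   # stand-in for a weight update
        rewards.append(reward)
    return rewards


# Stub components (assumptions for illustration only).
answers = {"2 + 2 = ?": "4", "3 + 3 = ?": "6"}
log = []
rewards = rl_loop(
    prompts=list(answers),
    generate=lambda prompt: "4",  # a "model" that always answers 4
    score=lambda prompt, response: 1.0 if answers[prompt] == response else 0.0,
    update=lambda prompt, response, reward: log.append((prompt, response, reward)),
)
print(rewards)  # [1.0, 0.0]
```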
03 RLHF and DPO — Human Preference Methods
Reinforcement Learning from Human Feedback (RLHF)
RLHF is the process that powers ChatGPT. The key idea: instead of a hand-crafted reward function, use human annotators to define what "good" means. RLHF runs in four stages:
Direct Preference Optimization (DPO)
DPO simplifies RLHF by skipping the separate reward model entirely. Instead of training a reward model and then applying PPO, DPO directly fine-tunes the LLM on preference pairs — prompts labelled with a chosen response and a rejected response. The algorithm maximises the probability of generating chosen responses relative to rejected ones.
1. Generate two responses to each prompt
2. Get human feedback: which response do annotators prefer? (thumbs up/down)
3. Build a preference dataset of (prompt, chosen, rejected) triplets
4. Apply the DPO algorithm to push up the probability of chosen responses and push down the probability of rejected ones
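For a single preference triplet, the final step can be sketched with the loss from the standard DPO formulation. Two details here are not mentioned in the text above and are taken from that formulation: the frozen reference model, whose log-probabilities anchor the update, and the scaling factor `beta` (the default of 0.1 is only an illustrative value). The function names are this sketch's own.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triplet.

    Inputs are sequence log-probabilities log p(response | prompt) under the
    model being trained (policy) and a frozen reference copy of it.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return neg_log_sigmoid(beta * (chosen_margin - rejected_margin))


# The loss is lower when the policy favours the chosen response more than
# the reference does, and higher when it favours the rejected one.
print(dpo_loss(-5.0, -10.0, -7.0, -7.0))
print(dpo_loss(-10.0, -5.0, -7.0, -7.0))
```

Minimising this loss over the preference dataset is what raises the probability of chosen responses relative to rejected ones.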
Both RLHF and DPO rely on human preference rather than verifiable ground truth — and both share a critical limitation:
| Challenge | RLHF | DPO |
|---|---|---|
| Data Needed | Ranked generations for reward model | Paired preference labels (A > B) — large volume needed |
| Compute/Memory | Very High | Moderate |
| Training Stability | Often unstable — reward hacking, collapse risk | More stable, but needs lots of labels |
| Limitation | Doesn't teach new tasks — only steers preferences | Doesn't teach new tasks — only steers preferences |
04 GRPO — The DeepSeek Approach
Group Relative Policy Optimization (GRPO) is the algorithm behind DeepSeek-R1. It sidesteps the need for human preference labels entirely by replacing them with programmable reward functions — deterministic functions that score responses on verifiable criteria like correct answers, valid code, or proper formatting.
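A programmable reward function can be very small. The sketch below scores a math response on two verifiable criteria, format and correctness. The function name `math_reward`, the 0.2 format-only score, and the tolerance are assumptions made for illustration; the `<think>`/`<answer>` tags are the ones introduced in section 01.

```python
import re

# Expect exactly: <think>...</think> followed by <answer>...</answer>
RESPONSE_RE = re.compile(
    r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL
)


def math_reward(response: str, expected: float, tol: float = 1e-6) -> float:
    """Return 0.0 if malformed, 0.2 if well-formed but wrong, 1.0 if correct."""
    match = RESPONSE_RE.fullmatch(response.strip())
    if match is None:
        return 0.0
    try:
        value = float(match.group(2).strip())
    except ValueError:
        return 0.2  # valid tags, but the answer is not a number
    return 1.0 if abs(value - expected) <= tol else 0.2


print(math_reward("<think>6 * 7 = 42</think><answer>42</answer>", 42))  # 1.0
print(math_reward("<think>6 * 7 = 41</think><answer>41</answer>", 42))  # 0.2
print(math_reward("The answer is 42", 42))                              # 0.0
```

Because the function is deterministic, the same response always receives the same score, which is what makes the criterion verifiable.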
GRPO's key insight: instead of asking "which response do humans prefer?", ask "which responses are above average within this group?" The algorithm samples multiple responses to a single prompt, scores them all, then raises the probability of above-average responses and lowers the probability of below-average ones, all relative to the group mean.
How GRPO is different
The crucial difference from RLHF and DPO: no human feedback, no reward model. The reward function is a piece of code you write. For tasks with verifiable outputs — math problems, code, formatted text — this is a dramatic simplification. GRPO runs the loop directly on the reward function, enabling fine-tuning with as few as 10–20 examples.
1. Send the prompt and sample multiple responses (temperature ~0.7 for diversity)
2. Score each response with a programmable reward function; no human annotators are needed
3. Compute advantages: normalise scores within the group (subtract the mean, divide by the standard deviation)
4. Update weights: push up above-average responses, push down below-average ones
5. Repeat on new examples or the same ones; each iteration makes the model a bit better
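The five steps above can be sketched on a toy "policy" that simply chooses among a few candidate answers. This is a heavily simplified illustration, not the full GRPO objective: it keeps the group-normalised advantages but omits the clipping and KL terms used in practice, and it treats each response as a single choice rather than a token sequence. All function names are hypothetical.

```python
import math
import random
from statistics import fmean, pstdev


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_advantages(rewards):
    """Subtract the group mean, divide by the group std (zeros if all equal)."""
    mean, std = fmean(rewards), pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def grpo_step(logits, sampled, rewards, lr=0.5):
    """One simplified update: advantage-weighted log-probability gradient."""
    probs = softmax(logits)
    advantages = group_advantages(rewards)
    new_logits = list(logits)
    for action, adv in zip(sampled, advantages):
        for k in range(len(logits)):
            # d/d logit_k of log softmax(logits)[action]
            grad = (1.0 if k == action else 0.0) - probs[k]
            new_logits[k] += lr * adv * grad / len(sampled)
    return new_logits


def train(candidates, reward_fn, steps=200, group_size=8, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        # Step 1: sample a group of responses from the current policy
        sampled = rng.choices(
            range(len(candidates)), weights=softmax(logits), k=group_size
        )
        # Step 2: score each response with the reward function
        rewards = [reward_fn(candidates[i]) for i in sampled]
        # Steps 3 and 4: compute advantages and update
        logits = grpo_step(logits, sampled, rewards)
    return dict(zip(candidates, softmax(logits)))


toy_rewards = {"POUND": 1.0, "FOUND": 0.8, "CROWN": 0.2}
print(train(list(toy_rewards), toy_rewards.get))
```

Over many iterations the probability mass should tend to move toward the higher-reward candidates. When every sampled response earns the same reward, `group_advantages` returns zeros and the step changes nothing.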
05 Benefits of Reinforcement Fine-Tuning (RFT) and When to Use It
- No labelled data required — you only need a means to verify correctness, not a labelled dataset
- Works with very few examples — as few as 10, but scales well with more prompts
- More flexible than SFT — learns from feedback rather than fixed examples, so it generalises better
- Enables organic reasoning improvement — the model discovers better chain-of-thought strategies on its own
Tasks suited for RFT
- Mathematical problem solving: reward functions can check numeric correctness precisely
- Code generation and debugging: run the generated code and reward it based on how many tests pass
- Logical and multi-step reasoning: tasks requiring a sequence of decisions where pattern matching fails
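For the code generation case in the list above, a reward based on passing tests might look like the sketch below. The name `unit_test_reward` and the (arguments, expected result) case format are assumptions for illustration. Note that `exec` runs arbitrary code, so a real pipeline would need a sandbox and timeouts, which this sketch leaves out.

```python
from typing import List, Tuple


def unit_test_reward(
    source: str, func_name: str, cases: List[Tuple[tuple, object]]
) -> float:
    """Return the fraction of test cases the generated function passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: sandbox untrusted code in practice
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that fails to load earns nothing
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case counts as a failure
    return passed / len(cases)


# A deliberately buggy "generated" solution: wrong for negative inputs.
generated = "def add(a, b):\n    return abs(a) + b\n"
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0), ((-2, 5), 3)]
print(unit_test_reward(generated, "add", cases))  # 0.5
```

Returning a fraction rather than pass/fail gives partially correct programs a higher score than broken ones.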
Decision Guide — RFT vs SFT vs RLHF
06 Designing Reward Functions — The Wordle Case Study
The course uses Wordle as a running example: guess a secret 5-letter word in 6 tries or fewer, with feedback after each guess. This is a perfect testbed for GRPO because the feedback is entirely verifiable — no human judgement required.
| Feedback | Meaning | Symbol |
|---|---|---|
| 🟩 Green | Correct letter, correct position | ✔ |
| 🟨 Yellow | Correct letter, wrong position | — |
| ⬛ Grey | Letter not in the word | ✗ |
Secret word: POUND
Reward Function 1 — Binary (and why it fails)
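The simplest reward: 1 if the guess is exactly correct, 0 otherwise. Intuitive, but it has a critical flaw: every wrong guess gets the same score of 0, so GRPO has no signal to distinguish a near-miss from a completely wrong answer. When every response in a group scores 0, no response is above or below the group average, so there is nothing to push up or down.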
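A minimal sketch of the binary reward makes the problem concrete. The function name `binary_reward` is this sketch's own; the guesses are taken from the POUND example used in this section.

```python
def binary_reward(guess: str, secret: str) -> float:
    """1.0 only for an exact match, 0.0 for everything else."""
    return 1.0 if guess.upper() == secret.upper() else 0.0


guesses = ["FOUND", "NOUDI", "CROWN", "DONUT"]
rewards = [binary_reward(g, "POUND") for g in guesses]
print(rewards)  # [0.0, 0.0, 0.0, 0.0]
```

FOUND is one letter away from the secret word, yet it scores the same as CROWN.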
Reward Function 2 — Partial Credit (the fix)
Adding partial credit — rewarding correct letters in correct positions more, correct letters in wrong positions less — gives GRPO the gradient it needs. Now responses are meaningfully differentiated.
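One possible partial-credit scheme is sketched below. The weights are an assumption: two points per green letter and one per yellow, scaled so a perfect guess scores 1.0, with 0.0 for a guess of the wrong length. They were chosen because they reproduce the example rewards listed for the secret word POUND later in this section; the course's actual function may differ.

```python
from collections import Counter


def partial_credit_reward(guess: str, secret: str) -> float:
    """Score a Wordle guess: greens count double, yellows count once."""
    guess, secret = guess.upper(), secret.upper()
    if len(guess) != len(secret):
        return 0.0  # invalid guess length earns nothing
    # Secret letters not matched in place are available for yellow credit.
    remaining = Counter(s for g, s in zip(guess, secret) if g != s)
    greens = yellows = 0
    for g, s in zip(guess, secret):
        if g == s:
            greens += 1
        elif remaining[g] > 0:
            yellows += 1
            remaining[g] -= 1  # each secret letter can be matched only once
    return (2 * greens + yellows) / (2 * len(secret))


for guess in ["FOUND", "NOUDI", "WORD", "CROWN", "INDOM", "DONUT", "FROWN"]:
    print(guess, partial_credit_reward(guess, "POUND"))
```

With this scheme FOUND (four greens) scores 0.8 while CROWN (two yellows) scores 0.2, so near-misses are no longer indistinguishable from poor guesses.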
Advantage Calculation
With scores in hand, GRPO computes the advantage for each response — how much better or worse it is than the group average, normalised by standard deviation:
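Written out for a group of $G$ responses with rewards $r_1, \dots, r_G$, the calculation is as follows. The symbols are notation assumed here, and the population form of the standard deviation (dividing by $G$) is the one that matches the example values in the table that follows.

$$
A_i = \frac{r_i - \mu}{\sigma},
\qquad
\mu = \frac{1}{G}\sum_{j=1}^{G} r_j,
\qquad
\sigma = \sqrt{\frac{1}{G}\sum_{j=1}^{G} \left(r_j - \mu\right)^2}
$$

A positive $A_i$ marks a response that scored above the group mean, and a negative $A_i$ marks one that scored below it.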
Partial credit in action — secret word POUND, temperature 0.7:
| # | Guess | Reward | Advantage |
|---|---|---|---|
| 0 | FOUND | 0.8 | +1.3525 |
| 1 | NOUDI | 0.6 | +0.6312 |
| 2 | FOUND | 0.8 | +1.3525 |
| 3 | WORD | 0.0 | −1.5328 |
| 4 | CROWN | 0.2 | −0.8115 |
| 5 | INDOM | 0.3 | −0.4508 |
| 6 | DONUT | 0.5 | +0.2705 |
| 7 | FROWN | 0.2 | −0.8115 |
GRPO will now increase the probability of generating FOUND-like guesses and decrease the probability of WORD-like guesses. Learning is happening.
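The advantages in the table can be checked with a few lines of standard-library Python. This is only a verification sketch of the subtract-mean, divide-by-std step, using the population standard deviation, which is the variant that reproduces the listed values.

```python
from statistics import fmean, pstdev

guesses = ["FOUND", "NOUDI", "FOUND", "WORD", "CROWN", "INDOM", "DONUT", "FROWN"]
rewards = [0.8, 0.6, 0.8, 0.0, 0.2, 0.3, 0.5, 0.2]

mean = fmean(rewards)   # group mean
std = pstdev(rewards)   # population standard deviation (divide by N)
advantages = [(r - mean) / std for r in rewards]

for guess, reward, adv in zip(guesses, rewards, advantages):
    print(f"{guess:<6} reward={reward:.1f} advantage={adv:+.4f}")
```

Because the mean is subtracted, the advantages sum to zero: credit given to above-average guesses is balanced by the penalty on below-average ones.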
07 Temperature — The Critical Knob
GRPO requires two things from its response batch: diversity in responses and diversity in rewards. Temperature controls both. Get it wrong and learning stalls.
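As a sketch of the mechanism, temperature divides the model's logits before the softmax that produces sampling probabilities. The logits below are made-up numbers for three candidate tokens, and the function name is this sketch's own.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a very low temperature almost all probability sits on the top token, so the sampled responses tend to be near-identical, their rewards tend to match, and the group advantages collapse to zero. Raising the temperature flattens the distribution and restores variety, although a very high value would presumably make the samples too random to earn useful rewards.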
08 LLM as Judge — Beyond Verifiable Tasks
Programmable reward functions work beautifully for tasks with clear correctness criteria — math, code, Wordle. But what about tasks where "good" is harder to define, like creative writing quality or nuanced reasoning?
The answer is to use a separate LLM as a proxy for human judgement — a "judge" model that scores responses on subjective criteria. This creates a reward function from an LLM's assessment, allowing GRPO to fine-tune in situations where outcomes are not easily verifiable by a simple programmatic test.
The judge model acts as a stand-in for a human annotator — but runs automatically at scale. The key requirement is that the judge model's scoring aligns well enough with actual human preferences that optimising for it produces genuinely better outputs. This is a design decision worth testing carefully.
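A judge-based reward can be wrapped so that it looks like any other reward function. Everything in this sketch is an assumption for illustration: the prompt wording, the 1 to 10 scale, the helper `make_judge_reward`, and the `judge` callable, which stands in for whatever client would send text to a real judge model. A stub judge is used so the example runs offline.

```python
import re
from typing import Callable

JUDGE_PROMPT = (
    "Rate the following response to the prompt on a scale of 1 to 10.\n"
    "Reply with the number only.\n\n"
    "Prompt: {prompt}\n\nResponse: {response}\n"
)


def make_judge_reward(judge: Callable[[str], str]) -> Callable[[str, str], float]:
    """Wrap a text-in/text-out judge model as a reward function in [0, 1]."""

    def reward(prompt: str, response: str) -> float:
        reply = judge(JUDGE_PROMPT.format(prompt=prompt, response=response))
        match = re.search(r"\d+(?:\.\d+)?", reply)
        if match is None:
            return 0.0  # unparseable verdicts earn no reward
        score = min(max(float(match.group()), 1.0), 10.0)  # clamp to the scale
        return (score - 1.0) / 9.0

    return reward


# Stub judge (hypothetical): a real one would call an LLM.
def fake_judge(text: str) -> str:
    return "Score: 7"


judge_reward = make_judge_reward(fake_judge)
print(judge_reward("Write a haiku about rain.", "Soft rain on the roof"))
```

Before training against such a reward, it is worth comparing the judge's scores with a small set of human ratings, since the model will optimise for whatever the judge actually rewards.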
Summary — which method to use?
| Method | Reward Source | Data Needed | Best For |
|---|---|---|---|
| SFT | Ground truth labels | 1000s of labelled pairs | Simple, well-defined tasks with abundant data |
| RLHF | Human rankings | Ranked generations + reward model | Preference alignment — tone, helpfulness |
| DPO | Human preference pairs | Chosen/rejected pairs | Simpler preference alignment than RLHF |
| GRPO ✦ | Programmatic / LLM judge | As few as 10 examples | Multi-step reasoning, math, code, verifiable tasks |