Reinforcement Fine-tuning LLMs with GRPO

01 The Problem with Supervised Fine-tuning

For years, the default recipe for teaching an LLM a new skill was Supervised Fine-Tuning (SFT): assemble thousands of labelled prompt–response pairs, run forward and backward passes to minimise prediction error, repeat. It works well for classification, named entity recognition, and straightforward code generation.

But SFT has two structural weaknesses that become painful at scale. First, it requires thousands of high-quality labelled examples — difficult and expensive to collect, especially for reasoning-heavy tasks where even defining the "correct" answer is non-trivial. Second, it encourages overfitting: the model memorises patterns from the training distribution rather than developing generalizable reasoning strategies.

SFT requires a large labelled dataset and can overfit to training patterns.

For simple tasks you can show the model a math problem and its final answer, and it will learn to generalise. For complex multi-step problems, you can include <think></think> reasoning traces alongside <answer></answer> tags to teach both output format and step-by-step reasoning simultaneously. But even then, generating high-quality reasoning traces at scale remains a significant bottleneck.
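To make the tagged format concrete, here is a minimal sketch of how one SFT training example with a reasoning trace might be assembled and parsed back. The helper names `build_sft_example` and `parse_response`, the dictionary keys, and the sample question are illustrative assumptions, not part of any particular training library.

```python
import re

def build_sft_example(question: str, reasoning: str, answer: str) -> dict:
    """Pair a prompt with a target holding a reasoning trace and a final answer."""
    target = f"<think>{reasoning}</think>\n<answer>{answer}</answer>"
    return {"prompt": question, "response": target}

# Hypothetical pattern: one <think> block followed by one <answer> block.
TAGGED = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def parse_response(text: str):
    """Return (reasoning, answer), or None if the tags are missing or malformed."""
    match = TAGGED.fullmatch(text.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

example = build_sft_example(
    "What is 12 * 7?",
    "12 * 7 = 10 * 7 + 2 * 7 = 70 + 14 = 84.",
    "84",
)
print(example["response"])
print(parse_response(example["response"]))
```

The bottleneck described above sits in the `reasoning` argument: someone, or something, has to write a good trace for every example.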

The core question SFT can't easily answer: what if there's a task where the correct answer is verifiable, but the reasoning path to get there isn't? Reinforcement Learning was built for exactly this scenario.

02 Reinforcement Learning — The Core Idea

In Reinforcement Learning, an agent learns by interacting with an environment and optimising for a reward signal — rather than mimicking fixed labelled examples. The classic intuition: training a puppy. The puppy (agent) performs a trick (action), you give or withhold a treat (reward), and the puppy updates its behaviour based on what earned the treat.

The RL loop: the agent (LLM) produces actions (token sequences), the environment scores them, rewards flow back to update the model.

Applied to LLMs, the loop looks like this: a prompt from the environment is fed to the model, which generates a token sequence as its action. That response is evaluated by a scoring function that produces a reward signal. The model weights are updated to maximise future rewards. The process repeats — on new examples or the same ones.

The RL training loop applied to LLMs — examples become prompts, responses are scored, rewards update the model.

03 RLHF and DPO — Human Preference Methods

Reinforcement Learning from Human Feedback (RLHF)

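RLHF is the technique used to align ChatGPT. The key idea: instead of a hand-crafted reward function, use human annotators to define what "good" means. RLHF runs in four stages: generate several responses per prompt, have human annotators rank them, train a separate reward model on those rankings, and fine-tune the LLM with PPO against that reward model.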

RLHF requires training a separate reward model from human rankings, then using PPO to fine-tune the LLM.

Direct Preference Optimization (DPO)

DPO simplifies RLHF by skipping the separate reward model entirely. Instead of training a reward model and then applying PPO, DPO directly fine-tunes the LLM on preference pairs — prompts labelled with a chosen response and a rejected response. The algorithm maximises the probability of generating chosen responses relative to rejected ones.

1. Generate 2 responses to each prompt
2. Get human feedback: which response do annotators prefer? (thumbs up/down)
3. Build preference dataset: (prompt, chosen, rejected) triplets
4. Apply the DPO algorithm to push up probability of chosen, push down probability of rejected
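Step 4 can be illustrated with the per-pair loss from the standard DPO formulation. This is a sketch under assumptions: the frozen reference model and the `beta` scaling factor come from that formulation rather than from the text above, and the log-probability values are invented for the example.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from sequence log-probabilities (illustrative values only)."""
    # How much more the policy likes each response than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: no preference learned yet, loss = log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

Minimising this loss raises the chosen response's log-probability relative to the rejected one, which is the "push up, push down" behaviour described in step 4.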

Both RLHF and DPO rely on human preference rather than verifiable ground truth — and both share a critical limitation:

Neither RLHF nor DPO can teach new reasoning capabilities. They can only steer preferences — aligning the model toward responses humans rate higher. If the model doesn't already know how to solve a problem, ranking its bad attempts doesn't help it learn to solve it well.

Challenge	RLHF	DPO
Data Needed	Ranked generations for reward model	Paired preference labels (A > B) — large volume needed
Compute/Memory	Very High	Moderate
Training Stability	Often unstable — reward hacking, collapse risk	More stable, but needs lots of labels
Limitation	Doesn't teach new tasks — only steers preferences	Doesn't teach new tasks — only steers preferences

04 GRPO — The DeepSeek Approach

Group Relative Policy Optimization (GRPO) is the algorithm behind DeepSeek-R1. It sidesteps the need for human preference labels entirely by replacing them with programmable reward functions — deterministic functions that score responses on verifiable criteria like correct answers, valid code, or proper formatting.

GRPO's key insight: instead of asking "which response do humans prefer?", ask "which responses are above average within this batch?" The algorithm samples multiple responses to a single prompt, scores them all, then pushes up the probability of above-average responses and down the probability of below-average ones — all relative to the group mean.

GRPO's three steps: generate a group of responses, score each with a reward function, compute advantages and update weights accordingly.

How GRPO is different

The crucial difference from RLHF and DPO: no human feedback, no reward model. The reward function is a piece of code you write. For tasks with verifiable outputs — math problems, code, formatted text — this is a dramatic simplification. GRPO runs the loop directly on the reward function, enabling fine-tuning with as few as 10–20 examples.

1. Send prompt and sample multiple responses (temperature ~0.7 for diversity)
2. Score each response with a programmable reward function; no human annotators needed
3. Compute advantages: normalise scores within the group (subtract mean, divide by std)
4. Update weights: push up above-average responses, push down below-average ones
5. Repeat, on new examples or the same ones. Each iteration makes the model a bit better
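The steps above can be sketched on a toy problem. This is a deliberate simplification, not the full algorithm: the "model" is just a vector of logits over three candidate responses, and the update is a plain policy-gradient step weighted by group-normalised advantages. The published GRPO objective additionally uses a clipped probability ratio and a KL penalty over token-level log-probabilities, which are omitted here. The group, the rewards, and the learning rate `lr` are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def grpo_toy_step(logits, sampled_actions, rewards, lr=0.5):
    """One simplified update: raise log-probs of above-average samples, lower the rest."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std == 0:
        return logits.copy()          # identical rewards: no learning signal
    advantages = (rewards - rewards.mean()) / std
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for action, adv in zip(sampled_actions, advantages):
        one_hot = np.zeros_like(logits)
        one_hot[action] = 1.0
        # Gradient of adv * log p(action) with respect to the logits.
        grad += adv * (one_hot - probs)
    return logits + lr * grad / len(sampled_actions)

logits = np.zeros(3)                  # three candidate responses, equally likely at first
group = [0, 1, 2, 0]                  # indices of the sampled responses (hypothetical)
rewards = [1.0, 0.2, 0.0, 1.0]        # hypothetical reward-function scores
new_logits = grpo_toy_step(logits, group, rewards)
print(softmax(logits))
print(softmax(new_logits))
```

After one step, response 0 (above the group mean) becomes more likely and responses 1 and 2 (below the mean) become less likely, with the worst-scoring response dropping furthest.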

05 Benefits of RFT — and When to Use It

No labelled data required — you only need a means to verify correctness, not a labelled dataset
Works with very few examples — as few as 10, but scales well with more prompts
More flexible than SFT — learns from feedback rather than fixed examples, so it generalises better
Enables organic reasoning improvement — the model discovers better chain-of-thought strategies on its own

Tasks suited for RFT

Mathematical problem solving — reward functions can check numeric correctness precisely
Code generation and debugging — run the code and reward based on test passage
Logical and multi-step reasoning — tasks requiring a sequence of decisions where pattern matching fails
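For the mathematical case in the list above, a reward function might look like the following sketch. It assumes the model wraps its final answer in `<answer></answer>` tags; the function name `math_reward`, the tolerance, and the 0.2 partial score for a well-formed but wrong answer are arbitrary illustrative choices.

```python
import re

ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def math_reward(response: str, expected: float, tol: float = 1e-6) -> float:
    """Score a response by checking the number inside its <answer> tags."""
    match = ANSWER_TAG.search(response)
    if match is None:
        return 0.0                    # no answer tag at all
    try:
        value = float(match.group(1).strip())
    except ValueError:
        return 0.0                    # tag present, but not a number
    if abs(value - expected) <= tol:
        return 1.0                    # numerically correct
    return 0.2                        # well-formed but wrong (assumed partial score)

print(math_reward("<think>12 * 7 = 84</think><answer>84</answer>", 84))
print(math_reward("<answer>82</answer>", 84))
print(math_reward("The answer is 84", 84))
```

No labelled reasoning trace is involved: the function only needs the expected final value.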

Decision Guide — RFT vs SFT vs RLHF

Decision flowchart: choose RFT when you have a verifiable task and limited or no labelled data. SFT wins when you have abundant labelled examples.

06 Designing Reward Functions — The Wordle Case Study

The course uses Wordle as a running example: guess a secret 5-letter word in 6 tries or fewer, with feedback after each guess. This is a perfect testbed for GRPO because the feedback is entirely verifiable — no human judgement required.

Feedback	Meaning	Symbol
🟩 Green	Correct letter, correct position	✔
🟨 Yellow	Correct letter, wrong position	—
⬛ Grey	Letter not in the word	✗

Secret word: POUND

FOUNZ: O, U and N are in the right place; F and Z are not in the word

POUND — perfect!

Reward Function 1 — Binary (and why it fails)

The simplest reward: 1 if the guess is exactly correct, 0 otherwise. Intuitive, but it produces a critical flaw — every wrong guess gets the same score of 0, so GRPO has no signal to distinguish a near-miss from a completely wrong answer.

python · binary reward

def wordle_reward(guess: str, secret_word: str) -> int:
    if guess.upper() == secret_word.upper():
        return 1   # correct guess
    else:
        return 0   # incorrect guess — no signal at all

With a binary reward, guessing SOUND (4 of 5 letters in the right position) gets the same score as guessing BRAIN (no letters in the right position). GRPO needs diversity in rewards to identify which responses are above average. When every response in a group scores 0, the standard deviation of the rewards is zero, every advantage is zero, and nothing is learned.
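A short sketch makes the failure visible. It re-implements the binary reward in one line and applies group normalisation to a hypothetical group of four wrong guesses for the secret word POUND.

```python
import numpy as np

def binary_reward(guess: str, secret_word: str) -> int:
    return 1 if guess.upper() == secret_word.upper() else 0

guesses = ["SOUND", "BRAIN", "FOUND", "CROWN"]   # hypothetical group, all incorrect
rewards = np.array([binary_reward(g, "POUND") for g in guesses], dtype=float)

std = rewards.std()
# With identical rewards the standard deviation is 0, so there is nothing to normalise.
advantages = np.zeros_like(rewards) if std == 0 else (rewards - rewards.mean()) / std

print(rewards.tolist())      # [0.0, 0.0, 0.0, 0.0]
print(advantages.tolist())   # [0.0, 0.0, 0.0, 0.0]
```

SOUND and FOUND are one letter away from the answer, yet they receive exactly the same zero advantage as the other guesses.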

Reward Function 2 — Partial Credit (the fix)

Adding partial credit — rewarding correct letters in correct positions more, correct letters in wrong positions less — gives GRPO the gradient it needs. Now responses are meaningfully differentiated.

python · partial credit reward

def wordle_reward_partial_credit(guess: str, secret_word: str) -> float:
    # normalise case so scoring is consistent with the binary reward above
    guess, secret_word = guess.upper(), secret_word.upper()
    if len(guess) != len(secret_word):
        return 0.0   # wrong length gets nothing
    valid_letters = set(secret_word)
    reward = 0.0
    for letter, secret_letter in zip(guess, secret_word):
        if letter == secret_letter:
            reward += 0.2   # right letter, right position ✔
        elif letter in valid_letters:
            reward += 0.1   # right letter, wrong position —
        # no reward for wrong letters
    return reward

Advantage Calculation

With scores in hand, GRPO computes the advantage for each response — how much better or worse it is than the group average, normalised by standard deviation:

python · compute_advantages

import numpy as np

def compute_advantages(rewards: list) -> list:
    rewards = np.array(rewards, dtype=float)
    mean_reward = np.mean(rewards)
    std_reward = np.std(rewards)   # population standard deviation (ddof=0)
    if std_reward == 0:
        return [0.0] * len(rewards)   # all same score → no signal
    advantages = (rewards - mean_reward) / std_reward
    return advantages.tolist()

Partial credit in action — secret word POUND, temperature 0.7:

#	Guess	Reward	Advantage
0	FOUND	0.8	+1.3525
1	NOUDI	0.6	+0.6312
2	FOUND	0.8	+1.3525
3	WORD	0.0	−1.5328
4	CROWN	0.2	−0.8115
5	INDOM	0.3	−0.4508
6	DONUT	0.5	+0.2705
7	FROWN	0.2	−0.8115

GRPO will now increase the probability of generating FOUND-like guesses and decrease the probability of WORD-like guesses. Learning is happening.
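The table can be checked end to end with a self-contained sketch. It restates the partial-credit scoring and the normalisation compactly (the short function name `partial_credit` is only for this example) and should reproduce the Reward and Advantage columns above.

```python
import numpy as np

def partial_credit(guess: str, secret_word: str) -> float:
    """0.2 per letter in the right position, 0.1 per letter present elsewhere."""
    if len(guess) != len(secret_word):
        return 0.0
    valid_letters = set(secret_word)
    score = 0.0
    for letter, secret_letter in zip(guess, secret_word):
        if letter == secret_letter:
            score += 0.2
        elif letter in valid_letters:
            score += 0.1
    return score

guesses = ["FOUND", "NOUDI", "FOUND", "WORD", "CROWN", "INDOM", "DONUT", "FROWN"]
rewards = np.array([partial_credit(g, "POUND") for g in guesses])
# Group-relative advantage: subtract the mean, divide by the population std.
advantages = (rewards - rewards.mean()) / rewards.std()

for guess, reward, advantage in zip(guesses, rewards, advantages):
    print(f"{guess:<6} {reward:.1f} {advantage:+.4f}")
```

The advantages sum to zero by construction, so the update always pushes some responses up and others down within the same group.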

07 Temperature — The Critical Knob

GRPO requires two things from its response batch: diversity in responses and diversity in rewards. Temperature controls both. Get it wrong and learning stalls.

Temperature 0 → no diversity, advantage always 0. Temperature ~0.7 → a good balance of diversity and quality in this setup. Temperature too high → diverse but low-quality responses that slow learning.
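The effect of temperature on sampling can be sketched with a temperature-scaled softmax over next-token logits. The logits below are made up, and treating temperature 0 as a greedy argmax is a common convention assumed here rather than something stated above.

```python
import numpy as np

def token_probs(logits, temperature: float):
    """Sampling distribution over tokens at a given temperature."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Greedy decoding: all probability mass on the most likely token.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    z = np.exp(scaled - scaled.max())
    return z / z.sum()

logits = [2.0, 1.0, 0.5]              # hypothetical next-token logits
for t in (0, 0.7, 2.0):
    print(t, np.round(token_probs(logits, t), 3))
```

At temperature 0 every sample in the group is the same response, so all rewards are equal and the advantages vanish. Raising the temperature flattens the distribution, which adds diversity but also gives more weight to unlikely, often worse, tokens.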

08 LLM as Judge — Beyond Verifiable Tasks

Programmable reward functions work beautifully for tasks with clear correctness criteria — math, code, Wordle. But what about tasks where "good" is harder to define, like creative writing quality or nuanced reasoning?

The answer is to use a separate LLM as a proxy for human judgement — a "judge" model that scores responses on subjective criteria. This creates a reward function from an LLM's assessment, allowing GRPO to fine-tune in situations where outcomes are not easily verifiable by a simple programmatic test.

The judge model acts as a stand-in for a human annotator — but runs automatically at scale. The key requirement is that the judge model's scoring aligns well enough with actual human preferences that optimising for it produces genuinely better outputs. This is a design decision worth testing carefully.
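A judge-based reward can be wrapped in the same interface as a programmatic one. The sketch below is hypothetical throughout: `judge` stands for any callable that sends a prompt to a judge model and returns its text reply, the prompt wording and 0 to 10 scale are illustrative, and `fake_judge` is a stub so the example runs without a model.

```python
import re
from typing import Callable

# Illustrative prompt template; real rubrics are usually more detailed.
JUDGE_PROMPT = (
    "Rate the following response from 0 to {max_score} for clarity and quality. "
    "Reply with the number only.\n\nResponse:\n{response}"
)

def judge_reward(response: str, judge: Callable[[str], str], max_score: int = 10) -> float:
    """Turn a judge model's textual score into a reward in [0, 1]."""
    reply = judge(JUDGE_PROMPT.format(response=response, max_score=max_score))
    match = re.search(r"\d+", reply)
    if match is None:
        return 0.0                    # unparseable verdict: give no reward
    score = min(int(match.group()), max_score)
    return score / max_score

def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model call; always answers 7."""
    return "7"

print(judge_reward("A short story about a lighthouse keeper.", fake_judge))
```

Before training against such a reward, it is worth comparing the judge's scores with a small set of human ratings, since the policy will optimise whatever the judge rewards, including its blind spots.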

Summary — which method to use?

Method	Reward Source	Data Needed	Best For
SFT	Ground truth labels	1000s of labelled pairs	Simple, well-defined tasks with abundant data
RLHF	Human rankings	Ranked generations + reward model	Preference alignment — tone, helpfulness
DPO	Human preference pairs	Chosen/rejected pairs	Simpler preference alignment than RLHF
GRPO ✦	Programmatic / LLM judge	As few as 10 examples	Multi-step reasoning, math, code, verifiable tasks