[Figure: the GRPO feedback loop. An LLM produces a response, a reward function scores it (for example +0.8, -0.1, +0.7), and the score drives a weight update. DeepLearning.AI · Andrew Ng · Aug 2025 · Reinforcement Fine-tuning & GRPO · SFT · RLHF · DPO · GRPO · Reward Functions · Wordle]

Deep Dive · Reinforcement Learning · August 2025

How to Fine-tune LLMs with GRPO

Group Relative Policy Optimization (GRPO) is the algorithm that powers DeepSeek-R1's remarkable reasoning abilities — and it may be the most practical advance in LLM training since instruction fine-tuning. Here's a complete breakdown: what it is, how it compares to RLHF and DPO, and how to build reward functions that actually teach a model to reason.

Tags: GRPO · Reinforcement Learning · Fine-tuning · DeepSeek
Andrew Ng · DeepLearning.ai · Aug 18, 2025

01 The Problem with Supervised Fine-tuning

For years, the default recipe for teaching an LLM a new skill was Supervised Fine-Tuning (SFT): assemble thousands of labelled prompt–response pairs, run forward and backward passes to minimise prediction error, repeat. It works well for classification, named entity recognition, and straightforward code generation.

But SFT has two structural weaknesses that become painful at scale. First, it requires thousands of high-quality labelled examples — difficult and expensive to collect, especially for reasoning-heavy tasks where even defining the "correct" answer is non-trivial. Second, it encourages overfitting: the model memorises patterns from the training distribution rather than developing generalizable reasoning strategies.

[Figure: supervised fine-tuning (SFT). Labelled dataset → LLM → compare to correct answer → update weights. Caveats: 1000s of examples needed; overfitting risk.]
SFT requires a large labelled dataset and can overfit to training patterns.

For simple tasks you can show the model a math problem and its final answer, and it will learn to generalise. For complex multi-step problems, you can include <think></think> reasoning traces alongside <answer></answer> tags to teach both output format and step-by-step reasoning simultaneously. But even then, generating high-quality reasoning traces at scale remains a significant bottleneck.

The core question SFT can't easily answer: what if there's a task where the correct answer is verifiable, but the reasoning path to get there isn't? Reinforcement Learning was built for exactly this scenario.

02 Reinforcement Learning — The Core Idea

In Reinforcement Learning, an agent learns by interacting with an environment and optimising for a reward signal — rather than mimicking fixed labelled examples. The classic intuition: training a puppy. The puppy (agent) performs a trick (action), you give or withhold a treat (reward), and the puppy updates its behaviour based on what earned the treat.

[Figure: the RL loop. The agent (LLM) sends actions (token sequences) to the environment (reward function), which returns rewards and observations, for example scores of +0.8 and −0.1.]
The RL loop: the agent (LLM) produces actions (token sequences), the environment scores them, rewards flow back to update the model.

Applied to LLMs, the loop looks like this: a prompt from the environment is fed to the model, which generates a token sequence as its action. That response is evaluated by a scoring function that produces a reward signal. The model weights are updated to maximise future rewards. The process repeats — on new examples or the same ones.

[Figure: Example → Prompt → LLM generates response → Score → Reward → Update → Updated LLM → Repeat]
The RL training loop applied to LLMs — examples become prompts, responses are scored, rewards update the model.
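The loop described above can be written down as a short skeleton. This is only a sketch: `generate`, `score`, and `update` are hypothetical callables standing in for the model's sampler, the reward function, and the optimiser step, none of which the article specifies.

```python
from typing import Callable, List

def rl_training_loop(
    prompts: List[str],
    generate: Callable[[str], str],             # hypothetical: prompt -> response
    score: Callable[[str, str], float],         # hypothetical: (prompt, response) -> reward
    update: Callable[[str, str, float], None],  # hypothetical: one weight update
    epochs: int = 1,
) -> List[float]:
    """Run prompt -> response -> reward -> update and return every reward seen."""
    rewards = []
    for _ in range(epochs):
        for prompt in prompts:
            response = generate(prompt)        # the model's action
            reward = score(prompt, response)   # the environment's feedback
            update(prompt, response, reward)   # nudge the model toward reward
            rewards.append(reward)
    return rewards

# Toy usage with stand-in functions (no real model involved).
answers = {"2+2=?": "4", "3+3=?": "6"}
update_log = []
history = rl_training_loop(
    prompts=list(answers),
    generate=lambda prompt: "4",  # a "model" that always answers 4
    score=lambda prompt, response: 1.0 if answers[prompt] == response else 0.0,
    update=lambda prompt, response, reward: update_log.append((prompt, reward)),
)
print(history)
```

In a real setup the `update` step is where an algorithm such as PPO or GRPO would compute gradients; here it only records what it was given.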

03 RLHF and DPO — Human Preference Methods

Reinforcement Learning from Human Feedback (RLHF)

RLHF is the process that powers ChatGPT. The key idea: instead of a hand-crafted reward function, use human annotators to define what "good" means. RLHF runs in four stages:

[Figure: the four RLHF stages. Step 1, generate responses (A, B, C). Step 2, a human ranks them (for example C > A > B). Step 3, train a reward model on the rankings. Step 4, fine-tune the LLM with PPO, which updates the weights.]
RLHF requires training a separate reward model from human rankings, then using PPO to fine-tune the LLM.

Direct Preference Optimization (DPO)

DPO simplifies RLHF by skipping the separate reward model entirely. Instead of training a reward model and then applying PPO, DPO directly fine-tunes the LLM on preference pairs — prompts labelled with a chosen response and a rejected response. The algorithm maximises the probability of generating chosen responses relative to rejected ones.

  1. Generate 2 responses to each prompt
  2. Get human feedback: which response do annotators prefer? (thumbs up/down)
  3. Build a preference dataset of (prompt, chosen, rejected) triplets
  4. Apply the DPO algorithm to push up the probability of chosen responses and push down the probability of rejected ones
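To make the last step concrete, here is a minimal sketch of the per-pair DPO loss as it is usually formulated. The article does not spell this out, so treat the details as assumptions: the inputs are summed log-probabilities of each response under the model being trained and under a frozen reference copy, and `beta` is a strength hyperparameter whose default of 0.1 is illustrative.

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Loss for one (prompt, chosen, rejected) triplet: -log(sigmoid(margin))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy now favours the chosen response: the loss drops below log(2).
print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

Minimising this loss raises the chosen response's log-probability relative to the rejected one, which is the "push up, push down" behaviour described in step 4.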

Both RLHF and DPO rely on human preference rather than verifiable ground truth — and both share a critical limitation:

Neither RLHF nor DPO can teach new reasoning capabilities. They can only steer preferences — aligning the model toward responses humans rate higher. If the model doesn't already know how to solve a problem, ranking its bad attempts doesn't help it learn to solve it well.

| Challenge | RLHF | DPO |
| --- | --- | --- |
| Data needed | Ranked generations for reward model | Paired preference labels (A > B); large volume needed |
| Compute/memory | Very high | Moderate |
| Training stability | Often unstable: reward hacking, collapse risk | More stable, but needs lots of labels |
| Limitation | Doesn't teach new tasks; only steers preferences | Doesn't teach new tasks; only steers preferences |

04 GRPO — The DeepSeek Approach

Group Relative Policy Optimization (GRPO) is the algorithm behind DeepSeek-R1. It sidesteps the need for human preference labels entirely by replacing them with programmable reward functions — deterministic functions that score responses on verifiable criteria like correct answers, valid code, or proper formatting.
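As an illustration of a reward that checks "proper formatting", here is a minimal sketch. It assumes the output convention mentioned earlier, reasoning inside <think></think> followed by the result inside <answer></answer>; the function name and the all-or-nothing 1.0/0.0 scoring are choices made for this example, not something the course prescribes.

```python
import re

# Assumed convention: one <think> block, then one <answer> block, nothing else.
_FORMAT_RE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")

def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed tag layout, else 0.0."""
    # Each tag must appear exactly once, so repeated blocks are rejected.
    if any(response.count(tag) != 1 for tag in _TAGS):
        return 0.0
    return 1.0 if _FORMAT_RE.fullmatch(response) else 0.0

print(format_reward("<think>P is a common first letter.</think>\n<answer>POUND</answer>"))
print(format_reward("POUND"))
```

Because the check is deterministic, the same response always earns the same score, which is what makes it usable as a training signal without annotators.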

GRPO's key insight: instead of asking "which response do humans prefer?", ask "which responses are above average within this batch?" The algorithm samples multiple responses to a single prompt, scores them all, then pushes up the probability of above-average responses and down the probability of below-average ones — all relative to the group mean.

[Figure: GRPO in three steps. Step 1, generate: the prompt goes to the LLM, which samples multiple responses (A, B, C, D). Step 2, score: a programmatic, verifiable reward function scores each response (in the example: A 0.6, B 0.7, C −0.1, D 0.7). Step 3, update weights: GRPO computes each advantage as A = (reward − mean) / std, raising the probability of responses with a positive advantage and lowering it for those with a negative one.]
GRPO's three steps: generate a group of responses, score each with a reward function, compute advantages and update weights accordingly.

How GRPO is different

The crucial difference from RLHF and DPO: no human feedback, no reward model. The reward function is a piece of code you write. For tasks with verifiable outputs — math problems, code, formatted text — this is a dramatic simplification. GRPO runs the loop directly on the reward function, enabling fine-tuning with as few as 10–20 examples.

  1. Send the prompt and sample multiple responses (temperature ~0.7 for diversity)
  2. Score each response with a programmable reward function; no human annotators are needed
  3. Compute advantages: normalise scores within the group (subtract the mean, divide by the std)
  4. Update weights: push up above-average responses, push down below-average ones
  5. Repeat on new examples or the same ones. Each iteration makes the model a bit better
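The five steps above can be tried on a toy problem. The sketch below is heavily simplified and rests on stated assumptions: the "policy" is just a softmax over three fixed candidate guesses rather than an LLM over tokens, the update is a plain policy-gradient step, and the clipping and KL-penalty terms that full GRPO implementations typically include are left out. `grpo_toy_step` and `toy_reward` are hypothetical names.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def grpo_toy_step(logits, candidates, reward_fn, rng, group_size=8, lr=0.5):
    """One simplified GRPO-style update on a softmax policy over candidates."""
    probs = softmax(logits)
    # Step 1: sample a group of responses from the current policy.
    sampled = rng.choice(len(candidates), size=group_size, p=probs)
    # Step 2: score each response.
    rewards = np.array([reward_fn(candidates[i]) for i in sampled], dtype=float)
    # Step 3: normalise within the group.
    std = rewards.std()
    if std == 0:
        return logits.copy()  # identical rewards: no learning signal
    advantages = (rewards - rewards.mean()) / std
    # Step 4: policy gradient, using d log pi(a) / d logits = one_hot(a) - probs.
    grad = np.zeros_like(logits)
    for i, adv in zip(sampled, advantages):
        one_hot = np.zeros_like(logits)
        one_hot[i] = 1.0
        grad += adv * (one_hot - probs)
    return logits + lr * grad / group_size

def toy_reward(guess: str, secret: str = "POUND") -> float:
    """Fraction of letters in the correct position (toy reward for this demo)."""
    return sum(g == s for g, s in zip(guess, secret)) / len(secret)

candidates = ["POUND", "FOUND", "BRAIN"]
rng = np.random.default_rng(0)
logits = np.zeros(len(candidates))
for _ in range(50):  # Step 5: repeat
    logits = grpo_toy_step(logits, candidates, toy_reward, rng)
print(dict(zip(candidates, np.round(softmax(logits), 3))))
```

In this toy, POUND always has the highest reward in any group, so its logit can only be pushed up, while BRAIN, always the lowest, can only be pushed down.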

05 Benefits of Reinforcement Fine-tuning (RFT) and When to Use It

  • No labelled data required — you only need a means to verify correctness, not a labelled dataset
  • Works with very few examples — as few as 10, but scales well with more prompts
  • More flexible than SFT — learns from feedback rather than fixed examples, so it generalises better
  • Enables organic reasoning improvement — the model discovers better chain-of-thought strategies on its own

Tasks suited for RFT

  • Mathematical problem solving — reward functions can check numeric correctness precisely
  • Code generation and debugging — run the code and reward based on test passage
  • Logical and multi-step reasoning — tasks requiring a sequence of decisions where pattern matching fails
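For the mathematical case, a correctness reward might look like the sketch below. It assumes the model wraps its final result in <answer></answer> tags; the helper names, the tolerance, and the comma handling are illustrative choices rather than anything the course defines.

```python
import re
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the first <answer>...</answer> pair, if any."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else None

def math_reward(response: str, expected: float, tol: float = 1e-6) -> float:
    """1.0 if the tagged answer parses as a number within tol of expected."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0  # no answer tag: nothing to verify
    try:
        value = float(answer.replace(",", ""))  # accept "1,000" style numbers
    except ValueError:
        return 0.0  # not a number
    return 1.0 if abs(value - expected) <= tol else 0.0

print(math_reward("<think>6 * 7 = 42</think><answer>42</answer>", 42))
print(math_reward("<answer>forty-two</answer>", 42))
```

A code-generation reward would follow the same pattern, with the numeric comparison replaced by running the task's tests.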

Decision Guide — RFT vs SFT vs RLHF

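[Figure: decision flowchart. Do you have labelled (ground truth) data? If no: is the task verifiable? No → RLHF; Yes → RFT. If yes: how much? The branches are labelled <100 and >100k, with a "Does CoT / reasoning help?" node (Yes → RFT; No → SFT).]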
Decision flowchart: choose RFT when you have a verifiable task and limited or no labelled data. SFT wins when you have abundant labelled examples.

06 Designing Reward Functions — The Wordle Case Study

The course uses Wordle as a running example: guess a secret 5-letter word in 6 tries or fewer, with feedback after each guess. This is a perfect testbed for GRPO because the feedback is entirely verifiable — no human judgement required.

| Symbol | Feedback | Meaning |
| --- | --- | --- |
| 🟩 | Green | Correct letter, correct position |
| 🟨 | Yellow | Correct letter, wrong position |
| ⬛ | Grey | Letter not in the word |

Secret word: POUND

Guess 1: F O U N Z. O, U and N are in the right positions (green); F and Z are not in the word (grey).
Guess 2: P O U N D. All five letters are green. Solved!

Reward Function 1 — Binary (and why it fails)

The simplest reward: 1 if the guess is exactly correct, 0 otherwise. Intuitive, but it produces a critical flaw — every wrong guess gets the same score of 0, so GRPO has no signal to distinguish a near-miss from a completely wrong answer.

python · binary reward
    def wordle_reward(guess: str, secret_word: str) -> int:
        if guess.upper() == secret_word.upper():
            return 1  # correct guess
        else:
            return 0  # incorrect guess: no signal at all

With a binary reward, guessing SOUND (4 of 5 letters in the correct position) gets the same score as guessing BRAIN (no letters in the correct position). GRPO needs variation in rewards within a group to identify which responses are above average. When every response in the group scores 0, the standard deviation is zero, every advantage is zero, and nothing is learned.
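A quick check makes the problem visible. The sketch below scores four wrong guesses with the binary reward and then normalises them within the group (reward minus group mean, divided by group standard deviation). The function names and the choice of guesses are illustrative.

```python
import statistics

def binary_reward(guess: str, secret_word: str) -> int:
    return 1 if guess.upper() == secret_word.upper() else 0

def group_advantages(rewards):
    """(reward - mean) / std within the group; all zeros if std is zero."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

guesses = ["SOUND", "BRAIN", "FOUND", "CROWN"]
rewards = [binary_reward(g, "POUND") for g in guesses]
print(rewards)                    # [0, 0, 0, 0]
print(group_advantages(rewards))  # [0.0, 0.0, 0.0, 0.0]
```

SOUND and FOUND are one letter away from the answer, yet they are indistinguishable from BRAIN as far as the update is concerned.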

Reward Function 2 — Partial Credit (the fix)

Adding partial credit — rewarding correct letters in correct positions more, correct letters in wrong positions less — gives GRPO the gradient it needs. Now responses are meaningfully differentiated.

python · partial credit reward
    def wordle_reward_partial_credit(guess: str, secret_word: str) -> float:
        if len(guess) != len(secret_word):
            return 0.0  # wrong length gets nothing
        valid_letters = set(secret_word)
        reward = 0.0
        for letter, secret_letter in zip(guess, secret_word):
            if letter == secret_letter:
                reward += 0.2  # right letter, right position
            elif letter in valid_letters:
                reward += 0.1  # right letter, wrong position
            # no reward for wrong letters
        return reward
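One caveat with the partial-credit function above: it credits a repeated letter every time it appears, so a degenerate guess such as OOOOO against POUND would score 0.2 + 4 × 0.1 = 0.6. The sketch below is one possible refinement, not part of the course material: it follows Wordle's convention of counting each letter of the secret at most once. The name `wordle_reward_counted`, the case normalisation, and the rounding are all choices made for this example.

```python
from collections import Counter

def wordle_reward_counted(guess: str, secret_word: str) -> float:
    """Partial credit in which each secret letter can be credited only once."""
    guess, secret_word = guess.upper(), secret_word.upper()
    if len(guess) != len(secret_word):
        return 0.0  # wrong length gets nothing
    # Secret letters not matched exactly remain available for
    # "right letter, wrong position" credit.
    remaining = Counter(s for g, s in zip(guess, secret_word) if g != s)
    reward = 0.0
    for g, s in zip(guess, secret_word):
        if g == s:
            reward += 0.2  # right letter, right position
        elif remaining[g] > 0:
            reward += 0.1  # right letter, wrong position
            remaining[g] -= 1  # use up this occurrence
    return round(reward, 2)  # tidy floating-point sums such as 0.6000000000000001

print(wordle_reward_counted("OOOOO", "POUND"))  # 0.2
print(wordle_reward_counted("FOUND", "POUND"))  # 0.8
```

Closing loopholes like this matters because the model is optimised against whatever the reward function actually measures.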

Advantage Calculation

With scores in hand, GRPO computes the advantage for each response — how much better or worse it is than the group average, normalised by standard deviation:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, r_2, \ldots, r_G)}{\operatorname{std}(r_1, r_2, \ldots, r_G)}$$
python · compute_advantages
    import numpy as np

    def compute_advantages(rewards: list) -> list:
        rewards = np.array(rewards, dtype=float)
        mean_reward = np.mean(rewards)
        std_reward = np.std(rewards)  # population standard deviation
        if std_reward == 0:
            return [0.0] * len(rewards)  # all same score: no signal
        advantages = (rewards - mean_reward) / std_reward
        return advantages.tolist()

Partial credit in action — secret word POUND, temperature 0.7:

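| # | Guess | Reward | Advantage |
| --- | --- | --- | --- |
| 0 | FOUND | 0.8 | +1.3525 |
| 1 | NOUDI | 0.6 | +0.6312 |
| 2 | FOUND | 0.8 | +1.3525 |
| 3 | WORD | 0.0 | −1.5328 |
| 4 | CROWN | 0.2 | −0.8115 |
| 5 | INDOM | 0.3 | −0.4508 |
| 6 | DONUT | 0.5 | +0.2705 |
| 7 | FROWN | 0.2 | −0.8115 |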

GRPO will now increase the probability of generating FOUND-like guesses and decrease the probability of WORD-like guesses. Learning is happening.
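The table can be reproduced by running the two functions from this section on the eight guesses. The sketch below restates both functions so that it runs on its own; only the printing loop is new.

```python
import numpy as np

def wordle_reward_partial_credit(guess: str, secret_word: str) -> float:
    if len(guess) != len(secret_word):
        return 0.0  # wrong length gets nothing
    valid_letters = set(secret_word)
    reward = 0.0
    for letter, secret_letter in zip(guess, secret_word):
        if letter == secret_letter:
            reward += 0.2  # right letter, right position
        elif letter in valid_letters:
            reward += 0.1  # right letter, wrong position
    return reward

def compute_advantages(rewards: list) -> list:
    rewards = np.array(rewards, dtype=float)
    std_reward = np.std(rewards)  # population standard deviation
    if std_reward == 0:
        return [0.0] * len(rewards)
    return ((rewards - np.mean(rewards)) / std_reward).tolist()

guesses = ["FOUND", "NOUDI", "FOUND", "WORD", "CROWN", "INDOM", "DONUT", "FROWN"]
rewards = [wordle_reward_partial_credit(g, "POUND") for g in guesses]
advantages = compute_advantages(rewards)

for i, (guess, reward, adv) in enumerate(zip(guesses, rewards, advantages)):
    print(f"{i}  {guess:<6} {reward:.1f}  {adv:+.4f}")
```

The group mean is 0.425, so DONUT (0.5) lands just above average and INDOM (0.3) just below, which is the kind of fine-grained ranking the binary reward could not provide.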

07 Temperature — The Critical Knob

GRPO requires two things from its response batch: diversity in responses and diversity in rewards. Temperature controls both. Get it wrong and learning stalls.

[Figure: three temperature settings, each showing guesses with reward and advantage. Temperature = 0: all guesses identical (DOWNY, DOWNY, DOWNY), each with reward 0.5 and advantage 0; no diversity, no signal, no learning. Temperature ≈ 0.7: diverse, relevant guesses (FOUND 0.8, +1.35; NOUDI 0.6, +0.63; WORD 0.0, −1.53); good diversity, clear signal, learning. Temperature = 1.3: too random, low quality (GROUND 0.0, −1.94; SPIND 0.5, +0.74; FINDS 0.2, −0.87); diversity but low quality, slow learning.]
Temperature 0 → no diversity, advantage always 0. Temperature ~0.7 → optimal balance. Temperature too high → diverse but low-quality responses slow the process.
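The effect of temperature can be seen directly in how it reshapes a sampling distribution. The sketch below applies the standard temperature-scaled softmax to a made-up set of scores for four candidate tokens; the logit values are hypothetical, and temperature 0 is treated as greedy decoding.

```python
import numpy as np

def sampling_distribution(logits, temperature: float) -> np.ndarray:
    """Probabilities the sampler would draw from at a given temperature."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0  # greedy: always pick the top choice
        return probs
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical scores for four candidates
for t in (0, 0.7, 1.3):
    print(t, np.round(sampling_distribution(logits, t), 3))
```

At temperature 0 every sample is the same choice, so a group of responses collapses to one answer; as temperature rises, probability mass moves toward the lower-scored candidates, which adds diversity but also more weak responses.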

08 LLM as Judge — Beyond Verifiable Tasks

Programmable reward functions work beautifully for tasks with clear correctness criteria — math, code, Wordle. But what about tasks where "good" is harder to define, like creative writing quality or nuanced reasoning?

The answer is to use a separate LLM as a proxy for human judgement — a "judge" model that scores responses on subjective criteria. This creates a reward function from an LLM's assessment, allowing GRPO to fine-tune in situations where outcomes are not easily verifiable by a simple programmatic test.

The judge model acts as a stand-in for a human annotator — but runs automatically at scale. The key requirement is that the judge model's scoring aligns well enough with actual human preferences that optimising for it produces genuinely better outputs. This is a design decision worth testing carefully.
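A judge-based reward can be wrapped in the same function shape as the programmatic rewards. The sketch below is an assumption-laden illustration: `judge` is a hypothetical callable that sends text to some judge model and returns its reply, and the prompt wording, the 1 to 10 scale, and the mapping to [0, 1] are choices made for this example.

```python
import re
from typing import Callable

# Hypothetical judging instructions; real prompts need careful testing.
JUDGE_PROMPT = (
    "Rate the response to the prompt on a scale from 1 (poor) to 10 "
    "(excellent). Reply with the number only.\n\n"
    "Prompt: {prompt}\n\nResponse: {response}"
)

def judge_reward(prompt: str, response: str, judge: Callable[[str], str]) -> float:
    """Ask the judge for a 1-10 score and map it to a reward in [0, 1]."""
    reply = judge(JUDGE_PROMPT.format(prompt=prompt, response=response))
    match = re.search(r"\b(10|[1-9])\b", reply)
    if match is None:
        return 0.0  # unparseable verdict: give no reward
    return (int(match.group(1)) - 1) / 9.0

# Usage with a stand-in judge; a real one would call a model.
fake_judge = lambda text: "Score: 7"
print(judge_reward("Write a haiku about rain.", "Soft rain on tin roofs", fake_judge))
```

Before training against such a reward, it is worth comparing the judge's scores with human ratings on a sample of responses, since the model will exploit any systematic gap between the two.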

Summary — which method to use?

| Method | Reward Source | Data Needed | Best For |
| --- | --- | --- | --- |
| SFT | Ground truth labels | 1000s of labelled pairs | Simple, well-defined tasks with abundant data |
| RLHF | Human rankings | Ranked generations + reward model | Preference alignment: tone, helpfulness |
| DPO | Human preference pairs | Chosen/rejected pairs | Simpler preference alignment than RLHF |
| GRPO ✦ | Programmatic / LLM judge | As few as 10 examples | Multi-step reasoning, math, code, verifiable tasks |