Backpropagation Neural Networks Chain Rule micrograd · 15 min read

00 What's under the hood

Every serious neural network in the world is trained by one algorithm — and most of the people who rely on it have never watched it run.

They import it. A single line, loss.backward(), and a rack of GPUs quietly sets about adjusting millions of weights. What actually happens inside that call is treated, reasonably enough, as someone else's problem.

Andrej Karpathy's micrograd lecture is built on the hunch that this is a mistake. The move is almost contrarian in its plainness: shut the libraries, drop the GPUs, and rebuild the whole machine by hand on ordinary numbers in a blank notebook — until a small neural network defines and trains itself in front of you, every operation visible, nothing tucked away.

What micrograd actually implements — the thing doing the work behind that one loss.backward() line — is backpropagation: the algorithm that efficiently works out the gradient of a loss function with respect to a network's weights. That gradient is the signal we use to nudge the weights so the loss goes down and the predictions improve — and it sits at the mathematical heart of every serious deep-learning framework, PyTorch and JAX included.

A surprisingly small idea The engine that gives you the full power of neural-network training is about 100 lines of very simple Python. The neural-network library on top of it is another ~150. Everything else in a production framework is just efficiency.

One disclaimer up front: micrograd is a scalar-valued engine. It works on individual numbers like 2.0 and -3.0, breaking neural nets all the way down to the atoms of little pluses and times. You'd never do this in production — real libraries pack scalars into n-dimensional tensors so the work can run in parallel. But none of the math changes. Doing it at the scalar level is what makes the ideas impossible to forget.

This post follows Andrej Karpathy's lecture “The spelled-out intro to neural networks and backpropagation: building micrograd.”  Video · micrograd on GitHub

01 What backpropagation actually is

A neural network takes two kinds of input — the data and the weights — runs them through a mathematical expression, and produces predictions and, ultimately, a single loss: a number measuring how wrong the predictions are. We want that number small.

Backpropagation answers exactly one question, but for every weight at once: if I nudge this weight a tiny bit, which way does the loss move, and how strongly? That sensitivity is the gradient. Once you have it for every weight, you know how to change all of them to push the loss down.

Forward pass compute the loss Backward pass chain rule → gradients Update weights step down the gradient repeat — one training step
forward (data → loss) backward (loss → gradients) update & repeat

This loop — forward, backward, nudge, repeat — is the mathematical core of every modern deep-learning library, PyTorch and JAX included. And it's more general than neural nets: backprop doesn't care what the expression means. It computes derivatives through any chain of operations. We just happen to point that machinery at neural networks.

The plan We'll make this concrete in four moves: (1) get a gut feel for what a gradient is, (2) compute one by hand, (3) backpropagate through a small expression manually, then (4) wrap the whole thing in code and confirm it against PyTorch.

02 What is a gradient?

A gradient is just a derivative — and a derivative is just a slope. It measures the instantaneous rate at which a function's output changes when you nudge its input. Geometrically, it's the slope of the line that grazes the curve at a point. Formally, it's a limit:

$$f'(x)=\lim_{h\to 0}\frac{f(x+h)-f(x)}{h}$$

Read it plainly: bump the input by a tiny amount $$h$$, see how much the output moved, and divide by $$h$$. Does the function go up or down, and how steeply? That ratio — rise over run — is the slope at that point.

Intuition: the slope changes along the curve

Take a simple parabola, $$f(x)=3x^2-4x+5$$. We don't need to symbolically differentiate it (no one differentiates a real neural net by hand — the expression would have tens of thousands of terms). We just need to feel what the slope is doing at different points.

x f(x) -2-10 123 slope < 0 slope = 0 slope > 0
f(x) tangent — the slope at that point
Same curve, three points, three different slopes. On the left the function is heading down (negative slope); at the bottom it's momentarily flat (slope zero); on the right it's climbing (positive slope).

To get an actual number, we don't take $$h$$ all the way to zero — we just pick something tiny, like $$0.0001$$, and apply the definition directly:

python
def f(x): return 3*x**2 - 4*x + 5 h = 0.0001 x = 3.0 slope = (f(x + h) - f(x)) / h print(slope) # ≈ 14.0 (and indeed f'(x) = 6x − 4, so f'(3) = 14)

Bump $$x$$ up from 3 and the output grows with slope ~14. Try $$x=-3$$ and you'd get a negative number — there the curve heads down. And right at the bottom of the parabola the slope is zero: nudging the input barely moves the output at all. That sign and magnitude is the entire signal backprop is after.

Computing a gradient by hand

Now a function of several inputs, $$d = a\cdot b + c$$, so we can ask for the slope with respect to each one separately. Same trick: fix the inputs, nudge one of them by $$h$$, and measure the response.

python
h = 0.0001 a, b, c = 2.0, -3.0, 10.0 d1 = a*b + c d2 = (a + h)*b + c # nudge a print((d2 - d1) / h) # -3.0 → dd/da = b d2 = a*(b + h) + c # nudge b print((d2 - d1) / h) # 2.0 → dd/db = a d2 = a*b + (c + h) # nudge c print((d2 - d1) / h) # 1.0 → dd/dc = 1

The results line up with calculus exactly: differentiating $$a\cdot b + c$$ with respect to $$a$$ gives $$b$$ (which is $$-3$$), with respect to $$b$$ gives $$a$$ (which is $$2$$), and with respect to $$c$$ gives $$1$$. Each gradient tells us how its input is pulling on the output. That's all we ever compute — we just need a way to do it through a much longer chain of operations without nudging every input numerically. That way is the chain rule.

03 Backpropagation by hand

Let's build a slightly bigger expression and propagate gradients all the way through it manually. With inputs $$a=2,\,b=-3,\,c=10,\,f=-2$$, define:

e = a · b d = e + c L = d · f

First, the forward pass

The forward pass just evaluates the expression left-to-right, substituting numbers as we go. Nothing clever happens here — it is ordinary arithmetic:

forward pass
e = a * b = 2 * (-3) = -6 d = e + c = -6 + 10 = 4 L = d * f = 4 * (-2) = -8

Reading it line by line: e multiplies the two inputs, $$2$$ and $$-3$$, giving $$-6$$. Then d adds $$c=10$$ to that result, landing on $$4$$. Finally L multiplies by $$f=-2$$, for an output of $$-8$$. Nothing here is special to neural networks — it is the same substitution you'd do on paper.

Along the way every intermediate value is recorded — $$e=-6$$ and $$d=4$$ — because the backward pass will need them. Now we want the gradient of $$L$$ with respect to every node, including the leaves $$a,b,c,f$$. In a real network those leaves would be the weights and $$L$$ would be the loss. We compute these right-to-left: the backward pass.

× b × a + 1 + 1 × f × d × + × a data 2.00 grad 6.00 b data -3.00 grad -4.00 c data 10.00 grad -2.00 e = a×b data -6.00 grad -2.00 d = e+c data 4.00 grad -2.00 f data -2.00 grad 4.00 L = d×f data -8.00 grad 1.00 ▶ Forward — compute data ◀ Backward — chain-rule gradients
data & forward flow grad & backward flow
Each box carries two numbers: data (the forward value, blue) and grad (the derivative of L with respect to it, pink). Pink edges are labelled with the local derivative at each operation.

The one rule we need: the chain rule

The backward pass is nothing but the chain rule from calculus, applied over and over:

Definition · the chain rule If a variable $$z$$ depends on a variable $$y$$, which itself depends on a variable $$x$$, then $$z$$ depends on $$x$$ as well — through the intermediate variable $$y$$ — and the derivatives multiply:
$$\frac{dz}{dx}=\frac{dz}{dy}\cdot\frac{dy}{dx}$$
In plain terms: to find how $$x$$ affects $$z$$, multiply how $$x$$ affects $$y$$ by how $$y$$ affects $$z$$. Backpropagation is this one rule applied over and over, from the output back to every input.

That single fact is the whole engine. To get the gradient of $$L$$ at some node deep in the graph, we take the gradient already computed one step closer to the output and multiply it by the local derivative of the operation connecting them. Chaining these one-step multiplications from the output all the way back to the leaves is exactly backpropagation.

The backward pass is mechanical once you see the pattern. Start at the output, where the base case is trivial — nudge $$L$$ and $$L$$ moves by the same amount:

$$\frac{dL}{dL}=1$$

Through the last multiply, $$L=d\cdot f$$. The derivative of a product with respect to one factor is the other factor:

$$\frac{dL}{dd}=f=-2,\qquad \frac{dL}{df}=d=4$$

Through the plus, $$d=e+c$$. Here's the crux of the whole algorithm. We already know how $$L$$ depends on $$d$$. We need how $$L$$ depends on $$c$$ (and $$e$$). The local question — how does $$c$$ affect $$d$$? — is easy: the derivative of a sum is $$1$$. The chain rule says we multiply the local derivative by the gradient that already arrived from above:

$$\frac{dL}{dc}=\frac{dL}{dd}\cdot\frac{dd}{dc}=-2\cdot 1=-2$$ $$\frac{dL}{de}=\frac{dL}{dd}\cdot\frac{dd}{de}=-2\cdot 1=-2$$
A plus node just routes gradient Because the local derivative of addition is exactly $$1$$, a + node copies its incoming gradient, unchanged, to both of its inputs. Gradient flows straight through it.

Through the first multiply, $$e=a\cdot b$$. Same recipe — local derivative (the other factor) times the gradient on $$e$$:

$$\frac{dL}{da}=\frac{dL}{de}\cdot\frac{de}{da}=-2\cdot b=-2\cdot(-3)=6$$ $$\frac{dL}{db}=\frac{dL}{de}\cdot\frac{de}{db}=-2\cdot a=-2\cdot 2=-4$$
This is the entire algorithm

That's it. We walked from the output backward, and at every node we did one small step: take the gradient that arrived from above, multiply by the local derivative of that node's operation, and pass it to the children. Backpropagation is just a recursive application of the chain rule, backward through the graph.

Each node only ever needs to know its own little operation — how its inputs map to its output. It knows nothing about the giant graph it sits inside. That locality is what makes it scale to billions of parameters.

And the payoff: those gradients are directions. $$\tfrac{dL}{da}=6$$ means a small increase in $$a$$ makes $$L$$ go up; to make $$L$$ go down, we'd step $$a$$ the other way. Do that for every leaf and you've taken one step of training.

04 Backpropagation in code

Doing this by hand is obviously not a plan. So we build a small Value object that wraps a single number and remembers how it was made — what operation produced it and from which children. That bookkeeping is exactly what lets us replay the chain rule automatically.

python · engine.py
class Value: def __init__(self, data, _children=(), _op=''): self.data = data self.grad = 0.0 # 0 = "no effect on the output yet" self._backward = lambda: None # how to push grad to children self._prev = set(_children) # who made me self._op = _op # which operation made me def __repr__(self): return f"Value(data={self.data}, grad={self.grad})"

Two ideas live in every node: grad starts at 0.0 (a node is assumed to have no influence until proven otherwise), and _backward is a little function that knows how to take this node's gradient and distribute it to its inputs. Each operation fills in its own _backward. The constructor also stashes _prev — the set of children this value was built from — and _op, a short label for the operation that produced it; together those two fields are what let us recover the computational graph later. __repr__ is pure convenience: it just makes a Value print as something readable like Value(data=-8.0, grad=1.0).

Now the operations. Each one builds the output value, records its children, and stores the local chain-rule step as a closure:

python
def __add__(self, other): out = Value(self.data + other.data, (self, other), '+') def _backward(): self.grad += 1.0 * out.grad # local deriv of + is 1 other.grad += 1.0 * out.grad out._backward = _backward return out def __mul__(self, other): out = Value(self.data * other.data, (self, other), '*') def _backward(): self.grad += other.data * out.grad # local deriv is the *other* input other.grad += self.data * out.grad out._backward = _backward return out

Notice these closures encode exactly the hand rules from the last section: + routes the gradient (multiply by 1), and * sends each input the other input's data, both scaled by the gradient arriving on out. Both write with += rather than = — why that matters is the first gotcha at the end of this section.

Two more operations round out the arithmetic. __pow__ raises a value to a fixed power via the power rule, $$\frac{d}{dx}x^{n}=n\,x^{n-1}$$. And because Python only calls __add__ or __mul__ when the Value sits on the left of the operator, we add the reversed hooks __radd__ and __rmul__ so that a bare number on the left — as in 2 * a — works too:

python
def __pow__(self, other): assert isinstance(other, (int, float)) # only int/float powers out = Value(self.data**other, (self,), f'**{other}') def _backward(): self.grad += other * self.data**(other-1) * out.grad # power rule out._backward = _backward return out def __radd__(self, other): # 2 + a -> a + 2 return self + other def __rmul__(self, other): # 2 * a -> a * 2 return self * other

With __pow__ in hand, negation, subtraction and division all fall out for free — -a is a * -1, a - b is a + (-b), and a / b is a * b**-1 — so none of them needs a hand-written _backward of its own.

Operations don't have to be atomic. As long as you can write a forward value and a local derivative, the node can be as composite as you like. A neuron's squashing function $$\tanh$$ is a good example — its derivative is famously $$1-\tanh^2$$:

python
def tanh(self): x = self.data t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1) out = Value(t, (self,), 'tanh') def _backward(): self.grad += (1 - t**2) * out.grad # d/dx tanh(x) = 1 − tanh²(x) out._backward = _backward return out

The level you carve your operations at is entirely up to you. All that matters is a correct forward value and a correct local gradient — then the chain rule stitches them together.

Two bugs that will bite you 1. Accumulate gradients with +=, never =. If a value is used more than once in an expression, gradient arrives along several paths, and the multivariable chain rule says those contributions add. Because every grad starts at 0.0, += sums them correctly; plain = lets the second path silently overwrite the first — a classic, hard-to-spot bug.

2. Reset gradients to zero before each backward pass. That same accumulation works against you across training steps: gradients left over from the previous iteration would pile onto the new ones and corrupt the update. So every pass through the training loop must first set each parameter's .grad back to 0.0 and only then call backward(). Forgetting this zero_grad step is the single most common training bug — it is exactly what PyTorch's optimizer.zero_grad() exists to do.

05 Getting the order right: topological sort

The one remaining job is calling each node's _backward in the correct order. A node can't push gradient to its children until its own gradient is finished — which means everything downstream of it must run first. We never want to back-propagate through a node before everything that depends on it has already deposited its share.

This ordering problem is exactly what a topological sort solves. For any directed acyclic graph it produces a linear order in which every node comes after all the nodes it depends on — and a graph can have several equally valid such orders. The figure below shows the idea on an abstract graph; micrograd builds one of these orderings for our own expression graph with a depth-first walk that adds a node only after all of its children, then runs the backward pass over that list in reverse — starting at the output.

Directed acyclic graph (DAG) — every edge points forward A B C D E F G One valid topological order — each node after the ones it depends on A B C D G F E Another valid order — same graph, still legal A G B D C F E
A topological sort lays a directed acyclic graph out so every edge points forward — each node after the ones it was built from. A graph can have several valid orderings (two are shown here). Reverse any of them and you get a legal order for the backward pass: every node receives its full gradient before it has to pass any on.

We hide all of this behind a single backward() method on the output node. It builds the order, seeds the base case $$\tfrac{dL}{dL}=1$$, and fires each node's local step:

python
def backward(self): topo = [] visited = set() def build_topo(v): if v not in visited: visited.add(v) for child in v._prev: build_topo(child) topo.append(v) build_topo(self) self.grad = 1.0 # base case: dL/dL = 1 for node in reversed(topo): node._backward()

Now the whole thing runs itself. We rebuild the same expression from Section 03 and call backward() once:

python
a = Value(2.0) b = Value(-3.0) c = Value(10.0) f = Value(-2.0) e = a * b # -6 d = e + c # 4 L = d * f # -8 L.backward() print(a.grad, b.grad, c.grad, f.grad) # 6.0 -4.0 -2.0 4.0

The same numbers we ground out by hand — 6, -4, -2, 4 — now fall out of one call. That's a working autograd engine.

06 The same thing in PyTorch

Everything we built mirrors a production library. PyTorch is what you'd actually use — and our Value object was modelled on its API. The big difference is that PyTorch is built around tensors (n-dimensional arrays of scalars) so the same operations run massively in parallel. Our scalar engine is simply the special case where every tensor holds a single element.

Here is the exact expression from above, in PyTorch. We cast to double to match Python's default float precision, and mark the leaves as requiring gradients (PyTorch skips them by default for efficiency):

python · pytorch
import torch a = torch.tensor([2.0], dtype=torch.double, requires_grad=True) b = torch.tensor([-3.0], dtype=torch.double, requires_grad=True) c = torch.tensor([10.0], dtype=torch.double, requires_grad=True) f = torch.tensor([-2.0], dtype=torch.double, requires_grad=True) e = a * b d = e + c L = d * f L.backward() print(a.grad.item(), b.grad.item(), c.grad.item(), f.grad.item()) # 6.0 -4.0 -2.0 4.0

Identical gradients. PyTorch tensors carry a .data and a .grad just like our Value, expose a .backward(), and run the very same chain rule under the hood — only over big arrays instead of single numbers, with all the parallelism that buys.

Same skeleton, all the way up A real network swaps the toy expression for a much bigger graph, the loss for something like cross-entropy, and plain gradient descent for fancier update rules. But the core never changes: build a graph, run forward to a single loss, call backward to get gradients, nudge the weights, repeat. Master this and you've met every moving part of training — from a 41-parameter net to a model with hundreds of billions.

07 The big picture

Stepping back, here is the whole story in one breath. A neural net is a fairly simple mathematical expression that takes the data and the weights as input, runs a forward pass, and ends in a loss function measuring how good the predictions are. We arrange that loss so it is low when the network behaves well.

  • Forward pass — evaluate the expression left-to-right to get the loss.
  • Backward pass — apply the chain rule right-to-left to get the gradient of the loss with respect to every weight.
  • Gradient descent — nudge each weight a small step against its gradient, lowering the loss locally, and iterate many times.

That's backpropagation, and that's training. The gradient gives us a handle on a quantity we can't otherwise control — the final loss — and following it, over and over, is what lets a blob of simulated neural tissue learn to do remarkably complex things. Everything past this point is efficiency and scale.

Keep going The natural next step is to stack these Value nodes into a neuron, neurons into a layer, layers into a multi-layer perceptron, define a loss over a dataset, and run the train loop. The engine you just built is all you need — see the full micrograd repository and Karpathy's lecture.