Gradient Descent — The Atom of Machine Learning

00 From one neuron to a billion parameters

Every conversation about deep learning eventually wants to start at the top — attention heads, billion-parameter models, emergent behaviour. This post starts at the bottom, on purpose. We are going to build a cat classifier with a single neuron and train it by hand, equation by equation, until the machinery is impossible to forget.

The task is binary classification: given an image, decide cat (1) or not-cat (0). A logistic regression model is the smallest honest answer to that question. It is also — and this is the point — structurally identical to a single unit inside any large network. The weighted sum, the non-linearity, the loss, and the gradient update you'll see below are the four verbs that every larger model conjugates billions of times.

Turning an image into a vector

A color image isn't a list of numbers yet, so we unroll it. A $$64 \times 64$$ RGB image is three stacked channel matrices; we flatten them into one tall column vector:

\underbrace{64 \times 64\ \text{Image}}_{\text{pixels}} \;\rightarrow\; \underbrace{\text{RGB}\,(64 \times 64 \times 3)}_{\text{three channels}} \;\rightarrow\; \begin{bmatrix} R_{11} \\ R_{21} \\ \vdots \\ G_{11} \\ G_{21} \\ \vdots \\ B_{11} \\ B_{21} \\ \vdots \end{bmatrix}

That gives a feature vector $$x \in \mathbb{R}^{n_x}$$ with $$n_x = 64 \times 64 \times 3 = 12{,}288$$ numbers. Everything that follows operates on this $$x$$.
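As a rough sketch of the unrolling, assume the image is held as a NumPy array of shape (64, 64, 3), indexed as (row, column, channel). The array name `img` and its random pixel values are placeholders for illustration. The transpose reproduces the ordering drawn above (all of R, then G, then B, walking down each column first); a plain `img.reshape(-1, 1)` would also give 12,288 numbers, just in a different order.

```python
import numpy as np

# Hypothetical stand-in for a real photo: 64x64 pixels, 3 channels (R, G, B).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3))

# Reorder axes to (channel, column, row) so that flattening walks
# R11, R21, ..., then all of G, then all of B, as in the diagram.
x = img.transpose(2, 1, 0).reshape(-1, 1)

n_x = x.shape[0]
print(x.shape)  # (12288, 1)
```

Any fixed ordering works for training, as long as every image is flattened the same way.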

Notation we'll reuse throughout

A single training example is a pair $$(x,y)$$ with $$x \in \mathbb{R}^{n_x}$$ and label $$y \in \{0,1\}$$. A dataset is $$m$$ such pairs:

\{(x^{(1)},y^{(1)}),\,\dots,\,(x^{(m)},y^{(m)})\}

We stack the examples column-by-column into matrices $$X$$ and $$Y$$ — the shape choice that makes vectorization clean later:

$$X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix},\quad X \in \mathbb{R}^{(n_x,\,m)}$$

$$Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix},\quad Y \in \mathbb{R}^{(1,\,m)}$$

We write $$m_{\text{train}}$$ and $$m_{\text{test}}$$ when the split matters.
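The column-by-column stacking can be sketched in NumPy as follows. The five random examples and their labels are made up purely to show the shapes; nothing here comes from a real dataset.

```python
import numpy as np

m, n_x = 5, 12288  # five hypothetical examples, each a flattened 64x64x3 image
rng = np.random.default_rng(0)

# Each example is a column vector of shape (n_x, 1); labels are 0 or 1.
examples = [rng.random((n_x, 1)) for _ in range(m)]
labels = [1, 0, 0, 1, 1]

X = np.hstack(examples)              # columns are x^(1) ... x^(m)
Y = np.array(labels).reshape(1, m)   # one row of labels

print(X.shape, Y.shape)  # (12288, 5) (1, 5)
```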

01 Weights

We want a probability estimate $$\hat{y} = P(y=1 \mid x)$$ — "how confident are we this is a cat?" The model's first job is to score the evidence in $$x$$, and it does this with a weight for every feature.

The weights live in a vector $$w \in \mathbb{R}^{n_x}$$, one weight per pixel-channel. The score is the dot product $$w^Tx$$: each feature is multiplied by its weight and the results are summed.

Intuition

A weight is the model's opinion about a feature. A large positive $$w_i$$ says "when pixel $$i$$ is bright, lean toward cat." A large negative weight pushes the other way. A weight near zero says the feature is noise. Training is nothing more than discovering the right opinions from data.

This same dot product is what every neuron in a deep network computes. Scale $$n_x$$ up, stack thousands of these units, and you have a layer.

02 Biases

Weights tilt the decision; the bias shifts it. The bias $$b \in \mathbb{R}$$ is a single number added after the weighted sum, giving the model's raw score:

$$z = w^Tx + b$$

Without a bias, the decision boundary is forced through the origin — the model can only ask "which features are present" and never "is there a baseline tendency toward cat regardless of the pixels?" The bias supplies exactly that adjustable threshold.

Parameters of the model

Logistic regression has precisely two learnable objects: $$w \in \mathbb{R}^{n_x}$$ and $$b \in \mathbb{R}$$. Everything in the rest of this post is in service of choosing good values for these two.
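To make the score concrete, here is a tiny worked example with three features instead of 12,288. The numbers are invented for illustration only.

```python
import numpy as np

# Made-up numbers: three features instead of 12,288.
w = np.array([[0.5], [-1.0], [2.0]])   # weights, shape (n_x, 1)
x = np.array([[1.0], [2.0], [3.0]])    # one example, shape (n_x, 1)
b = 0.5                                # bias

# z = w^T x + b = 0.5*1 + (-1)*2 + 2*3 + 0.5
z = (w.T @ x).item() + b
print(z)  # 5.0
```

The third feature dominates because its weight is largest; the bias then shifts the whole score up by 0.5.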

03 The Sigmoid Function

The score $$z = w^Tx + b$$ can be any real number — wildly positive, wildly negative. But we promised a probability, which must live in $$[0,1]$$. The sigmoid $$\sigma$$ is the squashing function that performs the conversion:

\hat{y} = \sigma(w^Tx + b),\qquad \sigma(z) = \frac{1}{1+e^{-z}}

The sigmoid maps all of ℝ into (0, 1). The curve passes through its center point (0, 0.5) and flattens toward both saturation limits, 0 and 1.

Reading the curve at its extremes tells you everything about its behaviour:

If $$z$$ is very large, then $$e^{-z} \approx 0$$, so $$\sigma(z) \approx 1$$ — confidently "cat."
If $$z$$ is very small (large negative), then $$e^{-z} \to \infty$$, so $$\sigma(z) \approx 0$$ — confidently "not cat."
At $$z = 0$$, $$\sigma(0) = 0.5$$ — maximal uncertainty.
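The three cases above are easy to check numerically. This is a minimal sketch of the sigmoid in NumPy; the sample inputs -10, 0, and 10 are arbitrary choices to stand in for "very small", "zero", and "very large".

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z))
```

At $$z = 10$$ the output is already about 0.99995, and at $$z = -10$$ it is about 0.00005, so the curve saturates quickly on both sides.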

Why this matters downstream

The sigmoid is the original "activation function." Modern networks mostly swap it for ReLU in hidden layers, but it survives wherever a value must be squashed into (0, 1): the final layer of a binary classifier and the gates inside an LSTM are the standard examples. The squash-into-a-range idea never leaves you.

04 The Loss Function

We now have a prediction $$\hat{y}$$. To improve it we need a number that says how wrong we were on a single example. That number is the loss.

The tempting (wrong) choice: squared error

The obvious loss is squared error:

\mathscr{L}(\hat{y},y) = \tfrac{1}{2}(\hat{y}-y)^2

It works for linear regression, but it sabotages logistic regression. Because $$\hat{y}$$ comes through the sigmoid, the resulting optimization surface is non-convex in $$w, b$$: it can have flat plateaus and more than one local minimum, and gradient descent can stall on either. We want a bowl, not a mountain range.

The right choice: log loss

Instead we use the binary cross-entropy (log) loss:

\mathscr{L}(\hat{y},y) = -\Big[\,y\log{\hat{y}} + (1-y)\log{(1-\hat{y})}\,\Big]

It looks opaque until you plug in the two possible labels — then it reads like plain English:

When y = 1

The second term vanishes, leaving $$\mathscr{L} = -\log{\hat{y}}$$. To make the loss small, $$\log{\hat{y}}$$ must be large, so $$\hat{y}$$ must be pushed toward its ceiling of 1, exactly what we want for a true cat. The sigmoid keeps it just below 1.

When y = 0

The first term vanishes, leaving $$\mathscr{L} = -\log{(1-\hat{y})}$$. To make the loss small, $$1-\hat{y}$$ must be large, so $$\hat{y}$$ must be pushed toward its floor of 0, exactly what we want for a non-cat. The sigmoid keeps it just above 0.

The loss is engineered so that "small loss" and "correct, confident prediction" mean the same thing. And crucially, it is convex in $$w,b$$ — a single global minimum to descend into.
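Plugging a few numbers into the log loss shows the asymmetry between confident-and-right and confident-and-wrong. The helper name `log_loss` and the sample predictions are illustrative choices.

```python
import math

def log_loss(y_hat, y):
    """Binary cross-entropy for one example; y is 0 or 1, 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

for y_hat, y in [(0.9, 1), (0.1, 1), (0.1, 0), (0.9, 0)]:
    print(f"y={y}  y_hat={y_hat}  loss={log_loss(y_hat, y):.3f}")
```

A confident correct prediction costs about 0.105, while a confident wrong one costs about 2.303, roughly twenty times more. The penalty grows without bound as the wrong prediction approaches certainty.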

05 The Cost Function

Loss scores one example. To train, we need a verdict on the whole training set. That's the cost function $$J$$ — the average loss over all $$m$$ examples:

J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \mathscr{L}\!\left(\hat{y}^{(i)},\,y^{(i)}\right)

Expanding the loss gives the form we'll actually differentiate:

J(w,b) = -\frac{1}{m} \sum_{i=1}^{m} \Big[\,y^{(i)}\log{\hat{y}^{(i)}} + (1-y^{(i)})\log{(1-\hat{y}^{(i)})}\,\Big]

Loss vs. cost: the one-line distinction

Loss measures performance on a single example. Cost measures performance over the entire training set. Training means finding $$w, b$$ that minimize $$J(w,b)$$.
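As a small sketch of the cost, assume predictions and labels are stored as row vectors of shape (1, m), matching the notation section. The function name `cost`, the three sample predictions, and the `eps` clipping (a common guard against taking the log of zero, not something the derivation requires) are all assumptions of this example.

```python
import numpy as np

def cost(A, Y, eps=1e-12):
    """Average log loss over m examples; A and Y have shape (1, m)."""
    A = np.clip(A, eps, 1 - eps)   # assumption: guard against log(0)
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(np.mean(losses))

A = np.array([[0.9, 0.2, 0.6]])   # made-up predictions
Y = np.array([[1,   0,   1  ]])   # made-up labels
print(cost(A, Y))
```

For these three examples the per-example losses are about 0.105, 0.223, and 0.511, so the cost comes out near 0.28.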

And that — minimizing a function — is a job for calculus. Enter gradient descent.

06 Gradient Descent

We have a cost surface $$J(w,b)$$ and we want its lowest point. We can't solve for it directly, so we walk downhill. Gradient descent is the rule for taking each step:

\text{Repeat:}\quad w := w - \alpha\,\frac{\partial J(w)}{ \partial w}

where

$$\alpha$$ is the learning rate — the step size,
$$\dfrac{\partial J(w)}{\partial w}$$ is the slope of the cost at the current point,
$$:=$$ means "update / assign."

Each iteration subtracts α·(dJ/dw), stepping the weight downhill along the gradient toward the global cost minimum.

The sign of the slope handles direction automatically:

Start on the right of the bowl: $$\frac{\partial J}{\partial w} > 0$$, so we subtract a positive quantity and the point moves left, toward the minimum.
Start on the left: $$\frac{\partial J}{\partial w} < 0$$, so we subtract a negative quantity — effectively add — and the point moves right, toward the minimum.

Either way, the point approaches the bottom. That self-correcting behaviour is why convexity (from the log loss) mattered so much.

Since $$J$$ depends on both parameters, we update each with its own partial derivative:

w := w - \alpha\,\frac{\partial J(w,b)}{\partial w},\qquad b := b - \alpha\,\frac{\partial J(w,b)}{\partial b}

The symbol $$\partial$$ is the partial derivative — the same idea as $$d$$, used when a function depends on more than one variable.
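The update rule can be watched in action on a one-dimensional bowl. This sketch swaps in a toy cost $$J(w) = (w-3)^2$$ rather than the logistic cost, purely because its minimum (at $$w = 3$$) is known in advance; the learning rate and step count are arbitrary choices.

```python
def dJ_dw(w):
    """Slope of the toy bowl J(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def descend(w, alpha=0.1, steps=100):
    """Apply w := w - alpha * dJ/dw repeatedly and return the final w."""
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w

print(descend(10.0))   # starts right of the minimum, slope positive
print(descend(-4.0))   # starts left of the minimum, slope negative
```

Both starting points end up extremely close to 3. The sign of the slope does the steering, exactly as described above.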

The computation graph: bookkeeping for derivatives

Before differentiating the real model, it helps to see how derivatives flow through a chain of operations. Take a toy function $$J(a,b,c) = 3(a+bc)$$ and break it into atomic steps:

u = b·c → v = u + a → J = 3v

The forward pass computes values left-to-right; the backward pass propagates derivatives right-to-left via the chain rule.

Computing $$J$$ left-to-right is the forward pass. Computing derivatives right-to-left is the backward pass. With sample values $$a=5,\,b=3,\,c=2$$ (so $$u=6,\,v=11,\,J=33$$):

One step back — nudge $$v$$ by $$0.001$$ and $$J$$ moves by $$0.003$$, so:

\frac{dJ}{dv} = 3

The chain rule propagates this backward through every node:

$$\frac{\partial J}{\partial a} = \frac{\partial J}{\partial v}\cdot\frac{\partial v}{\partial a} = 3\cdot 1 = 3$$

$$\frac{\partial J}{\partial u} = \frac{\partial J}{\partial v}\cdot\frac{\partial v}{\partial u} = 3\cdot 1 = 3$$

$$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial u}\cdot\frac{\partial u}{\partial b} = 3\cdot c = 3\cdot 2 = 6$$

$$\frac{\partial J}{\partial c} = \frac{\partial J}{\partial u}\cdot\frac{\partial u}{\partial c} = 3\cdot b = 3\cdot 3 = 9$$
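The same toy graph can be run in code, with the hand-derived gradients checked against the "nudge and watch" method. The function name `forward` and the step size `h` are choices made for this sketch.

```python
def forward(a, b, c):
    """Forward pass through the toy graph: u = b*c, v = u + a, J = 3*v."""
    u = b * c
    v = u + a
    return 3 * v

a, b, c = 5.0, 3.0, 2.0

# Backward pass by hand, right to left through the graph.
dJ_dv = 3.0
dJ_du = dJ_dv * 1.0     # because v = u + a
dJ_da = dJ_dv * 1.0
dJ_db = dJ_du * c       # because u = b * c
dJ_dc = dJ_du * b

# Finite-difference check: nudge one input and watch J move.
h = 1e-6
num_da = (forward(a + h, b, c) - forward(a, b, c)) / h
num_db = (forward(a, b + h, c) - forward(a, b, c)) / h
num_dc = (forward(a, b, c + h) - forward(a, b, c)) / h

print(forward(a, b, c), dJ_da, dJ_db, dJ_dc)  # 33.0 3.0 6.0 9.0
```

The numerical estimates `num_da`, `num_db`, and `num_dc` agree with 3, 6, and 9 up to floating-point noise, which is a handy sanity check for any hand-written backward pass.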

This is backpropagation

The computation graph plus the chain rule is backprop. The only difference in a deep network is that the graph is enormous and the special output node being minimized is the cost $$J$$. Autograd engines like PyTorch's automate exactly this backward pass.

07 Gradient Descent on Logistic Regression

Now we apply the graph to the actual model — for a single example with two features $$x_1, x_2$$, weights $$w_1, w_2$$, and bias $$b$$.

Forward pass

$$z = w^Tx + b$$

$$\hat{y} = a = \sigma(z)$$

$$\mathscr{L}(a,y) = -\big(y\log a + (1-y)\log(1-a)\big)$$

The logistic regression computation graph: inputs flow forward to the loss; gradients flow backward, with the clean dz = a − y at its core.

Backward pass — the gradients

Working right-to-left through the graph. First the derivative of loss with respect to the activation:

da = \frac{\partial \mathscr{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}

Then the famous simplification. Because the sigmoid's derivative is $$\frac{\partial a}{\partial z} = a(1-a)$$, the messy $$\frac{\partial \mathscr{L}}{\partial a}$$ collapses beautifully once the two are multiplied:

dz = \frac{\partial \mathscr{L}}{\partial z} = \frac{\partial \mathscr{L}}{\partial a}\cdot\frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right)\cdot a(1-a) = -y(1-a) + (1-y)\,a = a - y

The one equation to memorize

$$dz = a - y$$: the gradient at the score is simply prediction minus truth. This elegant cancellation is precisely why log loss pairs with sigmoid. The error signal is the raw mistake.

From $$dz$$, the parameter gradients follow by one more chain-rule hop:

dw_1 = x_1\cdot dz,\qquad dw_2 = x_2\cdot dz,\qquad db = dz
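These formulas are easy to verify numerically. The sketch below compares the analytic gradients with central finite differences of the loss; the example values, parameter values, and step size `h` are made up for the check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Log loss of a two-feature logistic regression on one example."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# Made-up example and parameters.
x1, x2, y = 1.5, -2.0, 1
w1, w2, b = 0.3, 0.8, -0.1

# Analytic gradients from the backward pass.
a = sigmoid(w1 * x1 + w2 * x2 + b)
dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Central finite differences of the loss itself.
h = 1e-6
num_dw1 = (loss(w1 + h, w2, b, x1, x2, y) - loss(w1 - h, w2, b, x1, x2, y)) / (2 * h)
num_dw2 = (loss(w1, w2 + h, b, x1, x2, y) - loss(w1, w2 - h, b, x1, x2, y)) / (2 * h)
num_db = (loss(w1, w2, b + h, x1, x2, y) - loss(w1, w2, b - h, x1, x2, y)) / (2 * h)

print(dw1, num_dw1)
print(dw2, num_dw2)
print(db, num_db)
```

Each pair should match to several decimal places, confirming that $$dz = a - y$$ really is the derivative of the loss with respect to the score.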

Update the parameters

w_1 := w_1 - \alpha\,dw_1,\qquad w_2 := w_2 - \alpha\,dw_2,\qquad b := b - \alpha\,db

One example, one downhill step. Now scale it to the whole dataset.

08 Gradient Descent Across m Examples

The cost averages the loss over all examples, so its gradient averages the per-example gradients:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathscr{L}\!\left(a^{(i)},y^{(i)}\right),\qquad a^{(i)} = \hat{y}^{(i)} = \sigma\!\left(w^Tx^{(i)} + b\right)$$

$$\frac{\partial J(w,b)}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial w_1}\mathscr{L}\!\left(a^{(i)},y^{(i)}\right)$$

The naive algorithm (two nested loops)

Translated directly into Python, with $$n_x = 2$$ features:

python

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def naive_step(x1, x2, y, w1, w2, b, alpha):
    """One gradient-descent step; x1, x2, y are length-m sequences."""
    m = len(y)
    J = dw1 = dw2 = db = 0.0               # initialize accumulators

    for i in range(m):                     # loop 1: over examples
        z  = w1 * x1[i] + w2 * x2[i] + b
        a  = sigmoid(z)
        J -= y[i] * math.log(a) + (1 - y[i]) * math.log(1 - a)
        dz = a - y[i]
        dw1 += x1[i] * dz                  # loop 2 hides here (per feature)
        dw2 += x2[i] * dz
        db  += dz

    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m   # average

    w1 = w1 - alpha * dw1                  # one gradient-descent step
    w2 = w2 - alpha * dw2
    b  = b  - alpha * db
    return w1, w2, b, J

Where this breaks down

There are two for-loops: an outer one over the $$m$$ examples, and an inner one over the $$n_x$$ features (here unrolled into dw1, dw2). For a 12,288-feature image and millions of examples, explicit Python loops are catastrophically slow. We need to delete the loops without changing the math. That's vectorization.

09 Vectorization

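Vectorization replaces explicit loops with whole-array operations. Computing $$z = w^Tx + b$$ in vectorized form rather than with a Python loop can be over 300× faster on large arrays, though the exact factor depends on array size and hardware. The speedup comes from two places: the work runs in compiled library code instead of the Python interpreter, and that code exploits SIMD (Single Instruction, Multiple Data) hardware, where the CPU or GPU applies one instruction across many data points at once.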

Rule of thumb

Avoid explicit for-loops within reason. If you're iterating element-by-element in Python, there is almost always a matrix operation that does it faster.
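A quick way to see the gap on your own machine is to time both versions of the same dot product. This is only a sketch: the array length is an arbitrary choice, and the timings it prints will vary with hardware and NumPy build, so no particular speedup is promised here.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
w = rng.random(n)
x = rng.random(n)

start = time.perf_counter()
z_loop = 0.0
for i in range(n):            # explicit Python loop, one multiply-add at a time
    z_loop += w[i] * x[i]
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
z_vec = np.dot(w, x)          # one vectorized call
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s  vectorized: {vec_seconds:.6f}s")
print("same answer:", np.isclose(z_loop, z_vec))
```

Both paths compute the same number; only the time to get there differs.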

Vectorized forward propagation

For one example we had $$z^{(i)} = w^Tx^{(i)} + b$$ and $$a^{(i)} = \sigma(z^{(i)})$$. For all $$m$$ at once, stack the examples into $$X \in \mathbb{R}^{(n_x,m)}$$ and compute every score in a single matrix product:

$$Z = w^TX + \begin{bmatrix} b & b & \cdots & b \end{bmatrix} = \begin{bmatrix} w^Tx^{(1)}+b & w^Tx^{(2)}+b & \cdots & w^Tx^{(m)}+b \end{bmatrix}$$

$$Z = \begin{bmatrix} z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix}$$

In NumPy the scalar $$b$$ is broadcast automatically into the row:

python

Z = np.dot(w.T, X) + b

The activation applies sigmoid element-wise across the whole row:

A = \sigma(Z) = \begin{bmatrix} \sigma(z^{(1)}) & \cdots & \sigma(z^{(m)}) \end{bmatrix} = \begin{bmatrix} a^{(1)} & \cdots & a^{(m)} \end{bmatrix}

python

A = sigmoid(Z)

Vectorized backward propagation

The single-example gradient $$dz^{(i)} = a^{(i)} - y^{(i)}$$ becomes one elementwise subtraction across the whole dataset:

dZ = A - Y = \begin{bmatrix} a^{(1)}-y^{(1)} & a^{(2)}-y^{(2)} & \cdots & a^{(m)}-y^{(m)} \end{bmatrix}

python

dZ = A - Y

The weight gradient — previously an accumulation loop $$dw\mathrel{+}=x^{(i)}dz^{(i)}$$ followed by a divide — is a single matrix product:

dw = \frac{1}{m}\,X\,dZ^{T}

python

dw = np.dot(X, dZ.T) / m

And the bias gradient is just the averaged sum:

db = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}

python

db = np.sum(dZ) / m

Both for-loops are gone, replaced by five array operations (Z, A, dZ, dw, db) that the hardware can execute in parallel.
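To confirm that the vectorized formulas compute the same gradients as the loop version, the two can be run side by side on a tiny random problem. The sizes and random data below are invented for the comparison.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, m = 4, 6                                   # tiny made-up problem
X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(1, m))
w = rng.normal(size=(n_x, 1))
b = 0.1

# Loop version: accumulate per-example gradients, then average.
dw_loop = np.zeros((n_x, 1))
db_loop = 0.0
for i in range(m):
    x_i = X[:, [i]]                             # column i, shape (n_x, 1)
    a_i = sigmoid((w.T @ x_i).item() + b)
    dz_i = a_i - Y[0, i]
    dw_loop += x_i * dz_i
    db_loop += dz_i
dw_loop /= m
db_loop /= m

# Vectorized version from this section.
A = sigmoid(np.dot(w.T, X) + b)
dZ = A - Y
dw = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m

print(np.allclose(dw, dw_loop), np.isclose(db, db_loop))  # True True
```

The matrix product `np.dot(X, dZ.T)` performs the sum over examples and the per-feature multiplication in one call, which is why both loops disappear at once.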

10 Implementation — NumPy & PyTorch

The whole training loop in NumPy

Every equation above collapses into a tight loop over gradient-descent iterations. The inner loops over examples and features have completely vanished:

python · numpy

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, alpha=0.01, num_iterations=1000):
    """Batch gradient descent for logistic regression.

    X: (n_x, m) features   Y: (1, m) labels
    Returns w: (n_x, 1) weights and b: scalar bias.
    """
    w = np.zeros((X.shape[0], 1))
    b = 0.0
    m = X.shape[1]

    for _ in range(num_iterations):
        Z  = np.dot(w.T, X) + b        # forward: scores
        A  = sigmoid(Z)                # forward: probabilities

        dZ = A - Y                     # backward: dz = a - y
        dw = np.dot(X, dZ.T) / m       # backward: weight gradient
        db = np.sum(dZ) / m            # backward: bias gradient

        w  = w - alpha * dw            # update
        b  = b - alpha * db

    return w, b

That is a complete, working classifier — built from nothing but the dot product, the sigmoid, the log loss, and a derivative.
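To see the loop actually learn something, it can be run on a small synthetic problem. Everything about the data here is an assumption made for the demo: two random features, a label that is 1 whenever their sum is positive, a learning rate of 0.1, and a 0.5 threshold for turning probabilities into predictions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(2, m))                          # two made-up features
Y = (X[0] + X[1] > 0).astype(float).reshape(1, m)    # label: is x1 + x2 positive?

w, b, alpha = np.zeros((2, 1)), 0.0, 0.1
for _ in range(1000):
    A = sigmoid(np.dot(w.T, X) + b)                  # forward
    dZ = A - Y                                       # backward
    w = w - alpha * np.dot(X, dZ.T) / m              # update weights
    b = b - alpha * np.sum(dZ) / m                   # update bias

# Threshold the probabilities at 0.5 to get hard predictions.
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(float)
accuracy = float(np.mean(predictions == Y))
print(f"training accuracy: {accuracy:.2f}")
```

Because the labels were generated by a linear rule, a single neuron can separate them almost perfectly, and both learned weights come out positive and of similar size. Real image data is far less cooperative.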

The same model in PyTorch

PyTorch hides the backward pass behind autograd — the computation graph from Section 6, built and differentiated automatically. You write only the forward pass; loss.backward() fills in every gradient:

python · pytorch

import torch
import torch.nn as nn

# X: (m, n_x) float tensor   y: (m, 1) float tensor of 0s and 1s
# PyTorch convention: rows = examples. X and y are assumed to be defined already.
n_x       = X.shape[1]
model     = nn.Linear(n_x, 1)              # holds w and b
criterion = nn.BCEWithLogitsLoss()         # sigmoid + log loss, fused
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    logits = model(X)                # forward: z = w·x + b
    loss   = criterion(logits, y)    # cost J

    optimizer.zero_grad()            # reset accumulated grads
    loss.backward()                  # autograd backward pass
    optimizer.step()                 # w -= alpha·dw ; b -= alpha·db

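Map it back to the math

nn.Linear is $$w^Tx + b$$. BCEWithLogitsLoss fuses the sigmoid and the log loss into a single operation that is more numerically stable than applying them separately (PyTorch's documentation attributes this to the log-sum-exp trick); its gradient with respect to the logits is the same $$a - y$$ signal, averaged over the batch by default. loss.backward() is the backward pass through the computation graph. optimizer.step() is the gradient-descent update. Nothing new, just automated.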

Where this goes next

Swap nn.Linear(n_x, 1) for a stack of linear layers with non-linearities between them and you have a deep neural network. Replace the dense layers with convolutions and you have a vision model. Replace them with attention and you have a transformer. In every case the four verbs are unchanged: score with weights and a bias, squash with a non-linearity, measure with a loss, and descend the gradient.

That's why logistic regression is worth this much attention. It isn't a stepping stone you leave behind — it's the atom every larger model is built from.
