00 From one neuron to a billion parameters
Every conversation about deep learning eventually wants to start at the top — attention heads, billion-parameter models, emergent behaviour. This post starts at the bottom, on purpose. We are going to build a cat classifier with a single neuron and train it by hand, equation by equation, until the machinery is impossible to forget.
The task is binary classification: given an image, decide cat (1) or not-cat (0). A logistic regression model is the smallest honest answer to that question. It is also — and this is the point — structurally identical to a single unit inside any large network. The weighted sum, the non-linearity, the loss, and the gradient update you'll see below are the four verbs that every larger model conjugates billions of times.
Turning an image into a vector
A color image is a grid of numbers, not yet a single vector, so we unroll it. A $$64 \times 64$$ RGB image is three stacked channel matrices; we flatten them into one tall column vector:
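Schematically, assuming the channels are unrolled one after another in red, green, blue order:

$$
x = \begin{bmatrix} \text{all red values} \\ \text{all green values} \\ \text{all blue values} \end{bmatrix}
$$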
That gives a feature vector $$x \in \mathbb{R}^{n_x}$$ with $$n_x = 64 \times 64 \times 3 = 12{,}288$$ numbers. Everything that follows operates on this $$x$$.
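A minimal NumPy sketch of that flattening, assuming the image arrives as a `(height, width, channels)` array. The random `image` is a hypothetical stand-in for a real decoded photo.

```python
import numpy as np

# Hypothetical stand-in for a decoded 64x64 RGB image: (height, width, channels).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Move the channel axis first so each channel matrix is unrolled in turn
# (all red, then all green, then all blue), then reshape to a column vector.
x = image.transpose(2, 0, 1).reshape(-1, 1)

print(x.shape)  # (12288, 1)
```

A plain `image.reshape(-1, 1)` would also give 12,288 numbers, just interleaved pixel by pixel. Either ordering works as long as every image uses the same one.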
Notation we'll reuse throughout
A single training example is a pair $$(x,y)$$ with $$x \in \mathbb{R}^{n_x}$$ and label $$y \in \{0,1\}$$. A dataset is $$m$$ such pairs:
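Written out with the usual superscript-in-parentheses index for the example number:

$$
\{(x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \dots,\ (x^{(m)}, y^{(m)})\}
$$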
We stack the examples column-by-column into matrices $$X$$ and $$Y$$ — the shape choice that makes vectorization clean later:
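With one example per column, the shapes implied by the definitions above are:

$$
X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{n_x \times m},
\qquad
Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix} \in \mathbb{R}^{1 \times m}
$$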
We write $$m_{\text{train}}$$ and $$m_{\text{test}}$$ when the split matters.
01 Weights
We want a probability estimate $$\hat{y} = P(y=1 \mid x)$$ — "how confident are we this is a cat?" The model's first job is to score the evidence in $$x$$, and it does this with a weight for every feature.
The weights live in a vector $$w \in \mathbb{R}^{n_x}$$, one weight per pixel-channel. The score is the dot product $$w^Tx$$: each feature is multiplied by its weight and the results are summed.
A weight is the model's opinion about a feature. A large positive $$w_i$$ says "when pixel $$i$$ is bright, lean toward cat." A large negative weight pushes the other way. A weight near zero says the feature is noise. Training is nothing more than discovering the right opinions from data.
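A tiny illustration of the score, using three made-up features instead of 12,288. The values are chosen only to show a positive, a negative, and a near-zero weight at work.

```python
import numpy as np

# Toy values for illustration only.
x = np.array([[0.9], [0.1], [0.5]])    # feature column vector
w = np.array([[2.0], [-1.0], [0.0]])   # "lean toward cat", "lean away", "noise"

# Dot product w^T x: multiply each feature by its weight, then sum.
score = (w.T @ x).item()

# 2.0*0.9 + (-1.0)*0.1 + 0.0*0.5 = 1.7
print(score)
```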
This same dot product is what every neuron in a deep network computes. Scale $$n_x$$ up, stack thousands of these units, and you have a layer.
02 Biases
Weights tilt the decision; the bias shifts it. The bias $$b \in \mathbb{R}$$ is a single number added after the weighted sum, giving the model's raw score:
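In symbols, matching the definition of $$z$$ used in the next section:

$$
z = w^Tx + b
$$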
Without a bias, the decision boundary is forced through the origin — the model can only ask "which features are present" and never "is there a baseline tendency toward cat regardless of the pixels?" The bias supplies exactly that adjustable threshold.
03 The Sigmoid Function
The score $$z = w^Tx + b$$ can be any real number — wildly positive, wildly negative. But we promised a probability, which must live in $$[0,1]$$. The sigmoid $$\sigma$$ is the squashing function that performs the conversion:
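In its standard form, with the prediction obtained by applying it to the score:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y} = \sigma(w^Tx + b)
$$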
Reading the curve at its extremes tells you everything about its behaviour:
- If $$z$$ is very large, then $$e^{-z} \approx 0$$, so $$\sigma(z) \approx 1$$ — confidently "cat."
- If $$z$$ is very small (large negative), then $$e^{-z} \to \infty$$, so $$\sigma(z) \approx 0$$ — confidently "not cat."
- At $$z = 0$$, $$\sigma(0) = 0.5$$ — maximal uncertainty.
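The three cases above can be checked numerically. This is a minimal sketch; the helper name `sigmoid` is a choice made here, not something fixed by the text.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Very negative, zero, and very positive scores.
for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z))
```

The middle value is exactly 0.5; the outer two sit very close to 0 and 1 without ever reaching them.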
The sigmoid is the original "activation function." Modern networks mostly swap it for ReLU in hidden layers, but it survives wherever a value has to be squashed into the range between 0 and 1: the final layer of a binary classifier and the gates inside an LSTM, for example. The squash-into-a-range idea never leaves you.
04 The Loss Function
We now have a prediction $$\hat{y}$$. To improve it we need a number that says how wrong we were on a single example. That number is the loss.
The tempting (wrong) choice: squared error
The obvious loss is squared error:
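One common form, writing the per-example loss as $$\mathcal{L}(\hat{y}, y)$$ (a symbol assumed here) and including the conventional factor of one half:

$$
\mathcal{L}(\hat{y}, y) = \tfrac{1}{2}\,(\hat{y} - y)^2
$$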
It works for linear regression, but it sabotages logistic regression. Because $$\hat{y}$$ comes through the sigmoid, the resulting optimization surface is non-convex — riddled with local minima where gradient descent can get stuck. We want a bowl, not a mountain range.
The right choice: log loss
Instead we use the binary cross-entropy (log) loss:
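In its standard form, with $$\mathcal{L}$$ as an assumed symbol for the per-example loss and $$\log$$ the natural logarithm:

$$
\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big)
$$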
It looks opaque until you plug in the two possible labels — then it reads like plain English:
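Substituting each label into the standard log loss, one of the two terms vanishes:

$$
\mathcal{L}(\hat{y}, y) =
\begin{cases}
-\log \hat{y} & \text{if } y = 1 \quad (\text{small only when } \hat{y} \text{ is close to } 1) \\
-\log(1 - \hat{y}) & \text{if } y = 0 \quad (\text{small only when } \hat{y} \text{ is close to } 0)
\end{cases}
$$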
The loss is engineered so that "small loss" and "correct, confident prediction" mean the same thing. And crucially, it is convex in $$w,b$$ — a single global minimum to descend into.
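A few numbers make the "small loss means correct and confident" claim concrete. This sketch assumes the standard log loss; `log_loss` is a name chosen here for illustration.

```python
import numpy as np

def log_loss(y_hat, y):
    """Binary cross-entropy for a single prediction y_hat in (0, 1)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

confident_right = log_loss(0.99, 1)  # true cat, predicted 0.99
unsure = log_loss(0.5, 1)            # true cat, predicted 0.5
confident_wrong = log_loss(0.01, 1)  # true cat, predicted 0.01

print(confident_right, unsure, confident_wrong)
```

The loss grows without bound as a confident prediction moves toward the wrong label, which is what pushes training away from confident mistakes.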
05 The Cost Function
Loss scores one example. To train, we need a verdict on the whole training set. That's the cost function $$J$$ — the average loss over all $$m$$ examples:
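As a formula, with $$\mathcal{L}$$ again standing in (as an assumed symbol) for the per-example loss:

$$
J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
$$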
Expanding the loss gives the form we'll actually differentiate:
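Assuming the log loss from the previous section, the expansion is:

$$
J(w, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big)\log\big(1 - \hat{y}^{(i)}\big) \Big]
$$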
And that — minimizing a function — is a job for calculus. Enter gradient descent.
06 Gradient Descent
We have a cost surface $$J(w,b)$$ and we want its lowest point. We can't solve for it directly, so we walk downhill. Gradient descent is the rule for taking each step:
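In its standard one-dimensional form, using the symbols defined just below:

$$
w := w - \alpha\,\frac{\partial J(w)}{\partial w}
$$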
where
- $$\alpha$$ is the learning rate — the step size,
- $$\dfrac{\partial J(w)}{\partial w}$$ is the slope of the cost at the current point,
- $$:=$$ means "update / assign."
The sign of the slope handles direction automatically:
- Start on the right of the bowl: $$\frac{\partial J}{\partial w} > 0$$, so we subtract a positive quantity and the point moves left, toward the minimum.
- Start on the left: $$\frac{\partial J}{\partial w} < 0$$, so we subtract a negative quantity — effectively add — and the point moves right, toward the minimum.
Either way, the point approaches the bottom. That self-correcting behaviour is why convexity (from the log loss) mattered so much.
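The two starting points can be simulated on a toy bowl. This sketch assumes the made-up cost $$J(w) = (w-3)^2$$, which is not the logistic cost but has the same single-minimum shape.

```python
def gradient_descent_1d(w, alpha=0.1, steps=100):
    """Minimise the toy bowl J(w) = (w - 3)^2, whose slope is 2 * (w - 3)."""
    for _ in range(steps):
        slope = 2.0 * (w - 3.0)
        w = w - alpha * slope  # w := w - alpha * dJ/dw
    return w

from_right = gradient_descent_1d(10.0)   # slope starts positive, so w moves left
from_left = gradient_descent_1d(-10.0)   # slope starts negative, so w moves right

print(from_right, from_left)
```

Both runs end next to the minimum at $$w = 3$$, even though they start on opposite sides.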
Since $$J$$ depends on both parameters, we update each with its own partial derivative:
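Following the same pattern as the single-parameter rule:

$$
w := w - \alpha\,\frac{\partial J(w, b)}{\partial w}, \qquad
b := b - \alpha\,\frac{\partial J(w, b)}{\partial b}
$$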
The symbol $$\partial$$ is the partial derivative — the same idea as $$d$$, used when a function depends on more than one variable.
The computation graph: bookkeeping for derivatives
Before differentiating the real model, it helps to see how derivatives flow through a chain of operations. Take a toy function $$J(a,b,c) = 3(a+bc)$$ and break it into atomic steps:
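The intermediate names $$u$$ and $$v$$ below are the ones the sample values in the next paragraph refer to:

$$
u = bc, \qquad v = a + u, \qquad J = 3v
$$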
Computing $$J$$ left-to-right is the forward pass. Computing derivatives right-to-left is the backward pass. With sample values $$a=5,\,b=3,\,c=2$$ (so $$u=6,\,v=11,\,J=33$$):
One step back — nudge $$v$$ by $$0.001$$ and $$J$$ moves by $$0.003$$, so:
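Since $$J = 3v$$, the ratio of the two nudges is the derivative:

$$
\frac{dJ}{dv} = \frac{0.003}{0.001} = 3
$$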
The chain rule propagates this backward through every node:
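Worked out for this toy graph with $$b = 3$$ and $$c = 2$$:

$$
\begin{aligned}
\frac{dJ}{da} &= \frac{dJ}{dv}\cdot\frac{dv}{da} = 3 \cdot 1 = 3 \\
\frac{dJ}{du} &= \frac{dJ}{dv}\cdot\frac{dv}{du} = 3 \cdot 1 = 3 \\
\frac{dJ}{db} &= \frac{dJ}{du}\cdot\frac{du}{db} = 3 \cdot c = 6 \\
\frac{dJ}{dc} &= \frac{dJ}{du}\cdot\frac{du}{dc} = 3 \cdot b = 9
\end{aligned}
$$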
The computation graph plus the chain rule is backprop. The only difference in a deep network is that the graph is enormous and the special output node being minimized is the cost $$J$$. Autograd engines like PyTorch's automate exactly this backward pass.
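The forward and backward passes for the toy graph fit in a few lines of plain Python. The finite-difference check at the end is an addition made here to confirm one of the derivatives numerically.

```python
def forward(a, b, c):
    """J(a, b, c) = 3 * (a + b * c)."""
    return 3.0 * (a + b * c)

# Forward pass, left to right.
a, b, c = 5.0, 3.0, 2.0
u = b * c      # 6
v = a + u      # 11
J = 3.0 * v    # 33

# Backward pass, right to left, one chain-rule hop per node.
dJ_dv = 3.0
dJ_da = dJ_dv * 1.0   # dv/da = 1
dJ_du = dJ_dv * 1.0   # dv/du = 1
dJ_db = dJ_du * c     # du/db = c
dJ_dc = dJ_du * b     # du/dc = b

# Numerical check: nudge b slightly and watch how much J moves.
eps = 1e-6
numeric_dJ_db = (forward(a, b + eps, c) - forward(a, b, c)) / eps

print(J, dJ_da, dJ_db, dJ_dc, numeric_dJ_db)
```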
07 Gradient Descent on Logistic Regression
Now we apply the graph to the actual model — for a single example with two features $$x_1, x_2$$, weights $$w_1, w_2$$, and bias $$b$$.
Forward pass
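The three steps, assuming the log loss and writing the activation as $$a$$ (so $$a = \hat{y}$$) and the loss as $$\mathcal{L}$$, an assumed symbol:

$$
z = w_1x_1 + w_2x_2 + b, \qquad a = \sigma(z), \qquad \mathcal{L}(a, y) = -\big(y\log a + (1 - y)\log(1 - a)\big)
$$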
Backward pass — the gradients
Working right-to-left through the graph. First the derivative of loss with respect to the activation:
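Differentiating the log loss with respect to $$a$$, and using the shorthand $$da$$ for that derivative:

$$
da = \frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}
$$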
Then the famous simplification. Because the sigmoid's derivative is $$\frac{\partial a}{\partial z} = a(1-a)$$, the messy derivative of the loss with respect to $$a$$ collapses beautifully once the chain rule multiplies the two together:
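Step by step, with $$dz$$ as shorthand for the derivative of the loss with respect to $$z$$:

$$
dz = \frac{\partial \mathcal{L}}{\partial z}
= \frac{\partial \mathcal{L}}{\partial a}\cdot\frac{\partial a}{\partial z}
= \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) a(1 - a)
= -y(1 - a) + (1 - y)\,a
= a - y
$$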
From $$dz$$, the parameter gradients follow by one more chain-rule hop:
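Because $$z = w_1x_1 + w_2x_2 + b$$, each parameter's gradient is $$dz$$ times whatever multiplies that parameter:

$$
dw_1 = x_1\,dz, \qquad dw_2 = x_2\,dz, \qquad db = dz
$$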
Update the parameters
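Applying the gradient-descent rule to each parameter with learning rate $$\alpha$$:

$$
w_1 := w_1 - \alpha\,dw_1, \qquad w_2 := w_2 - \alpha\,dw_2, \qquad b := b - \alpha\,db
$$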
One example, one downhill step. Now scale it to the whole dataset.
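Here is a sketch of that single step in code, with made-up numbers for the example and the starting parameters. It also compares the analytic $$dw_1$$ against a finite-difference estimate, a sanity check added here rather than part of the algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Log loss for one example with two features."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Toy example and starting parameters (illustrative values only).
x1, x2, y = 1.5, -0.5, 1.0
w1, w2, b = 0.2, -0.3, 0.1
alpha = 0.1

# Forward pass.
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# Backward pass.
dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Finite-difference estimate of dL/dw1 at the starting point.
eps = 1e-6
numeric_dw1 = (loss(w1 + eps, w2, b, x1, x2, y)
               - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)

# One gradient-descent update.
loss_before = loss(w1, w2, b, x1, x2, y)
w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db
loss_after = loss(w1, w2, b, x1, x2, y)

print(dw1, numeric_dw1, loss_before, loss_after)
```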
08 Gradient Descent Across m Examples
The cost averages the loss over all examples, so its gradient averages the per-example gradients:
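For the first weight, with the same pattern holding for $$w_2$$ and $$b$$:

$$
\frac{\partial J(w, b)}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial w_1}\,\mathcal{L}\big(a^{(i)}, y^{(i)}\big) = \frac{1}{m}\sum_{i=1}^{m} x_1^{(i)}\,dz^{(i)}
$$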
The naive algorithm (two nested loops)
Translated directly into pseudocode, with $$n_x = 2$$ features:
The algorithm needs two loops: one over the $$m$$ examples and one over the features (here just dw1, dw2). For a 12,288-feature image and millions of examples, explicit Python loops are catastrophically slow. We need to delete the loops without changing the math. That's vectorization.
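A runnable version of the loop-based algorithm might look like the sketch below. The function name `naive_gradients` and the array shapes ($$X$$ as `(n_x, m)`, $$Y$$ as `(1, m)`, $$w$$ as `(n_x, 1)`) are assumptions made here; the feature loop is written generally rather than unrolled.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def naive_gradients(X, Y, w, b):
    """Cost and gradients for one pass over the data, using explicit loops."""
    n_x, m = X.shape
    J, db = 0.0, 0.0
    dw = [0.0] * n_x
    for i in range(m):                 # loop over examples
        z = b
        for j in range(n_x):           # weighted sum for example i
            z += w[j, 0] * X[j, i]
        a = sigmoid(z)
        J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
        dz = a - Y[0, i]
        for j in range(n_x):           # loop over features
            dw[j] += X[j, i] * dz
        db += dz
    # Turn the accumulated sums into averages.
    return J / m, [g / m for g in dw], db / m

# Tiny toy dataset: 2 features, 3 examples.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 2.0]])
Y = np.array([[1.0, 0.0, 1.0]])
J, dw, db = naive_gradients(X, Y, np.zeros((2, 1)), 0.0)
print(J, dw, db)
```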
09 Vectorization
Vectorization replaces explicit loops with whole-array operations. Computing $$z = w^Tx + b$$ in vectorized form rather than a Python loop can be over 300× faster, because it exploits SIMD (Single Instruction, Multiple Data) hardware — the CPU/GPU applies one instruction across many data points at once.
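A rough way to see the gap on your own machine is to time both versions of the same dot product. The array size is arbitrary, and the measured speedup depends heavily on hardware and NumPy build, so treat the printed numbers as indicative only.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
w = rng.random(n)
x = rng.random(n)

# Explicit Python loop.
start = time.perf_counter()
z_loop = 0.0
for i in range(n):
    z_loop += w[i] * x[i]
loop_seconds = time.perf_counter() - start

# Vectorized dot product.
start = time.perf_counter()
z_vec = np.dot(w, x)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.6f}s  vectorized: {vec_seconds:.6f}s")
```

Both versions compute the same number up to floating-point rounding; only the time taken differs.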
Vectorized forward propagation
For one example we had $$z^{(i)} = w^Tx^{(i)} + b$$ and $$a^{(i)} = \sigma(z^{(i)})$$. For all $$m$$ at once, stack the examples into $$X \in \mathbb{R}^{n_x \times m}$$ and compute every score in a single matrix product:
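Collecting the per-example scores into a row vector $$Z$$:

$$
Z = \begin{bmatrix} z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix} = w^TX + \begin{bmatrix} b & b & \cdots & b \end{bmatrix} \in \mathbb{R}^{1 \times m}
$$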
In NumPy the scalar $$b$$ is broadcast automatically into the row:
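A short sketch with toy sizes and random values (both arbitrary choices made here) shows the shapes involved:

```python
import numpy as np

n_x, m = 4, 3                        # toy sizes for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))    # one example per column
w = rng.standard_normal((n_x, 1))
b = 0.5

# (1, n_x) times (n_x, m) gives (1, m); the scalar b is added to every entry.
Z = np.dot(w.T, X) + b

print(Z.shape)  # (1, 3)
```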
The activation applies sigmoid element-wise across the whole row:
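Using $$A$$ for the row of activations:

$$
A = \begin{bmatrix} a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)
$$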
Vectorized backward propagation
The single-example gradient $$dz^{(i)} = a^{(i)} - y^{(i)}$$ becomes one elementwise subtraction across the whole dataset:
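With $$A$$ and $$Y$$ both of shape $$1 \times m$$:

$$
dZ = A - Y = \begin{bmatrix} a^{(1)} - y^{(1)} & \cdots & a^{(m)} - y^{(m)} \end{bmatrix}
$$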
The weight gradient — previously an accumulation loop $$dw\mathrel{+}=x^{(i)}dz^{(i)}$$ followed by a divide — is a single matrix product:
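The product of $$X$$ (shape $$n_x \times m$$) with the transposed $$dZ$$ (shape $$m \times 1$$) performs the sum over examples in one go:

$$
dw = \frac{1}{m}\,X\,dZ^T \in \mathbb{R}^{n_x \times 1}
$$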
And the bias gradient is just the averaged sum:
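In symbols, which in NumPy would typically be written `np.sum(dZ) / m`:

$$
db = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}
$$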
Both for-loops are gone, replaced by four array operations the hardware executes in parallel.
10 Implementation — NumPy & PyTorch
The whole training loop in NumPy
Every equation above collapses into a tight loop over gradient-descent iterations. The inner loops over examples and features have completely vanished:
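A minimal sketch of such a loop follows. The function names, the zero initialization, the hyperparameter values, and the synthetic two-feature dataset are all assumptions made for illustration; a real cat classifier would pass in flattened images instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, iterations=1000):
    """Batch gradient descent for logistic regression.

    X has shape (n_x, m), Y has shape (1, m).
    """
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    costs = []
    for _ in range(iterations):
        # Forward pass.
        Z = np.dot(w.T, X) + b
        A = sigmoid(Z)
        costs.append(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))
        # Backward pass.
        dZ = A - Y
        dw = np.dot(X, dZ.T) / m
        db = np.sum(dZ) / m
        # Gradient-descent update.
        w = w - alpha * dw
        b = b - alpha * db
    return w, b, costs

def predict(w, b, X):
    """Label 1 where the predicted probability exceeds 0.5."""
    return (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)

# Synthetic stand-in data: label is 1 when the two features sum to more than 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)

w, b, costs = train(X, Y)
accuracy = np.mean(predict(w, b, X) == Y)
print(costs[0], costs[-1], accuracy)
```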
That is a complete, working classifier — built from nothing but the dot product, the sigmoid, the log loss, and a derivative.
The same model in PyTorch
PyTorch hides the backward pass behind autograd — the computation graph from Section 6, built and differentiated automatically. You write only the forward pass; loss.backward() fills in every gradient:
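A minimal sketch of the PyTorch version, again on made-up synthetic data. The learning rate, iteration count, and dataset are assumptions chosen for illustration. Note that PyTorch's convention puts one example per row, so `X` here has shape `(m, n_x)` rather than `(n_x, m)`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in data: label is 1 when the two features sum to more than 0.
m, n_x = 200, 2
X = torch.randn(m, n_x)
Y = (X.sum(dim=1, keepdim=True) > 0).float()

model = nn.Linear(n_x, 1)                 # computes the score w^T x + b
loss_fn = nn.BCEWithLogitsLoss()          # sigmoid and log loss in one step
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(1000):
    optimizer.zero_grad()                 # clear gradients from the last step
    logits = model(X)                     # forward pass: raw scores z
    loss = loss_fn(logits, Y)
    loss.backward()                       # backward pass via autograd
    optimizer.step()                      # gradient-descent update
    losses.append(loss.item())

with torch.no_grad():
    predictions = (torch.sigmoid(model(X)) > 0.5).float()
    accuracy = (predictions == Y).float().mean().item()

print(losses[0], losses[-1], accuracy)
```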
nn.Linear is $$w^Tx + b$$. BCEWithLogitsLoss fuses the sigmoid and the log loss into a single operation, which is more numerically stable than applying them one after the other; its gradient with respect to the score is the same $$dz = a - y$$ derived earlier. loss.backward() is the backward pass through the computation graph. optimizer.step() is the gradient-descent update. Nothing new, just automated.
Where this goes next
Swap nn.Linear(n_x, 1) for a stack of linear layers with non-linearities between them and you have a deep neural network. Replace the dense layers with convolutions and you have a vision model. Replace them with attention and you have a transformer. In every case the four verbs are unchanged: score with weights and a bias, squash with a non-linearity, measure with a loss, and descend the gradient.
That's why logistic regression is worth this much attention. It isn't a stepping stone you leave behind — it's the atom every larger model is built from.