2 Neural networks and image classification


Most introductions to modern AI start with large language models. We are going to start with image classification instead, for two reasons. The first is historical: the deep learning revolution kicked off in vision in 2012, and the patterns the field worked out there (depth, scale, normalization, residual connections, GPU-friendly architectures) recur in everything that comes later. The second is pedagogical: image classification is the cleanest setting in which to introduce a neural network. You can see what the input is, you can see what the output is, and you can inspect what the layers in the middle are doing.

We will skip classical machine learning entirely. SVMs, kNN, decision trees, and kernel methods are covered well in other books, and they are no longer where the interesting questions in modern AI live. The historical thread that matters is the one that runs neural networks → convolutional neural networks → ImageNet → AlexNet → ResNet → vision transformers.

One question will hang over the chapter: “why did it take so long?” Neural networks were proposed between the 1940s and the 1960s. They beat hand-crafted features on ImageNet only in 2012. What changed in between (depth, scale, normalization, initialization, residual connections, GPU-amenable architectures) is the actual subject matter of the chapter.

Neural networks, properly

A neural network is a parameterized differentiable function f_\theta: \mathcal{X} \to \mathcal{Y} trained by gradient descent on some loss. You know this. What is worth doing carefully here is naming the ingredients that the field had to discover before deep learning actually worked.

A multi-layer perceptron (MLP) is the simplest case: an alternating stack of affine transformations and elementwise nonlinearities, h_{\ell+1} = \sigma(W_\ell h_\ell + b_\ell), with the output produced from the final hidden state by another linear projection. The universal approximation theorem says that any continuous function on a compact domain can be approximated arbitrarily well by an MLP, provided you make it wide enough. As discussed in the previous chapter, this theorem is correct and largely useless: it tells you nothing about how to find the weights, how deep you need to go, or what inductive biases matter. It is a good example of theory being technically true and practically misleading.
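To make the recurrence concrete, here is a minimal sketch of an MLP forward pass in plain Python. The layer sizes, the choice of tanh for \sigma, and the helper `random_layer` are illustrative assumptions, not anything prescribed by the chapter.

```python
import math
import random

def affine(W, b, h):
    """Compute W h + b for a weight matrix W (list of rows) and vectors b, h."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]

def mlp_forward(layers, x, sigma=math.tanh):
    """Apply h_{l+1} = sigma(W_l h_l + b_l) for every hidden layer,
    then a final linear projection with no nonlinearity."""
    *hidden, (W_out, b_out) = layers
    h = x
    for W, b in hidden:
        h = [sigma(z) for z in affine(W, b, h)]
    return affine(W_out, b_out, h)

def random_layer(n_in, n_out, rng):
    """Hypothetical helper: small Gaussian weights, zero biases."""
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
layers = [random_layer(4, 8, rng), random_layer(8, 8, rng), random_layer(8, 3, rng)]
logits = mlp_forward(layers, [1.0, -0.5, 0.25, 2.0])
print(len(logits))  # one output per class; here 3
```

In practice the same computation is written as batched matrix multiplications in a tensor library; the loops here only spell out what those calls do.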

The actual machinery that makes deep networks work, in roughly the order it had to be invented:

Stochastic gradient descent. You estimate the loss gradient on a minibatch of examples and step. The noise is not a bug: it acts as an implicit regularizer and is widely believed to help the optimizer escape sharp minima. For classification, the loss is cross-entropy, \mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^N \log p_\theta(y_i \mid x_i), which is just the negative log-likelihood under the model.
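As a rough illustration, the sketch below trains a linear softmax classifier with minibatch SGD on the cross-entropy loss above. The toy two-cluster dataset, the batch size of 16, and the learning rate of 0.5 are assumptions picked only to keep the example small.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs_batch, labels):
    """Mean negative log-likelihood, -1/N sum_i log p(y_i | x_i)."""
    return -sum(math.log(p[y]) for p, y in zip(probs_batch, labels)) / len(labels)

def predict(W, x):
    """Linear classifier: one row of W per class, no bias for brevity."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def sgd_step(W, batch, lr):
    """One SGD step on a minibatch. For softmax + cross-entropy the gradient
    with respect to the logits is (p - onehot(y)), so dL/dW[k][j] averages
    (p_k - 1[k == y]) * x_j over the batch."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, y in batch:
        p = predict(W, x)
        for k in range(len(W)):
            err = p[k] - (1.0 if k == y else 0.0)
            for j, xj in enumerate(x):
                grad[k][j] += err * xj / len(batch)
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad)]

# Toy data (an assumption): class 0 clusters near (+1, +1), class 1 near (-1, -1).
rng = random.Random(0)
data = [([s + rng.gauss(0, 0.3), s + rng.gauss(0, 0.3)], y)
        for y, s in [(0, 1.0), (1, -1.0)] for _ in range(50)]

def dataset_loss(W):
    return cross_entropy([predict(W, x) for x, _ in data], [y for _, y in data])

W = [[0.0, 0.0], [0.0, 0.0]]
loss_before = dataset_loss(W)          # log(2): the model starts out uniform
for step in range(200):
    W = sgd_step(W, rng.sample(data, 16), lr=0.5)  # a fresh minibatch each step
loss_after = dataset_loss(W)
print(loss_before, loss_after)
```

Each step sees only 16 of the 100 examples, so the gradient is a noisy estimate of the full-data gradient; that is the "stochastic" part.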

Optimizers. Vanilla SGD works but is slow and finicky. Adding momentum smooths the trajectory. Adam adapts the step size per parameter using running estimates of first and second gradient moments. AdamW decouples weight decay from the adaptive step. By default you reach for AdamW; you can get fancier when you need to.
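The update rule is short enough to write out. What follows is a sketch of one AdamW step on a flat list of parameters, following the commonly used formulation (bias-corrected moments, weight decay applied directly to the parameters). The function names, the default hyperparameters, and the quadratic test problem are assumptions for illustration.

```python
import math

def adamw_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on flat lists of floats.
    state holds the step count t and running moments m (first) and v (second)."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)   # bias correction
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Decoupled weight decay: shrink p directly instead of adding
        # weight_decay * p to the gradient, so it bypasses the adaptive scaling.
        p = p - lr * weight_decay * p
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
    return new_params

def init_state(n):
    """Hypothetical helper: zeroed moments for n parameters."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}

# Minimize f(p) = sum(p_i^2), whose gradient is 2 * p_i.
params = [3.0, -2.0]
state = init_state(len(params))
for _ in range(500):
    grads = [2.0 * p for p in params]
    params = adamw_step(params, grads, state, lr=0.05)
print(params)  # both entries end up close to zero
```

Setting `weight_decay=0.0` recovers plain Adam; dropping the division by the second-moment term as well leaves SGD with momentum.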

Initialization. This was a real bottleneck for a long time. If you initialize the weights too small, signals vanish through the depth; too large, they explode. The Glorot and He schemes set the variance of each layer’s weights so that the activations and gradients keep a roughly stable scale across depth. Without this, deep networks simply did not train.
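A small numerical experiment shows the effect. The sketch below pushes a random input through a stack of ReLU layers with Gaussian weights and reports the root-mean-square activation at the end. The width, depth, and the factors 0.5 and 2.0 are arbitrary choices; the He scale Var(w) = 2 / fan_in is the standard one for ReLU.

```python
import math
import random

def relu(z):
    return z if z > 0.0 else 0.0

def activation_scale_after(depth, width, weight_std, rng):
    """Push one random input through `depth` ReLU layers with Gaussian
    weights of the given standard deviation; return the RMS activation."""
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Fresh random weights for every unit in every layer.
        h = [relu(sum(rng.gauss(0.0, weight_std) * hj for hj in h))
             for _ in range(width)]
    return math.sqrt(sum(v * v for v in h) / width)

width, depth = 64, 20
rng = random.Random(0)
he_std = math.sqrt(2.0 / width)          # He scheme: Var(w) = 2 / fan_in
too_small = activation_scale_after(depth, width, 0.5 * he_std, rng)
he_scale = activation_scale_after(depth, width, he_std, rng)
too_large = activation_scale_after(depth, width, 2.0 * he_std, rng)
print(too_small, he_scale, too_large)
```

Halving the standard deviation shrinks the signal by roughly a factor of two per layer, and doubling it grows the signal by the same factor, so after 20 layers the three runs differ by many orders of magnitude. The same argument applies to gradients on the backward pass.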

Normalization layers. Batch normalization made deeper networks trainable by normalizing the activations of each feature using statistics computed across the batch. Layer normalization replaced it in most modern architectures because it normalizes each example on its own and so does not depend on batch size, which matters for sequence models and small-batch training. The point of normalization layers is not theoretical elegance; it is that without them the optimization problem is much harder.
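The difference between the two is only the axis along which statistics are computed. The sketch below shows both in plain Python, with the learnable scale and shift and batch norm's running averages left out for brevity; the example batch is made up.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one example across its features (no batch involved)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature across the examples in the batch
    (training-mode statistics; running averages omitted)."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(c) / n for c in columns]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(columns, means)]
    return [[(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, means, variances)]
            for row in batch]

batch = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 0.0, -10.0, 0.0],
         [5.0, 5.0, 6.0, 4.0]]

# Layer norm gives the same answer for an example whatever else is in the batch.
alone = layer_norm(batch[0])
# Batch norm output for the first example changes when the batch changes.
bn_full = batch_norm(batch)[0]
bn_pair = batch_norm(batch[:2])[0]
print(alone)
print(bn_full[0], bn_pair[0])
```

The last two values differ even though the first example is identical in both calls, which is the batch dependence that layer normalization avoids.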

Residual connections. A residual block computes x \mapsto x + F(x). This is the move that unlocked very deep networks. It changes the inductive bias from “learn the function” to “learn a correction to the function so far,” and it lets gradients flow through the depth without vanishing. The same idea, a residual stream that components read from and write to, is the computational substrate of the transformer in Chapter 4.
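A scalar toy model is enough to see the gradient-flow claim. In the sketch below, F(x) = tanh(w x) with a deliberately small weight, stacked 30 times with and without the skip connection; the gradient with respect to the input is estimated by finite differences. All of the specifics (scalar layers, tanh, w = 0.1, 30 layers) are assumptions chosen for illustration.

```python
import math

def plain_layer(x, w):
    """A plain layer: the output is just F(x) = tanh(w * x)."""
    return math.tanh(w * x)

def residual_layer(x, w):
    """A residual block: x -> x + F(x) with the same F."""
    return x + math.tanh(w * x)

def input_gradient(layer, x, weights, h=1e-6):
    """Finite-difference estimate of d(output)/d(input) through a stack."""
    def run(v):
        for w in weights:
            v = layer(v, w)
        return v
    return (run(x + h) - run(x - h)) / (2 * h)

weights = [0.1] * 30  # 30 layers, each with a small weight
plain_grad = input_gradient(plain_layer, 0.5, weights)
residual_grad = input_gradient(residual_layer, 0.5, weights)
print(plain_grad, residual_grad)
```

Each plain layer multiplies the gradient by about w, so 30 of them leave almost nothing. Each residual block multiplies it by 1 + F'(x), which is at least 1 here, so the signal survives the whole stack.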

Hyperparameters that matter. Learning rate is the one you will tune most. Batch size interacts with learning rate (larger batches generally tolerate larger steps). Weight decay is an explicit regularizer that shrinks the weights toward zero. Dropout is less universal than it once was but still useful. The general advice is: tune learning rate first, batch size second, everything else later.

Why GPUs. Neural networks are matrix multiplications and elementwise operations on tensors. GPUs are devices that do exactly that, very fast, in parallel. CPUs are not. Almost every important architectural choice in modern deep learning is a choice about what is easy to compute efficiently on a GPU. This is where the “bitter lesson” of Chapter 1 cashes out architecturally.

[Plot] A schematic showing depth on the x-axis and “trainable / not trainable” on the y-axis, with the introduction of (initialization, normalization, residual connections) shifting the trainability frontier deeper. Not real numbers: a conceptual diagram of which inventions unlocked which depth regime.

Image classification, specifically

Image classification is the task: given an image, output a category label. The history is dense, the architectures rhyme, and a small number of ideas do all the work.

Convolutional neural networks (CNNs) encode a powerful inductive bias: translation equivariance and locality. A convolution slides the same small filter over the entire image, so a feature in the top-left and the same feature in the bottom-right are detected by the same weights. This is the canonical example of building a prior into the architecture. It is also a useful instance of a recurring tension: priors save you data when they are right, and hurt you when they are wrong. Vision transformers later drop most of this prior and recover the performance from data.
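The weight-sharing claim can be checked directly. The sketch below implements a stride-1 "valid" convolution (strictly a cross-correlation, which is what deep learning libraries compute) and applies one 2×2 filter to two images that contain the same feature at different positions. The image size, the feature, and the helper names are made up for the example.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over the image ('valid' cross-correlation, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def blank(h, w):
    return [[0.0] * w for _ in range(h)]

def stamp(image, patch, top, left):
    """Return a copy of image with patch pasted at (top, left)."""
    out = [row[:] for row in image]
    for a, prow in enumerate(patch):
        for b, v in enumerate(prow):
            out[top + a][left + b] = v
    return out

def argmax2d(grid):
    """Position (row, col) of the largest entry."""
    return max(((i, j) for i in range(len(grid)) for j in range(len(grid[0]))),
               key=lambda ij: grid[ij[0]][ij[1]])

corner = [[1.0, 1.0],
          [1.0, 0.0]]            # a tiny "feature" and the filter that detects it
top_left = stamp(blank(8, 8), corner, 0, 0)
bottom_right = stamp(blank(8, 8), corner, 5, 4)

peak_a = argmax2d(conv2d_valid(top_left, corner))      # (0, 0)
peak_b = argmax2d(conv2d_valid(bottom_right, corner))  # (5, 4)
print(peak_a, peak_b)
```

Moving the feature moves the peak response by the same offset, and no weights had to be relearned for the new location. That is translation equivariance with only four parameters.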

ImageNet is the dataset that organized the field. The benchmark subset has roughly a million labeled images across a thousand object categories, and the annual competition built on it (ILSVRC) ran from 2010 to 2017. It mattered because everyone trained on the same data and could compare numbers directly. The history of image classification on ImageNet is in some sense the history of deep learning kicking off.

The historical arc.

AlexNet (2012) showed that a deep CNN trained on GPUs could beat hand-engineered features by a wide margin on ImageNet. By most reckonings, 2012 is the year deep learning “started working.”
VGG (2014) demonstrated that deeper but simpler (uniform 3×3 convolutions, more layers) was a productive direction.
ResNet (2015) introduced residual connections at scale and made networks with many tens or hundreds of layers trainable. This is the architecture usually credited with pushing image classification past the commonly cited estimate of human-level accuracy on ImageNet.
Vision transformers (ViT, 2020) revisited the whole stack, dropping convolutions in favor of attention on image patches. They needed more data to compensate for the weaker inductive bias, but at scale they matched or beat CNNs. The deeper machinery is in Chapter 4.
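The phrase “attention on image patches” hides one simple preprocessing step: the image is cut into a grid of non-overlapping tiles, and each tile is flattened into a vector that plays the role of a token. The sketch below shows only that step, on a tiny single-channel image. The function name and sizes are illustrative, and the linear projection and position embeddings that follow in a real ViT are left out.

```python
def patchify(image, patch):
    """Cut an H x W image (list of rows) into non-overlapping patch x patch
    tiles and flatten each tile into one vector, in row-major order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + a][left + b]
                           for a in range(patch) for b in range(patch)])
    return tokens

# A 4 x 4 single-channel "image" whose pixel values are 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), len(tokens[0]))  # 4 patches, 4 values each
print(tokens[0])                    # [0, 1, 4, 5]
```

Once the image is a sequence of vectors, nothing downstream knows that neighboring patches are spatially adjacent, which is the sense in which the inductive bias is weaker than a CNN's.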

The arc is not random. Each step swapped a hand-engineered choice for a more general one and let scale do the work. That is the bitter lesson in image-classification clothes.

Inductive bias vs compression

The deeper question lurking under image classification is: what does a trained network actually learn?

Two framings, both useful:

Inductive bias: the architecture and training procedure prefer certain functions over others. CNNs prefer translation-equivariant, hierarchical, locally structured functions. This is prior knowledge baked into the model.
Compression: training pushes the network to find a compact representation of the data that preserves task-relevant structure. Useful features are reused across many examples; useless ones are discarded.

These framings are not in opposition. The inductive bias decides what kind of compressions are accessible; training decides which one the network actually finds. For a physicist, this is reminiscent of the way symmetries in a physical system constrain the form of solutions without dictating which solution the system finds.

The first bitter lesson: bigger is better

Long before “scaling laws” became a coherent research subfield, image classification had already absorbed its own preview of the lesson: bigger networks, trained on more data with more compute, were just better. The improvement was neither marginal nor noisy; it was consistent across years, across architectures, and across teams.

This is the first taste of a pattern that gets the full treatment in Chapter 5, where we look at the actual functional form of the loss as a function of compute, data, and parameters [1], [2]. For now, just note the pattern: in image classification, the people who won the benchmark were the people who trained bigger networks on more data with better hardware. The architectural cleverness mattered, but it mattered less than the scaling.

That is the cliffhanger for the next several chapters: if scale is doing this much work, what exactly is it doing, and what is left for science to do once scale has done its part? That is the entire question of this book.