3 A deep learning primer
This chapter is for readers who have never trained a neural network. If you have, skip it.
3.1 A model is a parameterized function
A neural network is just a function f_\theta : \mathcal{X} \to \mathcal{Y} with a vector of parameters \theta \in \mathbb{R}^N. The number N can be anything from 10^3 to 10^{12}.
Training is the act of choosing \theta so that f_\theta does well on some objective. “Does well” is measured by a loss function \mathcal{L}(\theta), evaluated on a dataset: \mathcal{L}(\theta) = \frac{1}{|D|} \sum_{(x, y) \in D} \ell\big(f_\theta(x),\, y\big).
For regression \ell is often mean squared error; for classification it is cross-entropy; for language modeling it is the same cross-entropy applied to predicting the next token.
3.2 Gradient descent
We minimize \mathcal{L} by descending its gradient: \theta_{t+1} = \theta_t - \eta\, \nabla_\theta \mathcal{L}(\theta_t). In practice \mathcal{L} is estimated on a small minibatch of examples (stochastic gradient descent), and the bare update is replaced by an adaptive optimizer — usually Adam or its weight-decoupled variant AdamW. The optimizer state typically doubles or triples the per-parameter memory cost.
The learning rate \eta is the most important hyperparameter in deep learning. Almost all training pipelines schedule it: warmup linearly for a few thousand steps, then decay (cosine, linear, or constant-with-cooldown) over the run.
3.3 Backpropagation in one line
Computing \nabla_\theta \mathcal{L} exactly for a deep network is what backprop does. Forward pass: evaluate f_\theta(x) while saving intermediate activations. Backward pass: apply the chain rule layer by layer, multiplying Jacobians from output to input. The cost of the backward pass is a small constant times the forward pass — call it 2×. This is the only reason training deep networks is tractable.
3.4 What a layer is
For our purposes a layer is a parameterized map \mathbb{R}^d \to \mathbb{R}^{d'} that we can differentiate through. The three you must know:
- Linear: y = Wx + b.
- Nonlinearity: y_i = \sigma(x_i) for some elementwise \sigma (ReLU, GELU, SwiGLU).
- LayerNorm: y = \gamma \odot (x - \mu)/\sigma + \beta, with \mu, \sigma computed over the feature dimension.
The transformer (Section 4.1) is built almost entirely out of these three plus attention.
3.5 Training at scale
Three things make modern training different from the textbook version:
- Mixed precision. Weights are kept in 32-bit; activations and gradients in 16-bit (bf16, fp16) or 8-bit (fp8). Halves the memory cost and roughly doubles throughput on modern GPUs.
- Distributed training. The model and/or batch are split across many GPUs. Data parallelism (each GPU sees a different batch), tensor parallelism (each weight matrix is split across GPUs), and pipeline parallelism (different layers on different GPUs) are combined depending on model size.
- Gradient accumulation. Effective batch size can exceed what fits in memory by accumulating gradients across micro-batches.
Frontier training runs in 2024–2025 used \sim 10^4 – 10^5 GPUs for weeks. The infrastructure to make that not fall over is a substantial fraction of the engineering effort.
3.6 The rest of the book
Everything from here on assumes you are comfortable with: a model is a function with parameters; we train it by minibatch gradient descent on a loss; the only hard part is doing it at scale. If that sentence is fine, the rest of the book is downstream of it.