2 The modern AI landscape
2.1 Why this book exists
The shape of AI changed between 2017 and 2023. Before, the field was a zoo: convolutional networks for vision, recurrent networks for sequences, dozens of specialized architectures for individual tasks, and a healthy debate over which inductive biases mattered. After, a single recipe (a transformer trained on a lot of data with a lot of compute) was matching or beating the specialized approaches in nearly every modality, often by margins that no amount of clever architecture had previously achieved.
This is sometimes called the bitter lesson (Sutton 2019): methods that scale with computation reliably win over methods that encode human cleverness. The bitter lesson is not new, but the last few years made it inescapable.
2.2 What “AI” means in this book
When we say AI we will almost always mean some variation of:
A large neural network, trained by gradient descent on a self-supervised or supervised objective, possibly post-trained with reinforcement learning, deployed by sampling from its outputs.
That covers GPT-style language models, diffusion image models, code models, multimodal models, and most of what you have read about in the news. It does not cover classical symbolic AI, expert systems, or planners, which are largely separate traditions.
2.3 The three resources
Every modern AI system trades off three things:
- Compute (C) — floating-point operations spent during training, measured in FLOPs or A100/H100/B200-hours.
- Data (D) — number of tokens, images, or examples seen during training.
- Parameters (N) — size of the model.
The empirical claim of scaling laws is that loss decreases as a smooth power law in each of these, provided the other two are not the bottleneck, and that for a fixed compute budget there is an optimal (N, D) allocation. We will get to this in Section 5.1.
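To make the idea of an optimal (N, D) allocation concrete, here is a minimal sketch. It rests on two assumptions that are not stated in this chapter: the common rule of thumb that training compute is roughly C ≈ 6ND, and a loss of the form L(N, D) = E + A/N^α + B/D^β. The constants below are illustrative placeholders, not fitted values, and the function names are hypothetical.

```python
# Sketch: compute-optimal split of a fixed budget C between parameters N
# and training tokens D, under an ASSUMED power-law loss.
#
#   L(N, D) = E + A / N**ALPHA + B / D**BETA
#
# All constants are illustrative placeholders, not measurements.
E, A, B = 1.7, 400.0, 400.0
ALPHA, BETA = 0.34, 0.28

# Rule-of-thumb assumption: C ~ 6 * N * D training FLOPs.
FLOPS_PER_PARAM_TOKEN = 6.0


def loss(n_params: float, n_tokens: float) -> float:
    """Assumed parametric loss: irreducible term plus two power laws."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def optimal_allocation(compute: float) -> tuple[float, float]:
    """Return (N, D) minimizing loss subject to 6 * N * D = compute.

    Substituting D = compute / (6 * N) and setting dL/dN = 0 gives
    N**(ALPHA + BETA) = (ALPHA * A) / (BETA * B) * (compute / 6)**BETA.
    """
    budget = compute / FLOPS_PER_PARAM_TOKEN  # the product N * D
    exponent = 1.0 / (ALPHA + BETA)
    n_opt = (ALPHA * A / (BETA * B)) ** exponent * budget ** (BETA * exponent)
    d_opt = budget / n_opt
    return n_opt, d_opt


if __name__ == "__main__":
    for c in (1e20, 1e22, 1e24):
        n, d = optimal_allocation(c)
        print(f"C={c:.0e}  N={n:.3e}  D={d:.3e}  loss={loss(n, d):.4f}")
```

Under these assumptions both N and D grow as power laws in C, with exponents β/(α+β) and α/(α+β) respectively. The exponents and constants, not this particular code, are what an empirical fit would have to supply.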
2.4 A physicist’s first reaction
Most physicists, on first contact, want to ask one of two questions:
- Is there a theory of why this works?
- Why are the architectures so ugly?
The honest answers are “not really, yet” and “because we found them empirically.” The field has theory (neural tangent kernels, mean-field analyses, feature-learning theory), but none of it predicts the empirical scaling curves quantitatively, and none of it told anyone to build a transformer. This book will mostly take the empirical view, because that is where the predictive power currently lives.
2.5 What follows
The next chapter is a short primer on deep learning for readers who have not done it before — what a model is, how gradient descent works, and how networks are trained at scale. Everything after that builds on it.