6 Pretraining LLMs

“Pretraining” is the part where you take a freshly initialized transformer with hundreds of billions of parameters and feed it the internet, optimizing next-token cross-entropy until something interesting happens. This chapter is about what that process actually looks like.

6.1 The objective

Next-token prediction. Given a sequence of tokens x_1, \ldots, x_T, minimize \mathcal{L}(\theta) = -\sum_{i=1}^{T} \log p_\theta(x_i \mid x_{<i}). That is it. Every other capability of an LLM — translating, summarizing, writing code, doing arithmetic, reasoning — emerges as a side effect of being very good at this one task. Whether that should have been surprising in advance is a useful philosophical question that no one has a satisfying answer to.
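As a minimal sketch of the objective above, the loss can be computed by summing the negative log-probability the model assigns to each observed token. The names `next_token_loss` and `uniform_model` are hypothetical, and the "model" here is a toy stand-in that ignores its context; a real p_\theta would be a transformer producing a distribution over the whole vocabulary.

```python
import math

def next_token_loss(token_ids, next_token_probs):
    """Sum of -log p(x_i | x_<i) over one sequence.

    token_ids: observed token ids x_1..x_T.
    next_token_probs: hypothetical callable mapping a prefix (list of ids)
        to a dict {token_id: probability} for the next token.
    """
    total = 0.0
    for i, target in enumerate(token_ids):
        probs = next_token_probs(token_ids[:i])  # condition on x_<i only
        total += -math.log(probs[target])
    return total

def uniform_model(prefix, vocab_size=4):
    # Toy stand-in for p_theta: ignores the prefix, uniform over 4 tokens.
    return {t: 1.0 / vocab_size for t in range(vocab_size)}

loss = next_token_loss([0, 3, 1], uniform_model)
print(loss)  # three tokens, each costing log(4) nats
```

A uniform model pays \log V per token for a vocabulary of size V, which is the loss a freshly initialized network starts near; training drives the per-token cost down from there.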

6.2 Tokens, not characters

Models do not see raw text. They see sequences of tokens from a fixed vocabulary of \sim 50\text{k}–200\text{k} symbols, produced by a byte-pair encoding (BPE) or similar subword tokenizer. Roughly: common words are one token, rare words are split into a few tokens, and arbitrary byte sequences fall back to single-byte tokens. Typical English text averages \sim 0.75 words per token, or about 1.3 tokens per word.
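The merge procedure behind BPE can be sketched in a few lines. This is a toy version under simplifying assumptions: it starts from characters rather than bytes, treats each word independently, and skips the pre-tokenization and special-token handling that production tokenizers use. The function names are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    corpus = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = train_bpe(["the"] * 5 + ["then"] * 2 + ["zebra"], num_merges=2)
print(encode("the", merges), encode("then", merges), encode("zebra", merges))
```

With only two merges, the frequent word "the" collapses to a single token while the rare word "zebra" stays split into its five characters, which is the common-versus-rare behavior described above.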

Tokenization is a leaky abstraction with real consequences:

Numerical reasoning is harder because numbers are tokenized inconsistently (“123” might be one token, “124” two).
Non-English languages can need 2–5× more tokens for the same content.
Code can tokenize poorly: indentation, runs of whitespace, and many short symbols each cost tokens unless the tokenizer was built with code in mind.

6.3 The data

A frontier pretraining dataset in 2025 contains \sim 10–30 trillion tokens. Composition varies, but a typical mix is:

~60–80% general web text (Common Crawl, filtered aggressively)
~10–20% code (GitHub-derived)
~5–15% high-quality curated text (books, papers, Wikipedia)
~5–10% multilingual data
A growing fraction of synthetic data, generated by previous models

Filtering is where the alpha lives. The most important single piece of preprocessing is deduplication, both exact and approximate, which improves loss meaningfully for any fixed training budget. Quality classifiers (small models trained to predict whether a document looks like a “good” page) are then used to upweight high-scoring documents or downsample low-scoring ones.
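A minimal sketch of the two deduplication passes, assuming exact matching by hash of normalized text and approximate matching by Jaccard similarity over word 3-grams. The threshold of 0.8, the shingle size, and all function names are illustrative assumptions. The pairwise comparison is quadratic, so pipelines at web scale typically approximate the Jaccard step with MinHash and locality-sensitive hashing instead.

```python
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())

def shingles(text, n=3):
    """Set of word n-grams used as a document fingerprint."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def deduplicate(docs, threshold=0.8):
    """Keep the first copy of each exact or near-duplicate document."""
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(doc)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "the quick  brown fox jumps over the lazy dog near the river bank today",
    "The quick brown fox jumps over the lazy dog near the river bank tonight",
    "Gradient descent minimizes a loss function by following its negative gradient",
]
kept = deduplicate(docs)
print(len(kept))
```

Here the second document is dropped by the exact pass (it differs only in case and spacing) and the third by the approximate pass (one word changed), leaving the first and fourth.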

6.4 The compute

A frontier 2025 run is on the order of 10^{25}–10^{26} FLOPs, spread across 10^4–10^5 GPUs for weeks. At Chinchilla-optimal ratios (\sim 20 tokens per parameter) this corresponds to models with roughly 300 to 900 billion parameters trained on roughly 6 to 18 trillion tokens; in practice models are smaller and trained on more tokens than that, because inference cost matters too.

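A useful conversion: C \approx 6ND FLOPs of training compute for a dense transformer with N parameters and D tokens. The 6 is two FLOPs per multiply-add times a factor of 3: one forward pass plus a backward pass that costs about twice as much, since it computes gradients with respect to both activations and weights.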
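The conversion can be turned around to size a model for a given budget. The sketch below assumes the \sim 20 tokens-per-parameter ratio quoted above and substitutes D = 20N into C \approx 6ND, giving N = \sqrt{C / 120}. The function names are hypothetical.

```python
import math

def training_flops(n_params, n_tokens):
    """C ~= 6 * N * D for a dense transformer."""
    return 6.0 * n_params * n_tokens

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Solve C = 6 * N * (r * N) for N, then set D = r * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

for budget in (1e25, 1e26):
    n, d = compute_optimal_split(budget)
    print(f"C={budget:.0e}: N ~ {n / 1e9:.0f}B params, D ~ {d / 1e12:.1f}T tokens")
```

Under these assumptions a 10^{25} FLOP budget works out to roughly 290 billion parameters on about 6 trillion tokens, and 10^{26} to roughly 910 billion parameters on about 18 trillion tokens. Because N scales as \sqrt{C}, a tenfold increase in compute only buys about a 3.2× larger model.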

6.5 Curriculum and data ordering

For a long time everyone assumed data order did not matter at scale; you just shuffled and trained. Recent work has complicated this picture: data order matters somewhat, especially in the final phase. Data ablations (training small models with subsets of the data and measuring loss) are now a standard tool for deciding what to include.

Many frontier runs use a two-phase schedule: a long “main” phase on diverse web data with cosine learning-rate decay, followed by a short “cooldown” or “annealing” phase on higher-quality data while the learning rate is decayed to near zero. The cooldown has an outsized effect on final quality relative to its share of compute.
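One way such a schedule might look in code is sketched below. The specific numbers are assumptions for illustration only: a peak learning rate of 3e-4, a cosine decay that bottoms out at 10% of peak, a cooldown covering the last 10% of steps with a linear decay to zero, and no warmup. The function names are hypothetical, and the sketch assumes both phases contain at least one step.

```python
import math

def cooldown_start_step(total_steps, cooldown_frac=0.1):
    # First step of the cooldown phase; round() avoids float truncation surprises.
    return total_steps - round(total_steps * cooldown_frac)

def learning_rate(step, total_steps, peak_lr=3e-4, cooldown_frac=0.1, main_floor=0.1):
    """Cosine decay to main_floor * peak_lr, then linear cooldown to zero."""
    start = cooldown_start_step(total_steps, cooldown_frac)
    if step < start:
        progress = step / start
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
        return peak_lr * (main_floor + (1.0 - main_floor) * cosine)
    # Cooldown: linear decay from the main-phase floor toward zero.
    remaining = (total_steps - step) / (total_steps - start)
    return peak_lr * main_floor * remaining

def data_phase(step, total_steps, cooldown_frac=0.1):
    """Which data mix to sample from at this step."""
    if step >= cooldown_start_step(total_steps, cooldown_frac):
        return "cooldown"
    return "main"

for step in (0, 450, 900, 950, 1000):
    print(step, data_phase(step, 1000), learning_rate(step, 1000))
```

Keeping the phase boundary in one helper means the data mix and the learning rate switch at the same step, which is the coupling the two-phase description relies on.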

6.6 What you get after pretraining

A model that is very good at predicting the next token on documents drawn from a distribution that looks like its training data. That is not the same thing as a useful assistant. The model will continue prompts, complete patterns, repeat itself, and confabulate fluently. It will not refuse harmful requests, follow instructions reliably, or know when to stop.

Turning a pretrained model into a usable product is the subject of Section 7.1 and Section 8.1.