9 Reasoning and test-time compute

Until 2023, the way to make an LLM smarter was to make it bigger. In 2024–2025 a second axis opened up: spend more compute at inference time. A small model that is allowed to “think” for ten thousand tokens before answering can outperform a much larger model that answers immediately.

This chapter is about how that works and why it changed the trajectory of the field.

9.1 Chain-of-thought

The seed of the idea is chain-of-thought (CoT) prompting: ask the model to “think step by step” before giving a final answer. Empirically, simply allocating more tokens before the answer improves accuracy on math, logic, and multi-step reasoning, even without any training change. The model has learned, from pretraining, that texts containing reasoning steps tend to reach correct conclusions, and prompting it to produce such text routes the computation through that distribution.
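As a minimal sketch of what this looks like in practice, the snippet below wraps a question in a step-by-step instruction and pulls the final answer out of the completion. The helper names, the "Answer:" line convention, and the example completion are assumptions made for illustration; no particular model API is implied.

```python
import re
from typing import Optional


def build_cot_prompt(question: str) -> str:
    """Wrap a question so the model writes its reasoning before the answer."""
    return (
        f"Question: {question}\n"
        "Think step by step, then give the result on a final line "
        "formatted as 'Answer: <value>'.\n"
    )


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the value on the last 'Answer:' line, or None if there is none."""
    matches = re.findall(r"^Answer:\s*(.+?)\s*$", completion, flags=re.MULTILINE)
    return matches[-1] if matches else None


# Hand-written stand-in for a model completion (not real model output).
example_completion = (
    "There are 3 boxes with 4 apples each, so 3 * 4 = 12 apples.\n"
    "Two are eaten, leaving 12 - 2 = 10.\n"
    "Answer: 10\n"
)

prompt = build_cot_prompt("3 boxes hold 4 apples each; 2 are eaten. How many remain?")
final_answer = extract_final_answer(example_completion)
```

Separating the reasoning from a machine-readable final line is a design choice that makes the answer easy to score automatically.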

This observation alone, that a model has per-query capability it does not use by default, was enough to motivate the next step.

9.2 Reasoning models

A reasoning model is one that has been post-trained, typically with reinforcement learning with verifiable rewards (RLVR), to produce long chains of thought before answering. The training loop looks roughly like this:

Take a problem with a known answer.
Sample a long completion (often 10^3–10^5 tokens) ending in a final answer.
Reward = 1 if the final answer is correct, 0 otherwise.
Optimize the policy with GRPO or PPO against this reward.

Crucially, the model is not supervised on the intermediate reasoning steps — only on whether the final answer is right. The model invents whatever reasoning trajectory works. The resulting traces often look strange: backtracking (“Wait, that’s not right, let me reconsider…”), self-questioning, exploring multiple branches, sometimes thousands of tokens of dead-end calculation before the right answer.
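The outcome-only reward described in this section can be sketched in a few lines. The example below scores several sampled completions for one problem by their final answer alone, then standardizes the rewards within the group, in the style of GRPO. It is an illustration under stated assumptions, not a training implementation: the function names and sampled answers are hypothetical, exact string match stands in for a real answer checker, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev
from typing import List


def outcome_reward(final_answer: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the other samples for the same problem."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical final answers from four sampled completions of one problem.
sampled_answers = ["42", "41", "42", "7"]
rewards = [outcome_reward(a, reference="42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
```

Note that nothing here inspects the reasoning text. Note also that when every sample in a group is right, or every sample is wrong, all advantages are zero, so that problem contributes no learning signal.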

9.3 Why this is a different kind of scaling

For pretraining, loss falls roughly as a power law in training compute. For reasoning, an analogous power-law-shaped curve appears in inference compute for a fixed model: accuracy on hard problems improves smoothly, with diminishing returns, as you let the model generate more tokens per problem.

This is qualitatively new. It means there are two knobs to spend compute on:

Train a bigger model on more data (more train compute).
Let the existing model think for longer (more inference compute).

These trade off. For many domains, a small reasoning-trained model with a large test-time budget is cheaper per correct answer than a giant model that one-shots the answer. This has reshaped how labs allocate compute.
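A small worked example makes the comparison concrete. The sketch below computes expected cost per correct answer as (price per token × tokens per query) / accuracy, which assumes independent attempts and a way to recognize a correct answer. Every number is invented for illustration and does not describe any real model or price list.

```python
def cost_per_correct_answer(
    price_per_token: float, tokens_per_query: int, accuracy: float
) -> float:
    """Expected spend to obtain one correct answer."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return price_per_token * tokens_per_query / accuracy


# Hypothetical small reasoning model: cheap tokens, long trace, high accuracy.
small_reasoner = cost_per_correct_answer(
    price_per_token=1e-6, tokens_per_query=20_000, accuracy=0.80
)

# Hypothetical large model answering directly: pricey tokens, short answer.
large_one_shot = cost_per_correct_answer(
    price_per_token=3e-5, tokens_per_query=500, accuracy=0.40
)
```

With these made-up inputs the small model comes out at 0.025 per correct answer against 0.0375 for the large one. Different prices or accuracies can reverse the ordering, which is why the trade-off has to be measured per domain.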

9.4 Self-consistency and majority vote

A simpler form of test-time scaling, predating reasoning training, is self-consistency: sample k independent answers and take the majority vote over final answers (for problems with verifiable answers) or the most common rationale (for open-ended tasks). Accuracy improves with k, with diminishing returns. This is essentially the same idea (spend more inference compute, get a better answer) without changing the model.
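Majority voting over final answers is short enough to sketch directly. In the example below the sampled answers are hard-coded stand-ins for k model samples, and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of samples that agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize whitespace so "10" and "10 " count as the same answer.
    counts = Counter(a.strip() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical final answers from k = 5 independent samples.
samples = ["10", "12", "10", "10 ", "9"]
winner, share = majority_vote(samples)
```

The vote share doubles as a rough confidence signal: a narrow majority suggests the problem deserves more samples or a stronger check.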

9.5 Verifiers and search

Beyond raw sampling, a model can use a verifier to score its own outputs:

For math: a separate “process reward model” or step-checker.
For code: literally running the code against test cases.
For proofs: a formal proof assistant.

With a verifier you can do tree search, beam search, or best-of-N sampling. This is one of the main reasons code and math benchmarks have moved faster than other domains in 2024–2026: the verifier is cheap and reliable.
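Best-of-N sampling with a code verifier can be sketched as follows. Each candidate is a source string expected to define a function named solve, the verifier runs it against test cases, and the highest-scoring candidate wins. The candidates, the tests, and the name solve are assumptions for illustration. A real system would run untrusted code in a sandbox rather than with a bare exec.

```python
from typing import Any, List, Tuple

TestCase = Tuple[Tuple[Any, ...], Any]  # (arguments, expected result)


def verifier_score(candidate_source: str, tests: List[TestCase]) -> float:
    """Fraction of test cases passed by the candidate's `solve` function."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # unsafe outside a sandbox
        solve = namespace["solve"]
    except Exception:
        return 0.0  # does not compile or does not define `solve`
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)


def best_of_n(candidates: List[str], tests: List[TestCase]) -> str:
    """Pick the candidate with the highest verifier score."""
    return max(candidates, key=lambda src: verifier_score(src, tests))


# Hypothetical sampled programs for the task "add two integers".
candidates = [
    "def solve(a, b):\n    return a - b\n",
    "def solve(a, b):\n    return a * b\n",
    "def solve(a, b):\n    return a + b\n",
]
tests = [((1, 2), 3), ((0, 0), 0), ((2, 2), 4)]
chosen = best_of_n(candidates, tests)
```

The same scoring function could rank partial programs inside a beam or tree search. Best-of-N is simply the case where only complete candidates are scored.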

9.6 What this changed

Three things, broadly:

The frontier moved. Benchmarks on which models solved about 30% of problems in 2023 (competition math, hard code, scientific reasoning) reached 80–95% in 2025, almost entirely from RL-on-reasoning rather than from bigger pretraining.
Inference cost matters more. A frontier reasoning model can spend 10^4–10^6 tokens per query. Serving such models efficiently is now a first-class problem (see Section 11.1).
The unit of “model capability” is no longer just parameters. It is parameters × token budget. Reporting one without the other is increasingly meaningless.
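To see why the product matters, a common rule of thumb puts the forward-pass cost of a dense transformer at about 2 FLOPs per parameter per generated token. The sketch below applies that approximation to two hypothetical configurations. It ignores attention cost, which grows with context length and is not negligible for long traces, and the model sizes and token counts are invented.

```python
def forward_flops(num_parameters: float, tokens_generated: int) -> float:
    """Rough dense-transformer estimate: about 2 FLOPs per parameter per token."""
    return 2.0 * num_parameters * tokens_generated


# Hypothetical 7B-parameter model thinking for 50,000 tokens.
small_long = forward_flops(7e9, 50_000)

# Hypothetical 70B-parameter model answering in 500 tokens.
large_short = forward_flops(70e9, 500)

ratio = small_long / large_short
```

Under these assumptions the smaller model spends roughly ten times the compute of the larger one on a single query (7e14 against 7e13 FLOPs), so parameter count alone says little about what an answer costs.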

9.7 Open questions

Two big ones, as of 2026:

How far does this scale on non-verifiable tasks? Verifiable rewards (math, code) cleanly drove the gains. For open-ended reasoning (research, persuasion, planning) it is less clear what the reward signal is, and progress is correspondingly slower.
What is actually happening in the long thinking traces? Some of it is genuine search; some of it looks like the model performing for its training reward in ways that may or may not generalize. The interpretability work here is active and unsettled.