11 Inference

Training a model and serving a model are completely different problems. This chapter is about the second.

It matters because (a) for a heavily used deployed system, cumulative serving cost dominates, often exceeding the one-time training cost by 100× over the system's lifetime; and (b) the techniques used in production now bleed back into training-time decisions (model shape, head count, attention variant).

11.1 What makes LLM inference different

Two phases per request:

Prefill. Run the prompt through the model once to populate the key-value cache. This is a single forward pass over T_{\text{prompt}} tokens — compute-bound, similar to training in cost structure.
Decode. Generate output tokens one at a time. Each step is a forward pass over one token, using the cached keys/values for all previous positions. This is memory-bandwidth-bound, not compute-bound, because you read all model weights from HBM to compute one token.

The asymmetry is the source of every interesting inference optimization. Modern GPUs are vastly over-provisioned for FLOPs relative to memory bandwidth during decode. Almost every technique below is about squeezing more useful work out of each weight read.
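The asymmetry can be put into rough numbers. The sketch below is a back-of-envelope estimate, not a benchmark: it assumes a hypothetical accelerator with 3000 GB/s of HBM bandwidth, counts about 2 FLOPs per parameter per token, and ignores KV cache reads. The function names are illustrative.

```python
def decode_step_lower_bound_ms(n_params, bytes_per_param, hbm_bandwidth_gb_s):
    """Lower bound on one decode step: time to stream every weight from HBM once."""
    weight_bytes = n_params * bytes_per_param
    seconds = weight_bytes / (hbm_bandwidth_gb_s * 1e9)
    return seconds * 1e3


def flops_per_weight_byte(batch_size, bytes_per_param):
    """Approximate arithmetic intensity of decode: about 2 FLOPs per parameter
    for each token in the batch, divided by the bytes read per parameter."""
    return 2.0 * batch_size / bytes_per_param


# Assumed numbers: 70B parameters in fp16, 3000 GB/s of bandwidth (hypothetical).
latency_ms = decode_step_lower_bound_ms(70e9, 2, 3000)
print(f"at least {latency_ms:.1f} ms per decode step")

# One weight read does 1 FLOP per byte at batch size 1, and 64 at batch size 64.
print(flops_per_weight_byte(1, 2), flops_per_weight_byte(64, 2))
```

Under these assumptions a single request cannot decode faster than about one token every 47 ms, however many FLOPs the chip has, while a batch of 64 gets 64 tokens out of the same weight read.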

11.2 The KV cache

For a transformer of L layers with hidden dimension d, attending over T context tokens, the cache size per request under standard multi-head attention is roughly \text{KV cache} \approx 2 \cdot L \cdot T \cdot d \cdot \text{bytes per element}, where the factor of 2 counts keys and values. For a 70B model with 80 layers, d = 8192, fp16, and a 32k-context window, this is \sim 80 GB per request. The fp16 weights of such a model take about 140 GB, so at long context a handful of concurrent requests is enough for the KV cache to dwarf the model weights.
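The arithmetic is easy to check. In the sketch below the hidden dimension is split into key/value heads; the split of d = 8192 into 64 heads of size 128 is an assumption made for illustration, as is the 8-head variant, which shows what happens when several query heads share one key/value head.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, context_len, n_kv_heads, head_dim, bytes_per_elem):
    """Per-request KV cache size; the leading 2 counts keys and values."""
    return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem


# Full multi-head attention: n_kv_heads * head_dim equals d = 8192 (assumed split).
full = kv_cache_bytes(80, 32 * 1024, n_kv_heads=64, head_dim=128, bytes_per_elem=2)

# Hypothetical variant with 8 shared key/value heads instead of 64.
shared = kv_cache_bytes(80, 32 * 1024, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

print(full / GIB, shared / GIB)
```

With these inputs the full-attention cache comes to 80 GiB and the 8-head variant to 10 GiB.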

Consequences:

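Grouped-query attention (GQA) and multi-query attention (MQA) share keys/values across multiple query heads. GQA typically shrinks the cache by 4–8×; MQA keeps a single key/value head, so it shrinks the cache by the full number of query heads.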
Sliding window attention caps how far back each token attends, making the cache size constant in T rather than linear.
KV cache quantization to int8 or fp8 is widely supported by serving stacks; int4 is more aggressive and can cost quality at long context (Section 11.6).
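Of these, the sliding window is the simplest to illustrate. The following is a minimal sketch, assuming a toy cache that stores one (position, key, value) entry per token; the class name and the string stand-ins for key/value tensors are hypothetical.

```python
from collections import deque


class SlidingWindowCache:
    """Keeps key/value entries for only the most recent `window` positions."""

    def __init__(self, window):
        # A bounded deque evicts the oldest entry automatically when full.
        self.entries = deque(maxlen=window)

    def append(self, position, key, value):
        self.entries.append((position, key, value))

    def positions(self):
        return [p for p, _, _ in self.entries]


cache = SlidingWindowCache(window=4)
for t in range(10):
    cache.append(t, key=f"k{t}", value=f"v{t}")

print(cache.positions())  # [6, 7, 8, 9]
```

However long the sequence grows, the cache never holds more than `window` entries, which is the sense in which its size stops depending on T.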

11.3 Batching, especially continuous batching

GPUs reach their peak efficiency only when many requests are processed in parallel. Static batching (wait for N requests, then run them together) wastes time because requests have wildly different lengths, so short requests hold their slots idle until the longest one in the batch finishes. Continuous batching (a.k.a. in-flight batching) instead schedules generation step by step: as soon as a request finishes, a new one takes its slot.

This is the single most important serving optimization. The throughput improvement over naive serving is typically 5–20×.
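A toy simulation makes the difference concrete. It is a sketch under simplifying assumptions: every decode step costs the same regardless of how many slots are busy, prefill is ignored, and each request needs at least one output token. The function names and the workload are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Hypothetical workload: output lengths in tokens, two long requests among short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)  # 200 105
```

In this example the static scheduler spends most of its steps with three of four slots empty, while the continuous scheduler overlaps the two long requests.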

11.4 PagedAttention

The KV cache for many requests of varying lengths fragments GPU memory badly. PagedAttention (vLLM) borrows the idea of virtual memory pages: store the cache in fixed-size blocks with a logical-to-physical mapping. This eliminates external fragmentation, limits waste to the unfilled tail of each request's last block, and makes it easy to share cache blocks across requests with the same prefix.

This is why vLLM became the dominant open-source serving stack — PagedAttention plus continuous batching gives you most of the throughput of a hand-tuned inference engine for free.
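The bookkeeping behind the paging idea can be sketched in a few lines. This is an illustration of a block table, not vLLM's actual implementation: the class, its methods, and the pool sizes are hypothetical, and no tensors are stored.

```python
class PagedKVAllocator:
    """Maps each sequence's logical token positions to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    last_slot = alloc.append_token("a")  # 20 tokens need 2 blocks
for _ in range(16):
    alloc.append_token("b")              # 16 tokens need exactly 1 block
blocks_a = len(alloc.block_tables["a"])
alloc.free("a")
free_after = len(alloc.free_blocks)
print(blocks_a, free_after)  # 2 3
```

Blocks need not be contiguous, so any free block can serve any sequence, and the only wasted space is the unused part of a sequence's last block.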

11.5 Speculative decoding

Use a small draft model to generate k candidate tokens, then verify them with one forward pass of the big target model. If the target agrees with the first j drafts, accept all j. The same pass also supplies the target's own prediction for position j+1, which replaces the first rejected draft, or extends the sequence by one bonus token when all k drafts are accepted. Each round therefore yields between 1 and k+1 tokens.

With greedy decoding, or with the rejection-sampling acceptance rule when sampling, this is exact: the output distribution is unchanged. The speedup is bounded by how often the draft model agrees with the target. In practice, reported speedups are around 2–4× on standard hardware. Variants include Medusa heads (small extra decoding heads on the target itself play the role of the draft), lookahead decoding, and EAGLE.
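The greedy case can be sketched with toy models. Everything here is an assumption made for illustration: the "models" are plain functions over integer tokens, the target counts upward modulo 10, and the draft agrees with it except after token 4. A real system would score all drafted positions in one batched target pass; the sketch calls the target once per position for clarity.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. Draft k tokens autoregressively with the cheap model.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))
    # 2. Verify drafts left to right against the target's greedy choice.
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
    # 3. All k drafts matched: the target also provides one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix, draft_next, target_next, k, n_tokens):
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out += speculative_step(out, draft_next, target_next, k)
    return out[len(prefix):len(prefix) + n_tokens]


def greedy(prefix, next_token, n_tokens):
    """Reference: plain one-token-at-a-time greedy decoding."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(next_token(out))
    return out[len(prefix):]


def target_next(seq):  # toy target model
    return (seq[-1] + 1) % 10


def draft_next(seq):   # toy draft model: wrong whenever the last token is 4
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


print(speculative_step([2], draft_next, target_next, k=4))  # [3, 4, 5]
print(generate([2], draft_next, target_next, 4, 12) == greedy([2], target_next, 12))  # True
```

The second line of output is the exactness property in miniature: the speculative loop produces the same tokens as decoding with the target alone, only in fewer target rounds.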

11.6 Quantization

Run the model at lower precision than it was trained in. The interesting regimes for LLMs:

int8 weights, fp16 activations. Almost no quality loss, ~2× memory and bandwidth savings. Standard.
int4 weights (GPTQ, AWQ). ~4× memory savings, small quality cost. Standard for serving on consumer hardware.
fp8 weights and activations. Native support on H100/B200, lets you cut inference cost roughly in half versus fp16 with negligible quality loss. Standard at frontier labs in 2025.
int4 / fp4 with KV cache quantization. The cutting edge; can sometimes hurt quality on long contexts.
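The simplest of these schemes, symmetric int8 weight quantization, can be sketched directly. This is the basic absmax scheme with one scale per row, shown on a plain Python list; it is not GPTQ or AWQ, which choose the quantized values more carefully, and the example row is made up.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    return [scale * v for v in q]


row = [0.02, -1.27, 0.5, 0.63]  # hypothetical weight row
q, scale = quantize_int8(row)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))

print(q)  # [2, -127, 50, 63]
print(max_err <= scale / 2 + 1e-12)  # True: error is at most half a quantization step
```

Each weight is stored in one byte plus a shared scale, which is where the roughly 2× saving over fp16 comes from. A single outlier in a row inflates the scale for every other entry, which is one reason the more elaborate methods exist.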

11.7 Mixture of experts

Many frontier models use sparse mixture-of-experts (MoE): each MLP block has E “expert” sub-networks, and a router picks the top-k (k=2 typically) for each token. Total parameter count N_{\text{total}} is large, but active parameters per token N_{\text{active}} is much smaller (often 1/8 to 1/4 of the total).

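For inference this is great in principle: only the active experts are used per token. In practice the routing is uneven, all N_{\text{total}} parameters must still be resident in memory, and within a batch different tokens select different experts, so most expert weights end up being read on every step. Serving efficiently requires careful work. The win is real but smaller than the raw N_{\text{total}} / N_{\text{active}} ratio suggests.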
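A small routing sketch shows both the mechanism and the load imbalance. It assumes one common convention, a softmax taken over the selected top-k logits only; other models normalize differently. The router logits below are invented for illustration.

```python
import math


def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda e: router_logits[e], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]


def expert_load(batch_logits, num_experts, k=2):
    """Count how many tokens in the batch are routed to each expert."""
    load = [0] * num_experts
    for logits in batch_logits:
        for e, _ in route_top_k(logits, k):
            load[e] += 1
    return load


# Hypothetical router logits: 4 tokens, 4 experts.
tokens = [
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 0.2, 1.4, 0.0],
    [3.0, 0.0, 0.0, 2.5],
    [0.1, 0.0, 2.0, 1.0],
]
load = expert_load(tokens, num_experts=4)
print(load)  # [3, 1, 2, 2]
```

Each token uses only two experts, yet this batch of four tokens touches all four, and expert 0 receives three times the work of expert 1.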

11.8 Prefix caching

Many requests share the same long system prompt. If you can cache the KV state at the end of that prompt, every subsequent request that starts with it skips the prefill for the shared prefix and only has to process its own suffix. This is a 10–100× speedup in time to first token for short queries against a long system prompt (common in RAG, agent loops, and chat), and it is supported natively by most modern serving stacks.
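One way to organize such a cache is by fixed-size blocks of tokens. The sketch below only tracks which blocks would be cached and counts the tokens left to prefill; it stores no KV tensors, and the class name, block size, and token ids are hypothetical.

```python
class PrefixCache:
    """Tracks which token-prefix blocks already have KV state available."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cached = set()  # each key stands in for one block of stored KV state

    def _block_keys(self, tokens):
        # The key for block i covers every token up to the end of block i,
        # because a block's KV state depends on everything before it.
        n_full = len(tokens) // self.block_size
        return [tuple(tokens[:(i + 1) * self.block_size]) for i in range(n_full)]

    def tokens_to_prefill(self, tokens):
        """Return how many tokens still need a prefill pass, then cache the prompt."""
        keys = self._block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.cached:
                break
            hits += 1
        self.cached.update(keys)
        # A real server would still recompute at least the final token to get logits.
        return len(tokens) - hits * self.block_size


system_prompt = list(range(100, 116))  # 16 made-up token ids
cache = PrefixCache(block_size=4)
first = cache.tokens_to_prefill(system_prompt + [1, 2, 3])
second = cache.tokens_to_prefill(system_prompt + [7, 8, 9, 10, 11])
print(first, second)  # 19 5
```

The first request pays for the whole prompt; the second reuses the four cached blocks of the system prompt and prefills only its five new tokens.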

11.9 Why this all matters

Reasoning models (Section 9.1) can generate 10^3–10^5 tokens per query. Without continuous batching, PagedAttention, speculative decoding, and quantization, serving them at scale would not be economically viable. The inference stack and the training stack co-evolve: a new attention variant or a new precision format gets adopted because it makes serving cheaper, which means the model shape is downstream of inference economics, not just training economics.