Multimodal models

The main spine of this book is text-centric, mostly because the most legible scientific progress in the post-2020 era has been on language models. The reality of deployed AI in 2026 is much more multimodal: production systems consume images and text together, audio and text together, video, screen captures, and increasingly all of these in real time. This chapter is a launching point into the multimodal frontier.

VLMs: vision-language models, demystified

The basic vision-language model architecture is surprisingly simple to state. Take a transformer that you would otherwise use as a pure language model. Then, when a user provides an image, cut the image up into patches, project each patch into the model’s embedding space, and splice those patch embeddings into the token stream, often with marker tokens indicating where the image starts and ends. The transformer treats the image patches as just more tokens; its attention machinery can mix freely between text tokens and image tokens.
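To make “images become tokens” concrete, here is a minimal sketch in plain Python. Everything named in it is an assumption for illustration: the 4×4 grayscale “image,” the 2×2 patch size, the random projection matrix `W_proj`, the toy embedding table `embed`, and the `<img>` and `</img>` marker tokens. Real systems use a trained vision encoder rather than a single random linear map.

```python
import random

D_MODEL = 8   # hypothetical embedding width
PATCH = 2     # hypothetical patch side length
rng = random.Random(0)

def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch blocks, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "sketch assumes dimensions divide evenly"
    return [
        [image[r][c] for r in range(top, top + patch) for c in range(left, left + patch)]
        for top in range(0, h, patch)
        for left in range(0, w, patch)
    ]

def project(vec, weight):
    """Linear projection; weight holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

# Toy text embedding table and an untrained patch projection (both illustrative).
embed = {tok: rand_vec(D_MODEL) for tok in ["<img>", "</img>", "what", "is", "this", "?"]}
W_proj = [rand_vec(PATCH * PATCH) for _ in range(D_MODEL)]

def build_sequence(prompt_tokens, image):
    """Return the list of vectors the transformer would see: markers, patches, then text."""
    patch_embs = [project(p, W_proj) for p in patchify(image, PATCH)]
    seq = [embed["<img>"]] + patch_embs + [embed["</img>"]]
    seq += [embed[t] for t in prompt_tokens]
    return seq

image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
seq = build_sequence(["what", "is", "this", "?"], image)
print(len(seq), len(seq[0]))  # 4 patches + 2 markers + 4 text tokens, each of width 8
```

After `build_sequence`, nothing downstream needs to know which positions came from pixels: the sequence is just a list of equally sized vectors.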

That is essentially the whole architectural trick at the level a textbook reader needs. The detailed design space is rich (which vision encoder to use upstream, whether to share or separate parameters across modalities, how to resample images of arbitrary aspect ratios, how to handle high-resolution images that would produce too many patches), but the conceptual core is “images become tokens.”
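The “too many patches” problem is easy to see with a little arithmetic. The sketch below counts patches for a given resolution and shows one simple response, downscaling until the count fits a budget. The patch size of 14, the budget of 1024 patches, and the helper names are assumptions chosen for illustration; deployed models use a range of strategies (tiling, resampling, token pooling) that this does not attempt to reproduce.

```python
import math

def num_patches(height, width, patch=14):
    """Patches needed to cover the image, counting partial patches at the edges."""
    return math.ceil(height / patch) * math.ceil(width / patch)

def fit_to_budget(height, width, patch=14, max_patches=1024):
    """Shrink the image, roughly preserving aspect ratio, until it fits the patch budget."""
    scale = min(1.0, math.sqrt(max_patches * patch * patch / (height * width)))
    h = max(patch, int(height * scale))
    w = max(patch, int(width * scale))
    # Rounding up at the edges can still overshoot the budget; shrink a little more if so.
    while num_patches(h, w, patch) > max_patches:
        h = max(patch, int(h * 0.95))
        w = max(patch, int(w * 0.95))
    return h, w

print(num_patches(224, 224))     # 16 * 16 = 256
print(num_patches(1024, 1024))   # 74 * 74 = 5476
print(fit_to_budget(4032, 3024)) # a phone-camera-sized image, shrunk to fit
```

Patch count grows with the area of the image, so doubling the resolution roughly quadruples the sequence length, which is why high-resolution inputs need special handling.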

The reason this works at all is that the transformer’s content-based attention does not really care what the tokens are. A patch embedding and a word embedding are both just vectors. As long as the joint training distribution has enough cross-modal supervision (image captions, visual question answering, OCR pairs, instruction-following with images), the model learns to use the same attention heads to relate text and image content. This is one of the reasons the transformer has held its position as the architectural substrate of frontier models: it generalizes to new modalities by embedding them, not by re-engineering its core.
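The point that attention is indifferent to where a vector came from can be seen in a toy single-head, scaled dot-product attention with no learned projections. The three hand-written vectors and their labels are invented for illustration; in a trained model the geometry comes from cross-modal supervision, not from hand placement.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query. Note: no modality argument anywhere."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights

# Illustrative vectors: one "text" token and two "image patch" tokens.
tokens = [
    ("text:cat",  [1.0, 0.0, 0.2]),
    ("patch:fur", [0.9, 0.1, 0.0]),
    ("patch:sky", [0.0, 1.0, 0.0]),
]
vectors = [v for _, v in tokens]

# Use the text token as the query over the whole mixed sequence.
_, weights = attend(vectors[0], vectors, vectors)
for (name, _), w in zip(tokens, weights):
    print(f"{name}: {w:.3f}")
```

The text query puts more weight on the patch whose vector points in a similar direction, purely because of the dot product. Whether a vector started life as a word or a patch never enters the computation.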

Beyond images and text

The story gets richer when you move beyond images.

Speech models. Audio is a continuous signal, which makes it less natural to tokenize than images. Two main strategies have emerged. One is to learn a discrete audio tokenizer (a vector-quantized model) that turns audio into a sequence of discrete codes the way BPE turns text into tokens, then use a standard autoregressive transformer over those codes. The other is to feed continuous audio features directly into a model with an architecture adapted to handle them. Both approaches have produced strong speech models in different settings.
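The first strategy, discrete audio tokenization, reduces at inference time to a nearest-neighbor lookup in a codebook. The sketch below assumes a tiny hand-written two-dimensional codebook and four made-up feature frames; in a real vector-quantized tokenizer the codebook is learned and the frames come from an audio encoder.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda i: sqdist(frame, codebook[i]))
        for frame in frames
    ]

def dequantize(codes, codebook):
    """Recover the (lossy) approximation of the original frames."""
    return [codebook[i] for i in codes]

# Hypothetical 4-entry codebook and 4 audio feature frames.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.95, 1.1]]

codes = quantize(frames, codebook)
print(codes)  # [0, 1, 2, 3]
```

The resulting integer sequence plays the same role for audio that BPE token ids play for text: an autoregressive transformer can be trained to predict the next code.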

Real-time grounded models. Increasingly, frontier multimodal systems are deployed in settings where they must consume streaming multimodal input: audio coming in continuously, video frames arriving at 30 Hz, possibly with the system generating output in real time as well. These systems push hard against the latency-throughput tradeoffs of the underlying serving stack (see the Systems deep dive) and introduce new architectural concerns. How do you do attention over a stream that has no fixed end? How do you generate output incrementally without waiting for the input to finish? Real-time multimodal is one of the most active engineering frontiers in 2026.
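One simple answer to the “no fixed end” question is to attend only over a bounded window of recent inputs, discarding the oldest as new ones arrive. The sketch below is one illustrative option, not a description of how any particular production system works; the class name `SlidingWindowCache` and the window size are assumptions.

```python
import math
from collections import deque

class SlidingWindowCache:
    """Keep only the most recent `window` key/value pairs and attend over those."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)    # oldest entries fall off automatically
        self.values = deque(maxlen=window)

    def step(self, query, key, value):
        """Ingest one new frame and immediately produce one output vector."""
        self.keys.append(key)
        self.values.append(value)
        d = len(query)
        scores = [sum(q * k for q, k in zip(query, kk)) / math.sqrt(d) for kk in self.keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        return [sum(w * v[i] for w, v in zip(weights, self.values)) for i in range(len(value))]

cache = SlidingWindowCache(window=4)
for t in range(10):                      # stand-in for an unbounded stream
    x = [math.sin(t), math.cos(t)]       # made-up per-frame feature
    out = cache.step(query=x, key=x, value=x)

print(len(cache.keys))  # 4: memory stays bounded no matter how long the stream runs
```

Each call to `step` produces output without waiting for later input, and per-step cost depends on the window size rather than on the length of the stream. The price is that anything older than the window is forgotten.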

Video and embodied models. Video adds a temporal dimension on top of the image structure. Models that consume video at scale typically use spatiotemporal patching (patches over both space and time) and various compressions that throw away redundant temporal information. The frontier here is moving fast and the architectural choices are not yet settled.
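Both ideas in this paragraph, patches that span time as well as space and compression that discards redundant frames, can be sketched in a few lines. The tiny 4-frame, 4×4 “video,” the tubelet shape, and the frame-difference threshold are all assumptions for illustration; they are not taken from any specific model.

```python
def tubelets(video, t, p):
    """Split a T x H x W video (list of frames, each a list of rows) into flattened t x p x p blocks."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    assert T % t == 0 and H % p == 0 and W % p == 0, "sketch assumes dimensions divide evenly"
    return [
        [video[f][r][c]
         for f in range(t0, t0 + t)
         for r in range(r0, r0 + p)
         for c in range(c0, c0 + p)]
        for t0 in range(0, T, t)
        for r0 in range(0, H, p)
        for c0 in range(0, W, p)
    ]

def drop_static_frames(video, threshold):
    """Keep a frame only if it differs enough (mean absolute difference) from the last kept frame."""
    kept = [video[0]]
    for frame in video[1:]:
        prev = kept[-1]
        diff = sum(abs(a - b) for ra, rb in zip(frame, prev) for a, b in zip(ra, rb))
        if diff / (len(frame) * len(frame[0])) > threshold:
            kept.append(frame)
    return kept

def solid(value):
    return [[value] * 4 for _ in range(4)]

# Two dark frames followed by two bright frames.
video = [solid(0.0), solid(0.0), solid(1.0), solid(1.0)]

blocks = tubelets(video, t=2, p=2)
print(len(blocks), len(blocks[0]))             # 8 tubelets, each covering 2 * 2 * 2 = 8 values
print(len(drop_static_frames(video, 0.1)))     # 2: the repeated frames are discarded
```

Spatiotemporal patching here turns 4 frames of 4 patches each (16 image patches) into 8 tokens, and frame dropping removes input that carries no new information before patching even starts.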

Omni-models. These are models that consume and produce any modality (text, image, audio, video) via a unified token space. The architectural commitment is the same as for VLMs: project everything into the embedding space and let the transformer treat it all as tokens. The training-distribution challenges (where do you get billions of aligned multimodal examples?) and the evaluation challenges (how do you measure quality across all combinations of input and output modalities?) are non-trivial.
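One common way to realize a unified token space, when each modality already has a discrete tokenizer, is to give each modality its own contiguous range of ids in a single shared vocabulary. The sketch below assumes that setup; the vocabulary sizes and function names are hypothetical, and models that feed continuous embeddings directly would not need this step on the input side.

```python
# Hypothetical per-modality vocabulary sizes.
VOCAB_SIZES = {"text": 50_000, "image": 8_192, "audio": 1_024}

def build_offsets(sizes):
    """Assign each modality a contiguous id range; return the offsets and the total size."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start

OFFSETS, TOTAL = build_offsets(VOCAB_SIZES)

def to_unified(modality, local_id):
    """Map a modality-local token id into the shared vocabulary."""
    if not 0 <= local_id < VOCAB_SIZES[modality]:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return OFFSETS[modality] + local_id

def from_unified(token_id):
    """Recover (modality, local_id), so a generated id can be routed to the right decoder."""
    for name, start in OFFSETS.items():
        if start <= token_id < start + VOCAB_SIZES[name]:
            return name, token_id - start
    raise ValueError(f"{token_id} is outside the unified vocabulary")

print(TOTAL)                    # 59216
print(to_unified("image", 5))   # 50005
print(from_unified(50005))      # ('image', 5)
```

With this layout a single output softmax over `TOTAL` ids can emit text, image, or audio tokens, and `from_unified` tells the surrounding system which decoder each generated id belongs to.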

What this chapter does not claim

Some honest caveats:

“Multimodal capability” varies enormously by task. A frontier VLM may be excellent at image captioning and bad at fine-grained spatial reasoning. The benchmarks do not capture this asymmetry well.
The most ambitious claims about multimodal models (that they have a unified world representation across modalities, that they reason about physical scenes, that their language outputs are grounded in their visual inputs) are partially true under specific conditions and partially overclaimed.
The science of multimodal models, in the sense the main spine cares about, is much less developed than the science of text-only LLMs. The phenomena are less well characterized, the model-organism studies less mature.

Where this fits in the broader picture

Multimodal models connect back to several other deep dives. Data-centric ML is load-bearing: multimodal training data is harder to collect and clean than pure text, and the quality of the data shapes the result. Interpretability in multimodal models is a research frontier, with new questions about how visual and linguistic concepts are jointly represented. The “Vision-Language Models Inherit Human Color Perception” result (Park et al., ICLR 2026 workshop), referenced in Chapter 11, is one example of the kind of phenomenology multimodal models invite.

Where to go next

Recent VLM papers and technical reports (frontier-lab system cards are useful).
Speech-model technical reports.
The growing literature on streaming and real-time multimodal systems.
Audio-model open-source releases (their architecture docs are reading-list-worthy).

If the main book convinced you the science of deep learning is about language models, this chapter is the reminder that “language models” is too narrow a frame for where the frontier already is.