Theoretical foundations of deep learning

The main spine of this book deliberately treats neural networks as objects of phenomenology, things to study experimentally, with model-organism methodology. The theoretical-foundations field takes the opposite vantage point: it asks what one can prove about networks as mathematical objects, and how to derive the qualitative behavior we observe from analyzable limits and structural arguments.

This is a serious technical field with its own toolkit, its own conferences, and its own intellectual culture. A short chapter cannot do justice to it. What follows is a map.

What the field studies

A non-exhaustive list of central questions:

Generalization: why do overparameterized networks generalize well even when they have enough capacity to fit the training data exactly? What controls the effective complexity of the function class realized by a trained network?
Optimization landscape: what is the geometry of the loss surface? Why does gradient descent find good minima? When and why does the implicit bias of SGD matter more than the explicit objective?
Representational capacity: what functions can a network of given width and depth realize, and how does depth interact with width?
Infinite-width limits: as width goes to infinity, a randomly initialized network behaves as a Gaussian process, and the neural tangent kernel (NTK) describes gradient-descent training as kernel regression with a kernel that stays fixed throughout training, which makes the dynamics analytically tractable. The infinite-width limit clarifies what initialization does, what feature learning requires (and why the NTK regime does not capture it), and how scaling affects representational dynamics.
High-dimensional geometry: capacity arguments inspired by statistical mechanics (neural manifold capacity, replica calculations, mean-field analyses) describe how representations partition high-dimensional space.
Sharpness, flatness, and implicit regularization: characterizations of the local geometry of solutions found by SGD, and the connection between flatness and generalization.
Lottery tickets and sparsity: why do dense networks contain sparse subnetworks that, trained in isolation from their original initialization, can match the accuracy of the full network? What does this say about overparameterization?
Sample complexity bounds: PAC-Bayes, Rademacher complexity, and other classical tools applied to deep networks, with their characteristic looseness on modern scales.
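
To make the infinite-width item above slightly more concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not a reference implementation: it assumes a two-layer ReLU network f(x) = a . relu(W x) / sqrt(width) with standard normal initialization, and the helper names `empirical_ntk` and `seed_spread` are invented for this example. It computes the empirical NTK at initialization (the Gram matrix of parameter gradients) and measures how much that kernel varies from one random initialization to the next. The variation should shrink as width grows, which is the sense in which the wide limit has a single well-defined kernel.

```python
import numpy as np

def empirical_ntk(X, width, seed):
    """Empirical NTK Gram matrix at random init for the assumed model
    f(x) = a . relu(W x) / sqrt(width), with a and W drawn from N(0, 1)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((width, d))    # hidden-layer weights
    a = rng.standard_normal(width)         # output weights
    pre = X @ W.T                          # (n, width) pre-activations
    act = np.maximum(pre, 0.0)             # relu(W x)
    gate = (pre > 0).astype(np.float64)    # relu derivative
    # df/da_j = relu(w_j . x) / sqrt(width)
    K_a = act @ act.T / width
    # df/dw_j = a_j * gate_j(x) * x / sqrt(width)
    G = gate * a
    K_W = (G @ G.T) * (X @ X.T) / width
    return K_a + K_W

def seed_spread(X, width, seeds=range(8)):
    """Relative seed-to-seed standard deviation of the kernel entries."""
    Ks = np.stack([empirical_ntk(X, width, s) for s in seeds])
    return np.linalg.norm(Ks.std(axis=0)) / np.linalg.norm(Ks.mean(axis=0))

# A few unit-norm inputs in three dimensions (arbitrary illustrative data).
X = np.random.default_rng(12345).standard_normal((4, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)

for width in (64, 1024, 16384):
    print(f"width={width:6d}  relative spread across seeds={seed_spread(X, width):.4f}")
```

The sketch only looks at the kernel at initialization. The stronger statement, that the kernel also stays approximately fixed during training at large width, would need a training loop and is not shown here.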

What kind of training the field requires

Most foundational work draws on linear algebra and functional analysis (the heavy machinery for capacity and representation), probability theory (for generalization bounds and concentration arguments), optimization theory (for landscape analysis), and increasingly statistical physics (for the high-dimensional and replica-theoretic results that have proved especially productive on neural networks). Anyone moving deep into the field benefits from a working fluency with random matrix theory, the geometry of high-dimensional probability, and information theory.

The cultural overlap with statistical physics is real and recent. Several of the most productive lines of work (capacity calculations, sharpness analyses, scaling-law derivations) were imported, or re-derived, by physicists who recognized neural networks as statistical-mechanical systems.

Where this field is, in 2026

The honest summary: the theoretical-foundations field has made real progress over the last decade. NTK has clarified what training dynamics look like in the wide limit; implicit regularization arguments have provided plausible accounts of why overparameterized networks generalize; capacity calculations are increasingly quantitative. The field is also still incomplete in places that matter for practitioners: there is no fully predictive theory of feature learning at finite width, no general theory of why scaling laws have the empirical exponents they do, and no settled account of emergence.

This is the natural counterpart course to the experimental approach taken in the main spine. The two are complementary, not opposed: phenomenology surfaces the puzzles, theory closes them. A reader who has gone through the experimentalist’s chapters will recognize many of the open questions theory is currently trying to settle.

Where to go next

Look for:

Graduate-level courses on “Mathematics of deep learning” or “Statistical learning theory”; these are the closest to a single-semester immersion.
Theory-focused workshops and conferences (e.g., COLT, the theory tracks at NeurIPS/ICML).
The classical statistical learning theory literature for background on PAC-Bayes, Rademacher complexity, and capacity bounds.
Recent reviews on NTK, infinite-width networks, and the neural-network/statistical-physics interface.

This chapter is a pointer. A real treatment is a course.