
The Learning Problem: Recommended Reading


This list extends The Learning Problem series with the formal theory that the essays summarize and apply. Entries marked ✦ are the works I would hand someone starting out; the rest is depth.

The sections mirror the series’ argument: Solomonoff’s optimal but incomputable starting point, then Bayesian inference as the tractable, principled framework, then the statistical learning theory that bounds what is actually learnable, then MDL as the compression-prediction connection, then online learning for the adversarial-sequence setting.

Universal Induction

The theoretical optimum, what it means, and why you cannot run it.

  • A Formal Theory of Inductive Inference by Solomonoff (1964) [paper] ✦. Parts I and II, Information and Control 7. The original Solomonoff induction paper: incomputable, but optimal under every reasonable Bayes loss. Part I, Part II.
  • Universal Artificial Intelligence by Hutter (2005) [book] ✦. AIXI: the formal universal-agent construction, with rigorous convergence and optimality analysis. Publisher.
  • An Introduction to Kolmogorov Complexity and Its Applications by Li, Vitányi (2019, 4th ed.) [book] ✦. The reference for algorithmic information theory, Kolmogorov complexity, incompressibility, and the Solomonoff prior.
  • A Theory of Universal Artificial Intelligence Based on Algorithmic Complexity by Hutter (2000) [paper]. Shorter AIXI derivation if you want the machinery without the full book.
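
For orientation, the object all four of these works study is the universal prior. A sketch in the usual notation (conventions vary slightly between Li–Vitányi and Hutter):

```latex
% Universal (Solomonoff) prior over finite binary strings.
% U is a fixed universal prefix machine; the sum ranges over programs p
% whose output begins with x; \ell(p) is the length of p in bits.
M(x) = \sum_{p \,:\, U(p) = x\ast} 2^{-\ell(p)}

% Prediction is conditioning: the probability that x continues with bit b.
M(b \mid x) = \frac{M(xb)}{M(x)}
```

The incomputability is visible directly: the sum ranges over all relevant halting programs, and deciding which programs those are is undecidable.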

Bayesian Foundations

The principled framework every practical learning algorithm approximates.

  • Probability Theory: The Logic of Science by Jaynes (2003) [book] ✦. Bayesian reasoning derived from desiderata, not assumed. The epistemological spine under every informal “all induction is Bayesian” claim.
  • Information Theory, Inference, and Learning Algorithms by MacKay (2003) [book] ✦. Bayes, information theory, and neural networks together. Free PDF from the author. Inference Group.
  • Bayesian Data Analysis by Gelman, Carlin, Stern, Dunson, Vehtari, Rubin (2013, 3rd ed.) [book]. The applied-modeling counterpart to Jaynes.
  • Pattern Recognition and Machine Learning by Bishop (2006) [book]. The graduate textbook that taught a generation the Bayesian view of ML. Now free from the publisher.
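
To make the “principled framework” concrete, here is a minimal sketch of exact Bayesian updating over a discrete hypothesis space; the hypotheses, prior, and data below are invented for illustration, not drawn from any of the books above.

```python
# Exact Bayes over a discrete hypothesis space: posterior ∝ prior × likelihood.
# Hypotheses are candidate coin biases; data is a sequence of 0/1 flips.
# All names and numbers here are illustrative.

def posterior(hypotheses, prior, data):
    """Return the normalized posterior over `hypotheses` given Bernoulli data."""
    unnorm = []
    for h, p in zip(hypotheses, prior):
        likelihood = 1.0
        for x in data:
            likelihood *= h if x == 1 else (1.0 - h)
        unnorm.append(p * likelihood)
    z = sum(unnorm)
    return [u / z for u in unnorm]

if __name__ == "__main__":
    hypotheses = [0.3, 0.5, 0.7]   # candidate values of P(heads)
    prior = [1 / 3, 1 / 3, 1 / 3]  # uniform prior
    data = [1, 1, 0, 1, 1]         # observed flips (1 = heads)
    print(posterior(hypotheses, prior, data))
```

Every entry in this section is, one way or another, about carrying out this update when the hypothesis space is too large to enumerate.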

Statistical Learning Theory

What is learnable from finite samples, under what assumptions, with what sample complexity.

  • A Theory of the Learnable by Valiant (1984) [paper] ✦. The PAC-learning framework. CACM 27(11).
  • The Nature of Statistical Learning Theory by Vapnik (2000, 2nd ed.) [book] ✦. VC dimension, structural risk minimization, the statistical foundations of SVMs.
  • Understanding Machine Learning: From Theory to Algorithms by Shalev-Shwartz, Ben-David (2014) [book]. Modern graduate text; best single source for PAC, VC, Rademacher, online learning. Free PDF from the authors.
  • Patterns, Predictions, and Actions by Hardt, Recht (2022) [book]. Short and modern; covers learning theory plus the fairness and causality machinery that older texts skip. Free online.
  • Foundations of Machine Learning by Mohri, Rostamizadeh, Talwalkar (2018, 2nd ed.) [book]. The standard reference for formal generalization bounds.
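
For a concrete sense of what “sample complexity” means here, the simplest PAC bound, for a finite hypothesis class in the realizable case (a standard statement, sketched; constants vary by presentation):

```latex
% Finite class H, realizable case: with m i.i.d. examples satisfying the
% bound below, every hypothesis consistent with the sample has true error
% at most \epsilon, with probability at least 1 - \delta over the sample.
m(\epsilon, \delta) \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```

For infinite classes the VC dimension plays the role of ln|H| (up to constants and log factors), which is the content of the Vapnik entry above.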

Minimum Description Length

Prediction as compression, compression as prediction. The practical cousin of Solomonoff.

  • Modeling by Shortest Data Description by Rissanen (1978) [paper] ✦. The original MDL paper. Automatica 14.
  • The Minimum Description Length Principle by Grünwald (2007) [book] ✦. The comprehensive modern treatment.
  • Stochastic Complexity in Statistical Inquiry by Rissanen (1989) [book]. Rissanen’s own development of the idea.
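
The crude two-part form of the principle, which Grünwald’s book refines into one-part and prequential codes, fits in one line (a sketch; L denotes code length in bits):

```latex
% Two-part MDL: pick the hypothesis minimizing total description length,
% the cost of the hypothesis plus the cost of the data encoded with its help.
H_{\mathrm{MDL}} = \arg\min_{H \in \mathcal{H}} \bigl[\, L(H) + L(D \mid H) \,\bigr]
```

Reading L(H) as a prior’s code length, -log P(H), makes the correspondence with Bayesian MAP estimation explicit, which is the compression-prediction connection the series leans on.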

Online Learning and Prediction

The adversarial-sequence setting: no i.i.d. assumption, just regret bounds against experts.

  • Prediction, Learning, and Games by Cesa-Bianchi, Lugosi (2006) [book] ✦. The reference for online learning, prediction with expert advice, and online convex optimization.
  • The Weighted Majority Algorithm by Littlestone, Warmuth (1994) [paper]. The foundational no-regret algorithm. Information and Computation 108.
  • Online Learning and Online Convex Optimization by Shalev-Shwartz (2011) [paper]. A focused survey of the regret framework and online convex optimization; a good entry point before Cesa-Bianchi and Lugosi.
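
A minimal sketch of the weighted majority idea from the Littlestone–Warmuth entry above: keep a weight per expert, predict with the weighted vote, and multiply down the weights of experts that err. Names and constants here are illustrative; the paper also analyzes a randomized variant with sharper bounds.

```python
# Deterministic weighted majority (in the spirit of Littlestone & Warmuth).
# experts_predictions[t][i] is expert i's 0/1 prediction at round t;
# outcomes[t] is the true 0/1 label. Everything here is illustrative.

def weighted_majority(experts_predictions, outcomes, beta=0.5):
    """Run weighted majority; return (mistakes, final_weights)."""
    n = len(experts_predictions[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, y in zip(experts_predictions, outcomes):
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_1 >= vote_0 else 0
        if guess != y:
            mistakes += 1
        # Multiply down the weight of every expert that was wrong this round.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return mistakes, weights

if __name__ == "__main__":
    preds = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]  # three rounds, three experts
    truth = [1, 0, 1]
    print(weighted_majority(preds, truth))
```

The resulting mistake bound scales with log N plus the best expert’s mistake count (up to constants); Cesa-Bianchi and Lugosi generalize this multiplicative-weights idea to regret bounds on the order of sqrt(T log N).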

How this list is opinionated

The thread: all learning is constrained approximate inference; the constraints come from computation, representation, and prior; and the theoretical limit (Solomonoff) is what every practical method approximates. The sections are the four main ways the field has made that approximation tractable: Bayesian computation, sample-complexity theory, compression, and adversarial regret bounds.

Excluded on purpose: deep-learning-specific empirical papers (the series is about the learning problem, not the learning algorithms of the month), most of the causality literature (a different problem), and optimization theory (adjacent but not the bottleneck).

If you read three things first, read Solomonoff 1964 Part I, Jaynes Chapters 1 and 2, and MacKay Chapter 28 (on Occam’s razor). Those three together motivate everything the series then explores.