
Sequential Prediction: Recommended Reading

This list extends the Sequential Prediction series with the formal theory behind predicting xₙ from x₁..xₙ₋₁. Entries marked ✦ are the works I would hand someone starting out; the rest is depth.

The sections mirror the series’ arc: information-theoretic foundations first (prediction equals compression), then the CTW line specifically, then universal prediction (the Solomonoff connection), then the practical data-compression side, then modern neural sequence models.

Information Theory

The framework that unifies compression, coding, and prediction.

  • A Mathematical Theory of Communication by Shannon (1948) [paper] ✦. The founding paper. Entropy, channel capacity, source coding. Bell System Technical Journal 27.
  • Elements of Information Theory by Cover, Thomas (2006, 2nd ed.) [book] ✦. The standard graduate text. Entropy, mutual information, channel coding, source coding.
  • Information Theory, Inference, and Learning Algorithms by MacKay (2003) [book] ✦. Free PDF from the author; pairs information theory with Bayesian ML.
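
One identity captures the prediction-compression equivalence these books develop. If symbols come from a source p and a predictor assigns them probabilities q, the expected per-symbol cost of an ideal code built from q is

    \mathbb{E}_{x \sim p}\!\left[-\log_2 q(x)\right] \;=\; H(p) + D_{\mathrm{KL}}(p \,\|\, q),
    \qquad H(p) = -\sum_x p(x)\,\log_2 p(x),

so improving the prediction (shrinking the KL term) and shortening the code are the same act, and the source entropy H(p) is the floor neither can beat.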

Context-Tree Weighting

The specific algorithm the series focuses on, and its immediate lineage.

  • The Context-Tree Weighting Method: Basic Properties by Willems, Shtarkov, Tjalkens (1995) [paper] ✦. The CTW paper. Exact Bayesian model averaging over tree sources, provable redundancy bounds; a minimal sketch of the mixing rule follows this list. IEEE Trans. Information Theory 41(3).
  • The Context-Tree Weighting Method: Extensions by Willems (1998) [paper]. Beta versions of the estimator, longer contexts. IEEE Trans. Information Theory 44.
  • The Performance of Universal Source Coding Methods by Rissanen (1984) [paper]. Establishes the redundancy lower bounds that CTW provably matches.
  • The Minimum Description Length Principle by Grünwald (2007) [book]. CTW is an MDL method at heart.
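
The sketch promised above: minimal binary CTW with a Krichevsky-Trofimov (KT) estimator at every node and the paper's half-and-half mixture P_w = 1/2 P_e + 1/2 P_w(child 0) P_w(child 1), computed in log space. It assumes a binary alphabet and a fixed maximum depth; the names (CTW, Node, update) are mine, and a production version would add context padding and more numerical care.

    import math

    def logaddexp(x, y):
        # log(exp(x) + exp(y)) without overflow
        m = max(x, y)
        return m + math.log(math.exp(x - m) + math.exp(y - m))

    class Node:
        def __init__(self):
            self.counts = [0, 0]   # zeros and ones observed with this context
            self.log_pe = 0.0      # log KT-estimator probability of the data seen here
            self.log_pw = 0.0      # log weighted (mixed) probability
            self.child = [None, None]

    class CTW:
        """Binary context-tree weighting with a fixed maximum depth."""
        def __init__(self, depth):
            self.depth = depth
            self.root = Node()

        def _update(self, node, context, bit, d):
            # KT sequential rule: P(bit) = (counts[bit] + 1/2) / (n + 1)
            n = node.counts[0] + node.counts[1]
            node.log_pe += math.log((node.counts[bit] + 0.5) / (n + 1.0))
            node.counts[bit] += 1
            if d == self.depth:
                node.log_pw = node.log_pe          # leaf: nothing deeper to mix in
            else:
                c = context[d]                     # older context bits select deeper children
                if node.child[c] is None:
                    node.child[c] = Node()
                self._update(node.child[c], context, bit, d + 1)
                lw = sum(ch.log_pw for ch in node.child if ch is not None)
                # P_w = 1/2 * P_e + 1/2 * P_w(child 0) * P_w(child 1)
                node.log_pw = math.log(0.5) + logaddexp(node.log_pe, lw)

        def update(self, context, bit):
            """context: the previous `depth` bits, newest first. Returns P(bit | context)."""
            before = self.root.log_pw
            self._update(self.root, context, bit, 0)
            return math.exp(self.root.log_pw - before)

Feed a bit stream through update (padding the earliest contexts with zeros) and it returns the probability the mixture assigned to each observed bit; the sum of -log₂ of those probabilities is the code length an arithmetic coder driven by this model would produce, which is exactly the quantity the redundancy bounds control.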

Universal Prediction

The optimal-but-incomputable ideal, and its computable approximations.

  • A Formal Theory of Inductive Inference by Solomonoff (1964) [paper] ✦. Solomonoff induction as the universal predictor; the defining mixture is written out after this list. Parts I and II.
  • Universal Artificial Intelligence by Hutter (2005) [book] ✦. AIXI: Solomonoff induction plus utility maximization.
  • An Introduction to Kolmogorov Complexity and Its Applications by Li, Vitányi (2019, 4th ed.) [book]. The algorithmic-information-theory reference.
  • Optimal Sequential Prediction in Unknown Sparse Environments by Vovk (various) [paper]. Vovk’s prediction-with-experts line generalizes CTW-style regret bounds.
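
Written out, the universal predictor these works study is a mixture over all programs for a universal prefix machine U (notation varies slightly across the references above):

    M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|},
    \qquad M(x_{n+1} \mid x_{1:n}) \;=\; \frac{M(x_{1:n}\, x_{n+1})}{M(x_{1:n})},

where U(p) = x* means program p makes U output a string beginning with x. M dominates every computable predictor but is itself incomputable; CTW can be read as the same construction with the sum over all programs replaced by a sum over bounded-depth tree sources, which buys computability at the price of a narrower prior.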

Data Compression

The practical side of the prediction-compression connection.

  • A Universal Algorithm for Sequential Data Compression by Ziv, Lempel (1977) [paper] ✦. LZ77, the ancestor of most modern general-purpose compressors. IEEE Trans. Information Theory 23(3).
  • Compression of Individual Sequences via Variable-Rate Coding by Ziv, Lempel (1978) [paper]. LZ78. IEEE Trans. Information Theory 24(5).
  • Arithmetic Coding for Data Compression by Witten, Neal, Cleary (1987) [paper]. The practical implementation of arithmetic coding, the bridge from probabilistic prediction to bits on disk. CACM 30(6).
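
To show how that last entry closes the loop from probabilities to bits, here is a toy float-precision arithmetic coder (illustration only; real coders, including Witten-Neal-Cleary's, use integer arithmetic with renormalization, and the function name and predict callback here are mine):

    import math

    def arithmetic_code_length(bits, predict):
        """Narrow [0, 1) once per symbol; `predict(history)` returns P(next bit = 1)."""
        low, high = 0.0, 1.0
        log_loss = 0.0
        for i, b in enumerate(bits):
            p1 = predict(bits[:i])
            split = low + (high - low) * (1.0 - p1)   # [low, split) codes a 0, [split, high) a 1
            low, high = (split, high) if b == 1 else (low, split)
            log_loss += -math.log2(p1 if b == 1 else 1.0 - p1)
        # The final width is the product of the predicted probabilities, so pinning down
        # the interval costs about -log2(high - low) = log_loss bits: the predictor's
        # log loss is the compressed size.
        return high - low, log_loss

    # e.g. a fixed Bernoulli(0.9) predictor on a mostly-ones stream:
    width, nbits = arithmetic_code_length([1, 1, 1, 0, 1, 1, 1, 1], lambda history: 0.9)

Swap the lambda for any sequential model, the CTW sketch above included, and the returned log loss is, to within a couple of bits, what an exact coder would write to disk.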

Neural Sequence Modeling

The modern counterpart: expressive models, less theory, sometimes better in practice.

  • A Neural Probabilistic Language Model by Bengio, Ducharme, Vincent, Jauvin (2003) [paper] ✦. The paper generally credited with introducing neural language models. JMLR 3.
  • Attention Is All You Need by Vaswani, Shazeer, Parmar, et al. (2017) [paper] ✦. The transformer paper. arXiv.
  • Language Models are Few-Shot Learners by Brown, Mann, Ryder, et al. (2020) [paper]. GPT-3. arXiv.
  • Scaling Laws for Neural Language Models by Kaplan, McCandlish, Henighan, et al. (2020) [paper]. Empirical study of loss as a function of compute, parameters, and data; the power-law form is sketched after this list. arXiv.
  • An Image is Worth 16x16 Words by Dosovitskiy, Beyer, Kolesnikov, et al. (2021) [paper]. ViT: the transformer generalized beyond text. Useful for seeing “sequential prediction” more broadly. arXiv.
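
The power-law form mentioned above, schematically: Kaplan et al. find that test loss, the same per-token log loss that runs through this whole list, falls as a power law in parameters N, data D, and compute C whenever the other two are not the bottleneck,

    L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad
    L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad
    L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C},

with the fitted constants and exponents given in the paper. Only the shape matters here: exponents well below one are why each further improvement costs roughly an order of magnitude more scale.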

How this list is opinionated

The thread: prediction and compression are the same problem, the theoretical limit is universal (Solomonoff), practical methods approximate it with specific priors, and the most successful modern approximations are neural models that trade guarantees for expressiveness. Works that illuminate that thread are in.

Excluded on purpose: most of the classical statistical time-series literature (different assumptions), the signal-processing-oriented prediction literature (not the sequential-decision framing the series uses), and deep-learning-specific tuning papers (the series is about the problem, not the hyperparameters).

If you read three things first, read Shannon 1948, Willems et al. 1995 CTW, and the “Attention Is All You Need” paper. The arc from Shannon to CTW to transformers is the whole story compressed.