Introduction to Sequential Prediction
The fundamental problem of predicting what comes next—from compression to language models
From Solomonoff induction to modern language models—exploring sequential prediction through information theory, Bayesian methods, and neural approaches
What comes next?
This question drives one of the most fundamental problems in machine learning and information theory. Given a sequence x₁, x₂, …, xₙ₋₁, how do we predict xₙ? The answer connects Shannon’s information theory to Bayesian inference to the transformer architectures powering modern AI.
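Concretely, a predictor assigns a probability to each candidate value of xₙ given the prefix, and is scored by how much probability it placed on the symbol that actually occurred. The average log-loss in bits is also the length of the code the predictor would need to compress the sequence, which is where Shannon's information theory enters. Here is a minimal illustration; the log_loss and uniform names are ours and are not part of the code accompanying the series:

import math

def log_loss(predict, sequence):
    """Average log-loss, in bits per symbol, of a sequential predictor.

    predict(prefix) returns a dict mapping each possible next symbol
    to the probability the predictor assigns to it.
    """
    total = 0.0
    for n, symbol in enumerate(sequence):
        probs = predict(sequence[:n])        # predict x_n from x_1, ..., x_(n-1)
        total -= math.log2(probs[symbol])    # penalty for the symbol that occurred
    return total / len(sequence)

# A predictor that ignores the prefix and guesses 50/50 on binary symbols.
uniform = lambda prefix: {0: 0.5, 1: 0.5}
print(log_loss(uniform, [0, 1, 1, 0, 1]))    # 1.0 bit per symbol

A good predictor drives this number below one bit per symbol by exploiting structure in the sequence.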
This series traces the evolution of sequential prediction from its theoretical foundations to practical implementations:
Foundations (2010-2016): We begin with the formal problem definition and Solomonoff induction—the theoretical ideal that provides the gold standard for prediction. From there, we develop the Bayesian framework that underlies all principled approaches, explore the data generating processes that create sequences, and examine classical n-gram models.
Modern Perspectives (2024-2025): The later posts bring these classical ideas into conversation with modern neural language models. We compare CTW’s sample efficiency with transformers’ flexibility, and conclude with experimental validation of Context Tree Weighting on structured binary sequences.
This series is accompanied by working code:
from experiments.ctw import CTW, generate_sequence
# Generate from a 3rd-order Markov process
seq = generate_sequence(10000, seed=42)
# CTW predictor with depth 3
ctw = CTW(max_depth=3)
accuracy = ctw.accuracy(seq)
print(f"Accuracy: {accuracy:.1%}")
Sequential prediction isn’t just an academic exercise. It’s the core problem underlying data compression, language modeling, and the next-token prediction that powers modern AI systems.
Understanding the foundations—from Kolmogorov complexity to Bayesian model averaging—provides deep insight into why modern AI systems work and where their guarantees come from.
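The model-averaging idea behind those guarantees is simple to state: keep a set of candidate hypotheses, reweight each one by how well it has predicted the data so far, and forecast with the posterior-weighted mixture. The toy below is ours, purely for illustration, with two hand-written hypotheses over binary sequences; it is not code from the series.

# Toy Bayesian model averaging over two hand-written hypotheses (illustrative
# only; these names are ours and not part of the series' code).

def bma_predict(hypotheses, prior, prefix):
    """Posterior-weighted probability that the next bit is 1."""
    # Likelihood of the observed prefix under each hypothesis.
    likelihoods = []
    for h in hypotheses:
        like = 1.0
        for i, bit in enumerate(prefix):
            p1 = h(prefix[:i])
            like *= p1 if bit == 1 else 1.0 - p1
        likelihoods.append(like)
    evidence = sum(w * l for w, l in zip(prior, likelihoods))
    posterior = [w * l / evidence for w, l in zip(prior, likelihoods)]
    # Mixture forecast: each hypothesis votes with its posterior weight.
    return sum(w * h(prefix) for w, h in zip(posterior, hypotheses))

biased_coin = lambda prefix: 0.9   # hypothesis 1: heavily favors 1s
fair_coin = lambda prefix: 0.5     # hypothesis 2: fair coin
print(bma_predict([biased_coin, fair_coin], [0.5, 0.5], [1, 1, 1, 1, 0, 1]))

After a prefix dominated by ones, the biased hypothesis carries most of the posterior weight, so the mixture's forecast moves toward 0.9. This automatic reweighting is the same mechanism CTW applies, efficiently, to an entire family of tree sources.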
A particular focus of this series is Context Tree Weighting (Willems et al., 1995), an algorithm that performs exact Bayesian model averaging over every tree-source model up to a chosen depth, runs in time linear in the sequence length, and comes with provable redundancy bounds.
Our experiments confirm the theory: when CTW’s depth matches or exceeds the true order of a Markov source, it achieves the theoretical maximum accuracy—the Bayes-optimal prediction rate.
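To give a flavor of how the algorithm works: each context node keeps its own Krichevsky-Trofimov (KT) estimate, and every internal node mixes that estimate, with weight 1/2, against the product of its two children's weighted probabilities. The recursive sketch below is ours and is for exposition only; the function names are made up, it recomputes everything from scratch, and it is exponential in the depth, whereas a practical implementation updates a context tree incrementally.

# Illustrative Context Tree Weighting sketch; the function names are ours for
# exposition and this is not the experiments.ctw implementation. It scores the
# symbols after the first max_depth positions, which serve as initial context.

def kt_block_prob(bits):
    """Krichevsky-Trofimov estimate of the probability of a bit string."""
    p, zeros, ones = 1.0, 0, 0
    for b in bits:
        p *= ((ones if b else zeros) + 0.5) / (zeros + ones + 1)
        zeros += (b == 0)
        ones += (b == 1)
    return p

def ctw_prob(seq, context, max_depth):
    """Weighted block probability contributed by one context node."""
    # Symbols whose immediately preceding symbols match this node's context.
    bits = [seq[i] for i in range(max_depth, len(seq))
            if seq[i - len(context):i] == context]
    pe = kt_block_prob(bits)   # memoryless KT estimate at this node
    if len(context) == max_depth:
        return pe              # leaf: no deeper contexts to mix in
    # Internal node: mix the local estimate, with weight 1/2, against the
    # product of the children's weighted probabilities (one more past symbol).
    p0 = ctw_prob(seq, [0] + context, max_depth)
    p1 = ctw_prob(seq, [1] + context, max_depth)
    return 0.5 * pe + 0.5 * p0 * p1

seq = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
print(ctw_prob(seq, [], max_depth=3))   # CTW mixture probability of seq[3:]

A per-symbol prediction falls out as a ratio of two such block probabilities, one with a candidate next bit appended; the incremental algorithm obtains the same numbers with O(max_depth) work per symbol by updating counts along a single context path.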
The posts in this series:
1. The fundamental problem of predicting what comes next—from compression to language models
2. Why the optimal predictor is incomputable—and what we can learn from it anyway
3. Model averaging over hypotheses—the principled approach to handling uncertainty in prediction
4. Markov processes and tree sources—understanding where sequences come from
5. The classical approach to sequence prediction—counting and smoothing
6. Trade-offs between sample efficiency, scalability, and theoretical guarantees
7. The evolution of neural sequence prediction—and how it connects to classical methods
8. Validating Context Tree Weighting through experiments—including a bug that changed everything