Introduction to Sequential Prediction
The fundamental problem of predicting what comes next—from compression to language models
From Solomonoff induction to modern language models—exploring sequential prediction through information theory, Bayesian methods, and neural approaches
What comes next?
This question drives one of the most fundamental problems in machine learning and information theory. Given a sequence x₁, x₂, …, xₙ₋₁, how do we predict xₙ? The answer connects Shannon’s information theory to Bayesian inference to the transformer architectures powering modern AI.
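Concretely, a predictor assigns a probability to each candidate value of xₙ given the prefix, and is scored by how much probability it placed on the symbol that actually occurred. The average log-loss in bits is also the length of the code the predictor would need to compress the sequence, which is where Shannon's information theory enters. Here is a minimal illustration; the log_loss and uniform names are ours and are not part of the code accompanying the series:

import math

def log_loss(predict, sequence):
    """Average log-loss, in bits per symbol, of a sequential predictor.

    predict(prefix) returns a dict mapping each possible next symbol
    to the probability the predictor assigns to it.
    """
    total = 0.0
    for n, symbol in enumerate(sequence):
        probs = predict(sequence[:n])        # predict x_n from x_1, ..., x_(n-1)
        total -= math.log2(probs[symbol])    # penalty for the symbol that occurred
    return total / len(sequence)

# A predictor that ignores the prefix and guesses 50/50 on binary symbols.
uniform = lambda prefix: {0: 0.5, 1: 0.5}
print(log_loss(uniform, [0, 1, 1, 0, 1]))    # 1.0 bit per symbol

A good predictor drives this number below one bit per symbol by exploiting structure in the sequence.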
This series traces the evolution of sequential prediction from its theoretical foundations to practical implementations:
Foundations (2010-2016): We begin with the formal problem definition and Solomonoff induction—the theoretical ideal that provides the gold standard for prediction. From there, we develop the Bayesian framework that underlies all principled approaches, explore the data generating processes that create sequences, and examine classical n-gram models.
Modern Perspectives (2024-2025): The later posts bring these classical ideas into conversation with modern neural language models. We compare CTW’s sample efficiency with transformers’ flexibility, and conclude with experimental validation of Context Tree Weighting on structured binary sequences.
This series is accompanied by working code:
from experiments.ctw import CTW, generate_sequence
# Generate from a 3rd-order Markov process
seq = generate_sequence(10000, seed=42)
# CTW predictor with depth 3
ctw = CTW(max_depth=3)
accuracy = ctw.accuracy(seq)
print(f"Accuracy: {accuracy:.1%}")
Sequential prediction isn’t just an academic exercise. It’s the core problem underlying data compression, language modeling, and the next-token prediction that powers modern AI systems.
Understanding the foundations—from Kolmogorov complexity to Bayesian model averaging—provides deep insight into why modern AI systems work and where their guarantees come from.
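The model-averaging idea behind those guarantees is simple to state: keep a set of candidate hypotheses, reweight each one by how well it has predicted the data so far, and forecast with the posterior-weighted mixture. The toy below is ours, purely for illustration, with two hand-written hypotheses over binary sequences; it is not code from the series.

# Toy Bayesian model averaging over two hand-written hypotheses (illustrative
# only; these names are ours and not part of the series' code).

def bma_predict(hypotheses, prior, prefix):
    """Posterior-weighted probability that the next bit is 1."""
    # Likelihood of the observed prefix under each hypothesis.
    likelihoods = []
    for h in hypotheses:
        like = 1.0
        for i, bit in enumerate(prefix):
            p1 = h(prefix[:i])
            like *= p1 if bit == 1 else 1.0 - p1
        likelihoods.append(like)
    evidence = sum(w * l for w, l in zip(prior, likelihoods))
    posterior = [w * l / evidence for w, l in zip(prior, likelihoods)]
    # Mixture forecast: each hypothesis votes with its posterior weight.
    return sum(w * h(prefix) for w, h in zip(posterior, hypotheses))

biased_coin = lambda prefix: 0.9   # hypothesis 1: heavily favors 1s
fair_coin = lambda prefix: 0.5     # hypothesis 2: fair coin
print(bma_predict([biased_coin, fair_coin], [0.5, 0.5], [1, 1, 1, 1, 0, 1]))

After a prefix dominated by ones, the biased hypothesis carries most of the posterior weight, so the mixture's forecast moves toward 0.9. This automatic reweighting is the same mechanism CTW applies, efficiently, to an entire family of tree sources.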
A particular focus of this series is Context Tree Weighting (Willems et al., 1995), an algorithm that performs exact Bayesian model averaging over every tree-source model up to a chosen depth, runs in time linear in the sequence length, and comes with provable redundancy bounds.
Our experiments confirm the theory: when CTW’s depth matches or exceeds the true order of a Markov source, it achieves the theoretical maximum accuracy—the Bayes-optimal prediction rate.
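To give a flavor of how the algorithm works: each context node keeps its own Krichevsky-Trofimov (KT) estimate, and every internal node mixes that estimate, with weight 1/2, against the product of its two children's weighted probabilities. The recursive sketch below is ours and is for exposition only; the function names are made up, it recomputes everything from scratch, and it is exponential in the depth, whereas a practical implementation updates a context tree incrementally.

# Illustrative Context Tree Weighting sketch; the function names are ours for
# exposition and this is not the experiments.ctw implementation. It scores the
# symbols after the first max_depth positions, which serve as initial context.

def kt_block_prob(bits):
    """Krichevsky-Trofimov estimate of the probability of a bit string."""
    p, zeros, ones = 1.0, 0, 0
    for b in bits:
        p *= ((ones if b else zeros) + 0.5) / (zeros + ones + 1)
        zeros += (b == 0)
        ones += (b == 1)
    return p

def ctw_prob(seq, context, max_depth):
    """Weighted block probability contributed by one context node."""
    # Symbols whose immediately preceding symbols match this node's context.
    bits = [seq[i] for i in range(max_depth, len(seq))
            if seq[i - len(context):i] == context]
    pe = kt_block_prob(bits)   # memoryless KT estimate at this node
    if len(context) == max_depth:
        return pe              # leaf: no deeper contexts to mix in
    # Internal node: mix the local estimate, with weight 1/2, against the
    # product of the children's weighted probabilities (one more past symbol).
    p0 = ctw_prob(seq, [0] + context, max_depth)
    p1 = ctw_prob(seq, [1] + context, max_depth)
    return 0.5 * pe + 0.5 * p0 * p1

seq = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
print(ctw_prob(seq, [], max_depth=3))   # CTW mixture probability of seq[3:]

A per-symbol prediction falls out as a ratio of two such block probabilities, one with a candidate next bit appended; the incremental algorithm obtains the same numbers with O(max_depth) work per symbol by updating counts along a single context path.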
The posts in this series:
1. The fundamental problem of predicting what comes next—from compression to language models
2. Why the optimal predictor is incomputable—and what we can learn from it anyway
3. Model averaging over hypotheses—the principled approach to handling uncertainty in prediction
4. Markov processes and tree sources—understanding where sequences come from
5. The classical approach to sequence prediction—counting and smoothing
6. Trade-offs between sample efficiency, scalability, and theoretical guarantees
7. The evolution of neural sequence prediction—and how it connects to classical methods
8. Validating Context Tree Weighting through experiments—including a bug that changed everything