The sequential-prediction series is now a playlist. Eight episodes, animated and narrated, built from the same posts.
The series takes one question seriously: given everything you have seen so far, what comes next? It sounds almost too simple to carry any weight, and it turns out to be the foundation the whole language-model era stands on. A model that predicts the next token well enough, over a large enough stream, has had to learn grammar, facts, reasoning, and style along the way, because all of those are things you need in order to guess the next token and be right. Prediction is not a warm-up for the interesting behavior. Prediction is where the interesting behavior comes from.
The arc runs from the ideal to the practical. It starts with Solomonoff induction, the theoretically optimal predictor that weighs every hypothesis by its simplicity and is also uncomputable, so it tells you what perfect prediction would look like without ever letting you run it. From there the series works downward toward things you can actually build: the approximations, the counting models, the moves that trade a little optimality for the ability to compute at all, and finally the transformer, which is what happens when you throw enough scale and the right architecture at the same next-token objective. The endpoint is not a mystery once you have walked the path. It is the computable shadow of the ideal you started with.
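To make "weighs every hypothesis by its simplicity" concrete, here is a toy sketch. It is not Solomonoff induction, which ranges over all programs and cannot be run. It assumes a tiny, hand-picked hypothesis class instead: "the sequence is some binary pattern of at most `max_len` bits, repeated forever." The names `hypotheses` and `predict_next` and the particular prior `2^-(2n)` are illustrative choices for this sketch, not anything taken from the series or the book.

```python
from itertools import product


def hypotheses(max_len):
    """Yield (pattern, prior_weight) for every binary pattern of 1..max_len bits."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            # Assumed prior: each pattern of length n gets weight 2^-(2n),
            # so shorter (simpler) patterns count exponentially more.
            yield "".join(bits), 2.0 ** (-2 * n)


def predict_next(observed, max_len=6):
    """Return P(next bit == '1') under the simplicity-weighted mixture."""
    mass = {"0": 0.0, "1": 0.0}
    for pattern, weight in hypotheses(max_len):
        # Hypothesis: the sequence is this pattern repeated forever.
        # Generate enough of it to cover the observations plus one more bit.
        generated = pattern * (len(observed) // len(pattern) + 2)
        if generated.startswith(observed):
            # The hypothesis survives; it votes for its next bit with its prior weight.
            mass[generated[len(observed)]] += weight
    total = mass["0"] + mass["1"]
    if total == 0.0:
        return 0.5  # no hypothesis fits the data; fall back to a coin flip
    return mass["1"] / total


if __name__ == "__main__":
    print(predict_next(""))        # nothing seen yet
    print(predict_next("010101"))  # every surviving pattern continues with '0'
    print(predict_next("0110"))    # the short pattern '011' outweighs longer rivals
```

The shape is the same as the ideal's: keep every hypothesis consistent with what you have seen, weight each by how simple it is, and let them vote on the next symbol. What makes this version runnable is exactly what makes it weak, namely the small, fixed hypothesis class.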
This series is the companion to my book, On Intelligence and Its Specifications. The book is the long, careful version of the argument: what optimal prediction is, why we can only approximate it, and what that gap means for the systems we are building now. The videos are the fast, visual on-ramp to the same ideas. If the playlist leaves you wanting the full treatment, the book is where it lives.
The playlist
Open the playlist on YouTube (8 episodes)
Episode list
- Why Predict the Next Symbol?
- Introduction to Sequential Prediction
- Solomonoff Induction: The Incomputable Ideal
- The Bayesian Prediction Framework
- N-gram Language Models: Counting and Smoothing
- Context Tree Weighting: Theory Meets Practice
- Neural Language Models: From RNNs to Transformers
- CTW vs. N-grams vs. Neural Language Models