CTW Experimental Results: Theory Meets Practice

Throughout this series, I’ve developed the theory of sequential prediction, from Solomonoff’s incomputable ideal to CTW’s practical Bayesian approach. Now we validate theory with experiments.

The punchline: CTW achieves Bayes-optimal accuracy when its depth matches or exceeds the true Markov order. But I didn’t get there cleanly. A bug in my evaluation code initially suggested the opposite, and I nearly published the wrong conclusion.

The Experimental Setup

Data Generating Process

I use sparse Markov processes, in which predictability varies sharply across contexts, from near-certain to purely random:

3rd-order Markov process:
Context  P(next=1)  Description
-------  ---------  ------------------
000      0.99       High certainty -> 1
100      0.99       High certainty -> 1
010      0.025      High certainty -> 0
101      0.025      High certainty -> 0
011      0.25       Moderate -> 0
110      0.25       Moderate -> 0
001      0.5        Uniform (random)
111      0.5        Uniform (random)

This process has 4 highly predictable contexts (about 99% or 97.5% certainty), 2 moderately predictable contexts (about 75% certainty), and 2 completely random contexts (50/50).
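The process above can be sampled with a short helper. This is a sketch, and `sample_sparse_markov` is my name for it, not necessarily the repository's:

```python
import random

# P(next bit = 1) for each 3-bit context, matching the table above.
TRANSITIONS = {
    "000": 0.99, "100": 0.99,
    "010": 0.025, "101": 0.025,
    "011": 0.25, "110": 0.25,
    "001": 0.5, "111": 0.5,
}

def sample_sparse_markov(length, seed=0):
    """Sample a binary sequence from the 3rd-order sparse Markov process."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(3)]  # random initial context
    for _ in range(length - 3):
        ctx = "".join(map(str, bits[-3:]))
        bits.append(1 if rng.random() < TRANSITIONS[ctx] else 0)
    return bits
```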

Theoretical Maximum Accuracy

An oracle predictor that knows the true transition probabilities achieves:

def theoretical_max(transition_probs, stationary_dist):
    """Predict the more likely outcome for each context."""
    return sum(
        stationary_dist[ctx] * max(p, 1-p)
        for ctx, p in transition_probs.items()
    )

For our process, this is approximately 95%. That’s the ceiling for any predictor.
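When the stationary distribution isn't known analytically, it can be estimated empirically from a long sample. Here's a sketch (the function name is mine, not the repo's), plus a toy check of the oracle formula on a 1st-order process where the answer is easy to verify by hand:

```python
from collections import Counter

def empirical_stationary(bits, order=3):
    """Estimate the stationary distribution over length-`order` contexts
    by counting how often each context occurs in a long sample."""
    ctxs = ["".join(map(str, bits[i:i + order]))
            for i in range(len(bits) - order)]
    counts = Counter(ctxs)
    total = sum(counts.values())
    return {ctx: n / total for ctx, n in counts.items()}

# Toy 1st-order example: P(1|0) = 0.9, P(1|1) = 0.2, uniform stationary.
probs = {"0": 0.9, "1": 0.2}
dist = {"0": 0.5, "1": 0.5}
oracle = sum(dist[c] * max(p, 1 - p) for c, p in probs.items())
# oracle = 0.5 * 0.9 + 0.5 * 0.8 = 0.85
```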

Configuration

  • Training length: 50,000 bits
  • Test length: 10,000 bits
  • Seeds: 5 random seeds (averaged)
  • DGP orders: 1, 2, 3, 4, 5, 6
  • CTW depths: 1, 2, 3, 4, 5, 6, 8, 10

The Bug

Initial (Incorrect) Results

An early draft of this work reported what looked like a surprising finding:

Max Depth  Accuracy (%)
---------  ------------
1          87.2
2          96.8 (peak)
3          93.8
4          89.4
8          76.2
12         66.4

This suggested that accuracy decreased when CTW depth exceeded the true order. I wrote up hypotheses about data sparsity degrading KT estimates at deep contexts, overfitting to rare deep patterns, and Bayesian dilution as the mixture flattened.

The Problem

These results were wrong. A bug in the evaluation code caused accuracy to degrade artificially at higher depths.

The Fix

After careful debugging, I corrected the context indexing in the test harness, fixed how predictions were accumulated across depths, and verified KT estimator updates were correctly propagated.

Corrected Results

With the bug fixed, the true behavior emerged:

DGP Order  Theo Max  D=1  D=2  D=3  D=4  D=5  D=6
---------  --------  ---  ---  ---  ---  ---  ---
1          ~95%      Y    Y    Y    Y    Y    Y
2          ~95%      N    Y    Y    Y    Y    Y
3          ~95%      N    N    Y    Y    Y    Y
4          ~95%      N    N    N    Y    Y    Y
5          ~95%      N    N    N    N    Y    Y
6          ~95%      N    N    N    N    N    Y

Y = Achieves theoretical maximum. N = Below theoretical maximum.

Key Finding: No Overfitting Penalty

When CTW depth is greater than or equal to the DGP order, CTW achieves Bayes-optimal accuracy.

The corrected experiments confirm three things:

  1. Depth < Order: Accuracy is limited because CTW can’t represent the true model.
  2. Depth = Order: Achieves theoretical maximum.
  3. Depth > Order: Also achieves theoretical maximum. No degradation.

The third point is the important one. Increasing depth beyond the true order does not hurt performance.

Why This Makes Sense

CTW’s Bayesian model averaging naturally handles excessive depth.

The prior penalizes depth. At each node, CTW assigns 0.5 weight to “stop here” (use local KT estimate) and 0.5 weight to “go deeper” (use children). This compounds: a depth-D path has prior weight roughly 0.5^D. Deep trees are automatically downweighted.
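The geometric penalty is easy to see numerically. Along a single context path, a node stops with probability 0.5 or continues with probability 0.5, so the prior mass on "use depth exactly d" is 0.5^(d+1):

```python
# Prior mass on stopping at each depth along one context path.
prior = [0.5 ** (d + 1) for d in range(10)]
# prior = [0.5, 0.25, 0.125, ...]: deeper models get geometrically
# less prior mass, and the total sums toward 1 as depth grows.
```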

Sparse data at depth is handled. When deep contexts have few samples, the KT estimator gives high uncertainty, the Bayesian mixture shifts weight toward shallower estimates, and predictions remain calibrated.
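The KT estimator's behavior on sparse data follows directly from its add-half formula, P(next=1) = (ones + 1/2) / (n + 1). A minimal sketch:

```python
def kt_prob_one(zeros, ones):
    """KT (Krichevsky-Trofimov) estimate of P(next = 1) from counts."""
    return (ones + 0.5) / (zeros + ones + 1.0)

# With no data the estimate is exactly uniform, and with only a few
# samples it stays close to uniform, so rarely visited deep contexts
# never make overconfident claims.
kt_prob_one(0, 0)   # 0.5
kt_prob_one(0, 3)   # 0.875
```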

Model averaging is robust. Even with depth set too high, the true model is in the mixture, the posterior concentrates on the right tree structure, and excess complexity is averaged away rather than memorized.
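The three mechanisms above all live in one recursion. This minimal node is a sketch of the idea, not the repository's implementation: each node keeps a KT estimate of the bits seen in its context and mixes it 0.5/0.5 with the product of its children's weighted probabilities.

```python
class CTWNode:
    """Minimal binary CTW node (illustrative sketch)."""

    def __init__(self):
        self.a = 0            # zeros observed at this context
        self.b = 0            # ones observed at this context
        self.pe = 1.0         # KT probability of the bits seen here
        self.pw = 1.0         # weighted (mixed) probability
        self.children = {}    # keyed by the next-older context bit

    def update(self, ctx, bit):
        """Feed one bit observed after context `ctx` (most recent bit last)."""
        # Sequential KT update: P(next = 1) = (b + 1/2) / (a + b + 1)
        p1 = (self.b + 0.5) / (self.a + self.b + 1.0)
        self.pe *= p1 if bit else (1.0 - p1)
        if bit:
            self.b += 1
        else:
            self.a += 1
        if ctx:  # internal node: recurse toward the deeper context
            child = self.children.setdefault(ctx[-1], CTWNode())
            child.update(ctx[:-1], bit)
            prod = 1.0
            for c in self.children.values():
                prod *= c.pw
            # The 0.5/0.5 prior: "stop here" vs. "go deeper"
            self.pw = 0.5 * self.pe + 0.5 * prod
        else:    # depth limit reached: nothing deeper to mix in
            self.pw = self.pe
```

Contexts are truncated to the chosen depth before each update, which is how the depth parameter enters; the ratio of root `pw` values with and without a candidate next bit gives the predictive probability.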

Practical Implications

Depth selection. When the true order is known, set depth equal to the order or slightly higher. When unknown, set depth conservatively high. There’s no penalty for overestimating.

Data requirements. Higher depths require more data to populate deep contexts, but CTW handles sparse contexts gracefully. Rule of thumb: you need roughly 10x as many samples as the number of contexts at the target depth.
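That rule of thumb is simple arithmetic for binary alphabets, where there are 2^depth contexts at the target depth:

```python
def recommended_train_length(depth, samples_per_context=10):
    """Rule-of-thumb training length: ~10x the number of binary
    contexts (2 ** depth) at the target depth."""
    return samples_per_context * 2 ** depth

recommended_train_length(6)   # 640
recommended_train_length(10)  # 10240
```

By this measure, the 50,000-bit training set comfortably covers every depth used in these experiments.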

Confidence in theory. The experiments validate CTW’s theoretical properties. Bayesian model averaging works as advertised. The 0.5/0.5 prior provides appropriate regularization. Optimality for tree sources holds empirically.

Lessons Learned

Trust but verify. The initial “surprising” result seemed plausible. Deep models overfitting is a common story. But it contradicted well-established theory, the magnitude of degradation was suspicious, and I skipped independent implementation checks. I should have been more skeptical of my own code.

Theory as a guide. When experiments contradict established theory, the experiment is often wrong. Theory provides expectations; deviations deserve scrutiny, not celebration.

Reproducibility matters. The bug was found when I tried to reproduce results with a different implementation. If I’d only run the original code, the error might have persisted.

Simple experiments can fail. This wasn’t a complex distributed system. It was a few hundred lines of Python evaluating a well-understood algorithm. Bugs hide everywhere.

Running the Experiments

To reproduce:

from experiments.ctw_multiorder import run_multiorder_experiment

results = run_multiorder_experiment(
    dgp_orders=[1, 2, 3, 4, 5, 6],
    ctw_depths=[1, 2, 3, 4, 5, 6, 8, 10],
    markov_type="sparse",
    train_length=50000,
    test_length=10000,
    num_seeds=5,
)

Or via command line:

# Quick sanity check
python -m experiments.ctw.scripts.ctw_multiorder --quick

# Full experiment
python -m experiments.ctw.scripts.ctw_multiorder

Summary of Results

Claim                                             Status
------------------------------------------------  -----------------------
CTW achieves theoretical max when depth >= order  Confirmed
CTW degrades with excessive depth                 Refuted (was a bug)
Bayesian model averaging provides robustness      Confirmed
CTW is optimal for tree sources                   Confirmed (empirically)

Series Conclusion

This series traced sequential prediction from Solomonoff’s incomputable ideal to practical algorithms:

  1. The problem: Predict the next symbol given history
  2. The ideal: Solomonoff induction, optimal but incomputable
  3. The framework: Bayesian model averaging over hypotheses
  4. The model class: Markov processes and tree sources
  5. The classical approach: N-grams and their limitations
  6. The trade-offs: Sample efficiency vs. flexibility
  7. The modern approach: Neural language models at scale
  8. The validation: CTW achieves its theoretical guarantees

CTW represents a sweet spot: provably optimal within a rich model class, computationally efficient, and interpretable. For the right problems (binary sequences, limited data, need for guarantees), it remains the right choice even in the age of LLMs.

Further Reading

  • Willems, F. M. J., et al. (1995). “The context-tree weighting method: basic properties.”
  • The code repository contains the full CTW implementation and experimental scripts.
