This is a variable-length n-gram language model over the prose of this blog, tokenized with SmolLM2’s tokenizer. Every word, sentence, and paragraph I’ve published becomes a sequence of token ids; a suffix array indexes every contiguous span that has ever occurred. Type a prefix and you see two things: the longest suffix of your input (in tokens) that the corpus has actually seen, and the empirical distribution over which token came next.
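To make the lookup concrete, here is a minimal sketch, assuming the corpus is a flat `Int32Array` of token ids and the suffix array is an `Int32Array` of corpus positions sorted by the token sequence starting at each one (names and shapes are illustrative, not the site’s actual code): walk the input’s suffixes from longest to shortest, binary-search each one, and tally whatever follows the first one that matches.

```ts
// Sketch only: `corpus` is every published token id in order; `sa` lists every
// corpus position, sorted lexicographically by the suffix that starts there.

// Three-way compare of the span at corpus[pos] against `pattern`,
// treating `pattern` as a prefix (0 means "starts with").
function compareAt(corpus: Int32Array, pos: number, pattern: number[]): number {
  for (let i = 0; i < pattern.length; i++) {
    const c = pos + i < corpus.length ? corpus[pos + i] : -1; // running off the end sorts low
    if (c !== pattern[i]) return c < pattern[i] ? -1 : 1;
  }
  return 0;
}

// Half-open range [start, end) of suffix-array entries whose suffixes start
// with `pattern`: two binary searches, O(m log n) token comparisons total.
function findRange(corpus: Int32Array, sa: Int32Array, pattern: number[]): [number, number] {
  let lo = 0, hi = sa.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (compareAt(corpus, sa[mid], pattern) < 0) lo = mid + 1; else hi = mid;
  }
  const start = lo;
  hi = sa.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (compareAt(corpus, sa[mid], pattern) <= 0) lo = mid + 1; else hi = mid;
  }
  return [start, lo];
}

// Longest suffix of the typed prefix that the corpus has seen, plus counts of
// the token that followed each occurrence.
function predict(corpus: Int32Array, sa: Int32Array, query: number[]) {
  for (let k = query.length; k >= 1; k--) {
    const suffix = query.slice(query.length - k);
    const [start, end] = findRange(corpus, sa, suffix);
    if (end > start) {
      const counts = new Map<number, number>();
      for (let i = start; i < end; i++) {
        const next = sa[i] + k;
        if (next < corpus.length) counts.set(corpus[next], (counts.get(corpus[next]) ?? 0) + 1);
      }
      return { matched: suffix, counts };
    }
  }
  return { matched: [] as number[], counts: new Map<number, number>() };
}
```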
The model is the corpus. There is no training; queries are O(m log n) binary searches over a sorted array of token positions. About 6 MB downloads once and is cached. Tokenization uses the same vocabulary SmolLM2 was trained on, so eventually this distribution will mix directly into its logits as a register prior, with no alignment seam.
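As for that eventual mixing, a hedged sketch of what “no alignment seam” buys: because the count distribution and SmolLM2’s output live over the same token ids, they can be interpolated index for index, with no vocabulary remapping. The interpolation weight below is a placeholder assumption, not a value from this post.

```ts
// Sketch of the register prior, assuming `logits` are SmolLM2's next-token
// logits and `counts` are the suffix-array tallies from predict() above.
function mixWithPrior(logits: Float32Array, counts: Map<number, number>, weight = 0.3): Float32Array {
  // Softmax over the model's logits.
  let max = -Infinity;
  for (const x of logits) if (x > max) max = x;
  const probs = new Float32Array(logits.length);
  let z = 0;
  for (let i = 0; i < logits.length; i++) { probs[i] = Math.exp(logits[i] - max); z += probs[i]; }
  for (let i = 0; i < logits.length; i++) probs[i] /= z;

  // Normalize the corpus counts over the same vocabulary; the ids already line up.
  let total = 0;
  for (const c of counts.values()) total += c;
  if (total === 0) return probs; // nothing matched: fall back to the model alone

  // Linear interpolation: (1 - weight) * model + weight * corpus prior.
  const mixed = new Float32Array(logits.length);
  for (let i = 0; i < logits.length; i++) {
    mixed[i] = (1 - weight) * probs[i] + weight * ((counts.get(i) ?? 0) / total);
  }
  return mixed; // a probability distribution, ready for sampling
}
```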
The point isn’t to be smart. It’s to make voice visible. When the model says “the next token is the word ‘this’, 12% of the time,” that’s me, speaking from past evidence. The longest matched span is whatever the corpus has actually seen me write. Sort of a mirror.