Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.
The problem with fixed n-grams
Traditional n-gram models use fixed context lengths and blow up exponentially. A 5-gram model over a 50,000-word vocabulary needs to store up to 50,000^5 ≈ 3.1 × 10^23 possible patterns. Even at one byte per count, that is roughly 312 zettabytes. Nobody does this.
Infinigram uses suffix arrays instead:
- O(n) space: Linear in corpus size, not vocabulary size
- O(m log n) queries: Fast pattern matching for any context length
- Variable-length matching: Automatically uses the longest matching context
For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.
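To make those complexity claims concrete, here is a from-scratch sketch (not the py-infinigram implementation): a suffix array is just the sorted list of suffix start positions, and counting a pattern is two binary searches, each comparison costing O(m):

```python
def build_suffix_array(corpus):
    # Sort suffix start positions by the suffix they begin; O(n) entries.
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])

def pattern_range(corpus, sa, pattern):
    # Binary-search the suffix array for the block of suffixes that start
    # with `pattern`; each comparison is O(m), so a query is O(m log n).
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                                   # lower bound
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                                   # upper bound
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo                                 # matches occupy [start, lo)

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
sa = build_suffix_array(corpus)
lo, hi = pattern_range(corpus, sa, [2, 3])
# hi - lo == 3: the pattern [2, 3] occurs three times in the corpus
```

Because every suffix starting with the pattern sits in one contiguous block of the sorted array, counting any pattern of any length is the same two searches.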
How it works
Given a context, Infinigram finds the longest matching suffix in the training corpus:
```python
from infinigram import Infinigram

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])

# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}
```
Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.
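The counting scheme is easy to sketch from scratch. The version below uses a brute-force scan rather than a suffix array, but shows the backoff logic: try the full context, then progressively shorter suffixes, and estimate probabilities from whatever follows the matches:

```python
from collections import Counter

def predict_next(corpus, context):
    # Back off from the longest suffix of the context to shorter ones
    # until some occurrence is found, then count the following tokens.
    for start in range(len(context)):
        suffix = context[start:]
        counts = Counter(
            corpus[i + len(suffix)]
            for i in range(len(corpus) - len(suffix))
            if corpus[i:i + len(suffix)] == suffix
        )
        if counts:
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return {}

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
probs = predict_next(corpus, [9, 2, 3])
# [9, 2, 3] never occurs, so it backs off to [2, 3]:
# {4: 0.666..., 5: 0.333...}
```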
LLM probability mixing
The practical application I care about most: grounding LLM outputs without retraining.
```python
# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram
```
This gives you:
- Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
- Hallucination reduction by anchoring to actual corpus content.
- Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.
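The interpolation above is a one-liner when both models return next-token distributions as dicts; here is a minimal sketch over the union of their supports:

```python
def mix(p_llm, p_corpus, alpha=0.7):
    # Linear interpolation of two next-token distributions.
    # alpha = 1.0 is pure LLM; alpha = 0.0 is pure corpus.
    vocab = set(p_llm) | set(p_corpus)
    return {t: alpha * p_llm.get(t, 0.0) + (1 - alpha) * p_corpus.get(t, 0.0)
            for t in vocab}

p = mix({"cat": 0.6, "dog": 0.4}, {"cat": 0.9, "mat": 0.1}, alpha=0.5)
# {"cat": 0.75, "dog": 0.2, "mat": 0.05} — still a valid distribution
```

Since both inputs sum to 1, the mixture does too, for any alpha in [0, 1].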
Projections as inductive biases
I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.
- Runtime transforms: lowercase normalization, stemming, synonym expansion
- Corpus augmentations: data augmentation, paraphrasing
This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as “the same.”
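A tiny illustration of a runtime projection (this is my own toy example, not library code): with exact matching, a query in a different case finds nothing; projecting both query and corpus into lowercase changes what counts as "the same":

```python
def lowercase_projection(tokens):
    # Runtime transform: map query and corpus into one normalized space.
    return [t.lower() for t in tokens]

corpus = "The cat sat on the mat".split()
query = ["THE", "CAT"]

raw = [i for i in range(len(corpus) - 1) if corpus[i:i+2] == query]
proj_corpus = lowercase_projection(corpus)
proj_query = lowercase_projection(query)
hits = [i for i in range(len(proj_corpus) - 1) if proj_corpus[i:i+2] == proj_query]
# raw == []    — literal matching treats "THE" and "The" as distinct
# hits == [0]  — under the projection they are identified
```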
Interactive REPL
Infinigram includes an interactive REPL for exploration:
```
infinigram-repl

infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20
```
Future: LangCalc integration
Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:
```python
# Compose models algebraically
model = 0.7 * llm + 0.2 * wiki_infinigram + 0.1 * code_infinigram
```
Mix neural and corpus-based models with explicit control over domain influence.
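LangCalc's actual API isn't shown here, but one natural way to implement such an algebra in Python is operator overloading; the sketch below (with a toy constant-distribution model standing in for real components) supports exactly the `0.7 * llm + 0.3 * wiki` style of expression:

```python
class Mixture:
    # A weighted sum of models, each exposing predict(context) -> {token: prob}.
    def __init__(self, parts):
        self.parts = parts                       # list of (weight, model) pairs

    def predict(self, context):
        out = {}
        for w, m in self.parts:
            for tok, p in m.predict(context).items():
                out[tok] = out.get(tok, 0.0) + w * p
        return out

    def __add__(self, other):
        return Mixture(self.parts + other.parts)

    def __rmul__(self, scalar):                  # enables `0.7 * model`
        return Mixture([(scalar * w, m) for w, m in self.parts])

class Toy:
    # Stand-in model that always predicts the same fixed distribution.
    def __init__(self, dist):
        self.dist = dist
    def predict(self, context):
        return dict(self.dist)

llm = Mixture([(1.0, Toy({"x": 1.0}))])
wiki = Mixture([(1.0, Toy({"y": 1.0}))])
combo = 0.7 * llm + 0.3 * wiki
# combo.predict([]) == {"x": 0.7, "y": 0.3}
```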
Resources
- PyPI: pip install py-infinigram
- GitHub: github.com/queelius/infinigram
- Paper: Infinigram Technical Report
- Related: LangCalc Project
Background
This builds on ideas from a SLUUG talk on LLMs where I demonstrated arbitrary-size n-grams using recursive dictionaries for a crude expression evaluator. Infinigram takes those ideas further with suffix arrays and the projection framework for generalization.