Infinigram: Variable-Length N-grams via Suffix Arrays

December 3, 2025

Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.

The problem with fixed n-grams

Traditional n-gram models use fixed context lengths and blow up exponentially. A 5-gram model over a 50,000-word vocabulary needs to store up to \(50000^5\) possible patterns. That is roughly 312 petabytes. Nobody does this.

Infinigram uses suffix arrays instead:

  • O(n) space: Linear in corpus size, not vocabulary size
  • O(m log n) queries: Fast pattern matching for any context length
  • Variable-length matching: Automatically uses the longest matching context

For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.

How it works

Given a context, Infinigram finds the longest matching suffix in the training corpus:

from infinigram import Infinigram

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])

# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}

Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.

LLM probability mixing

The practical application I care about most: grounding LLM outputs without retraining.

# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram

This gives you:

  • Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
  • Hallucination reduction by anchoring to actual corpus content.
  • Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.

Projections as inductive biases

I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.

  • Runtime transforms: lowercase normalization, stemming, synonym expansion
  • Corpus augmentations: data augmentation, paraphrasing

This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as “the same.”

Interactive REPL

Infinigram includes an interactive REPL for exploration:

infinigram-repl

infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20

Future: LangCalc integration

Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:

Read More

Language Calculus: An Algebraic Framework for LLM Composition

October 7, 2025

What if we could compose language models the way we compose functions in mathematics? What if there was an algebra of language models?

Language Calculus (langcalc) is an algebraic framework for building and reasoning about language model systems.

The Problem with Current LLM Composition

Today, combining language models typically means:

  • Ad-hoc ensembling techniques
  • Manual prompt chaining
  • Hardcoded decision trees
  • Black-box orchestration layers

There’s no principled way to reason about what these compositions do or how they behave. You wire things together and hope it works.

The Algebraic Approach

Language Calculus introduces operators with well-defined semantics:

Core Operators

M1 + M2     Mixture (weighted combination)
k * M       Scaling (temperature/probability adjustment)
M1 | M2     Maximum (most confident response)
M1 & M2     Minimum (most conservative response)
M1 ^ M2     Exclusive-or (diverse perspectives)
M ** t      Temperature adjustment
M ?? p      Threshold filtering
M >>> t     Truncation/limiting

Why This Matters

These operators satisfy algebraic laws:

(M1 + M2) + M3 = M1 + (M2 + M3)   # Associativity
M1 + M2 = M2 + M1                  # Commutativity
M + 0 = M                          # Identity
a * (M1 + M2) = a*M1 + a*M2        # Distributivity

This means we can transform, optimize, and reason about language model compositions algebraically. The laws aren’t just nice properties. They let you simplify compositions, prove equivalences, and optimize execution.

Practical Examples

Ensemble with Confidence Weighting

output = 0.4 * GPT4 + 0.3 * Claude + 0.3 * Llama

Expert Selection

code_task = (CodeLlama | GPT4) & SafetyModel

Diverse Brainstorming

ideas = CreativeModel ^ ConservativeModel ^ TechnicalModel
explore = Model ** 1.5
exploit = Model ** 0.2
adaptive = 0.7 * exploit + 0.3 * explore

Theoretical Foundations

The framework provides:

  • Formal semantics for each operator
  • Type system ensuring valid compositions
  • Equivalence relations for optimization
  • Normal forms for canonical representations

This lets us prove properties like:

  • Safety preservation under composition
  • Bias reduction through specific mixtures
  • Computational complexity bounds

Applications

Language Calculus enables:

  1. Automatic Optimization: Transform expensive compositions into equivalent cheaper ones
  2. Compositional Testing: Verify properties of complex systems from component properties
  3. Explainability: Understand what a composition does from its algebraic structure
  4. Meta-Learning: Learn optimal compositions for task families

Implementation

The paper includes:

Read More