Below you will find pages that utilize the taxonomy term “Langcalc”
Infinigram: Variable-Length N-grams via Suffix Arrays
December 3, 2025
Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.
The problem with fixed n-grams
Traditional n-gram models use fixed context lengths and blow up exponentially. A 5-gram model over a 50,000-word vocabulary needs to store up to \(50000^5\) possible patterns. That is roughly 312 petabytes. Nobody does this.
Infinigram uses suffix arrays instead:
- O(n) space: Linear in corpus size, not vocabulary size
- O(m log n) queries: Fast pattern matching for any context length
- Variable-length matching: Automatically uses the longest matching context
For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.
How it works
Given a context, Infinigram finds the longest matching suffix in the training corpus:
from infinigram import Infinigram
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)
# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])
# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}
Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.
LLM probability mixing
The practical application I care about most: grounding LLM outputs without retraining.
# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram
This gives you:
- Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
- Hallucination reduction by anchoring to actual corpus content.
- Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.
Projections as inductive biases
I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.
- Runtime transforms: lowercase normalization, stemming, synonym expansion
- Corpus augmentations: data augmentation, paraphrasing
This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as “the same.”
Interactive REPL
Infinigram includes an interactive REPL for exploration:
infinigram-repl
infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20
Future: LangCalc integration
Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:
Language Calculus: An Algebraic Framework for LLM Composition
October 7, 2025
What if we could compose language models the way we compose functions in mathematics? What if there was an algebra of language models?
Language Calculus (langcalc) is an algebraic framework for building and reasoning about language model systems.
The Problem with Current LLM Composition
Today, combining language models typically means:
- Ad-hoc ensembling techniques
- Manual prompt chaining
- Hardcoded decision trees
- Black-box orchestration layers
There’s no principled way to reason about what these compositions do or how they behave. You wire things together and hope it works.
The Algebraic Approach
Language Calculus introduces operators with well-defined semantics:
Core Operators
M1 + M2 Mixture (weighted combination)
k * M Scaling (temperature/probability adjustment)
M1 | M2 Maximum (most confident response)
M1 & M2 Minimum (most conservative response)
M1 ^ M2 Exclusive-or (diverse perspectives)
M ** t Temperature adjustment
M ?? p Threshold filtering
M >>> t Truncation/limiting
Why This Matters
These operators satisfy algebraic laws:
(M1 + M2) + M3 = M1 + (M2 + M3) # Associativity
M1 + M2 = M2 + M1 # Commutativity
M + 0 = M # Identity
a * (M1 + M2) = a*M1 + a*M2 # Distributivity
This means we can transform, optimize, and reason about language model compositions algebraically. The laws aren’t just nice properties. They let you simplify compositions, prove equivalences, and optimize execution.
Practical Examples
Ensemble with Confidence Weighting
output = 0.4 * GPT4 + 0.3 * Claude + 0.3 * Llama
Expert Selection
code_task = (CodeLlama | GPT4) & SafetyModel
Diverse Brainstorming
ideas = CreativeModel ^ ConservativeModel ^ TechnicalModel
Temperature Search
explore = Model ** 1.5
exploit = Model ** 0.2
adaptive = 0.7 * exploit + 0.3 * explore
Theoretical Foundations
The framework provides:
- Formal semantics for each operator
- Type system ensuring valid compositions
- Equivalence relations for optimization
- Normal forms for canonical representations
This lets us prove properties like:
- Safety preservation under composition
- Bias reduction through specific mixtures
- Computational complexity bounds
Applications
Language Calculus enables:
- Automatic Optimization: Transform expensive compositions into equivalent cheaper ones
- Compositional Testing: Verify properties of complex systems from component properties
- Explainability: Understand what a composition does from its algebraic structure
- Meta-Learning: Learn optimal compositions for task families
Implementation
The paper includes: