Infinigram: Variable-Length N-grams via Suffix Arrays
December 3, 2025
Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.
The problem with fixed n-grams
Traditional n-gram models use fixed context lengths, and the number of possible patterns grows exponentially with that length. A 5-gram model over a 50,000-word vocabulary has up to \(50000^5 \approx 3.1 \times 10^{23}\) possible patterns. Even at one byte per pattern, that is roughly 312 zettabytes. Nobody does this.
Infinigram uses suffix arrays instead:
- O(n) space: Linear in corpus size, not vocabulary size
- O(m log n) queries: Fast pattern matching for any context length
- Variable-length matching: Automatically uses the longest matching context
For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.
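To make the complexity claims concrete, here is a minimal sketch of a suffix array with binary-search lookup. It is not Infinigram's implementation: the function names `build_suffix_array` and `match_range` are made up for illustration, and the construction is the naive sort-all-suffixes approach rather than a linear-time algorithm. The lookup does O(log n) comparisons of at most m tokens each, which is where O(m log n) comes from.

```python
from typing import List, Sequence, Tuple


def build_suffix_array(tokens: Sequence[int]) -> List[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction (sorts full suffixes) chosen for clarity only.
    """
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))


def match_range(tokens: Sequence[int], sa: List[int],
                pattern: Sequence[int]) -> Tuple[int, int]:
    """Return the half-open range [lo, hi) of `sa` whose suffixes start with `pattern`."""
    m = len(pattern)
    pattern = list(pattern)

    def prefix(k: int) -> List[int]:
        # First m tokens of the k-th suffix in sorted order.
        return list(tokens[sa[k]:sa[k] + m])

    # Lower bound: first suffix whose m-token prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-token prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
sa = build_suffix_array(corpus)
lo, hi = match_range(corpus, sa, [2, 3])
positions = sorted(sa[lo:hi])  # corpus positions where [2, 3] begins
```

The index holds one integer per corpus token, so its size depends on corpus length and not on how many distinct n-grams the vocabulary could produce.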
How it works
Given a context, Infinigram finds the longest matching suffix in the training corpus:
from infinigram import Infinigram
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)
# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])
# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}
Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.
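The counting step can be sketched without the library. The version below is an illustration under stated assumptions, not the package's code: `find_occurrences` and `predict_next` are hypothetical names, matching is a plain linear scan instead of a suffix-array lookup, and no smoothing is applied, so tokens that never follow the match get no probability mass.

```python
from collections import Counter
from typing import Dict, List, Sequence


def find_occurrences(corpus: Sequence[int], pattern: Sequence[int]) -> List[int]:
    """Return every start index where `pattern` occurs in `corpus` (linear scan)."""
    m = len(pattern)
    pattern = list(pattern)
    return [i for i in range(len(corpus) - m + 1)
            if list(corpus[i:i + m]) == pattern]


def predict_next(corpus: Sequence[int], context: Sequence[int],
                 max_length: int = 10) -> Dict[int, float]:
    """Use the longest context suffix that occurs in the corpus with a follower."""
    context = list(context)[-max_length:] if max_length > 0 else []
    # Try the full context first, then drop tokens from the left.
    for start in range(len(context) + 1):
        suffix = context[start:]
        followers = Counter(
            corpus[i + len(suffix)]
            for i in find_occurrences(corpus, suffix)
            if i + len(suffix) < len(corpus)  # skip matches at the very end
        )
        if followers:
            total = sum(followers.values())
            return {token: count / total for token, count in followers.items()}
    return {}


corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
probs = predict_next(corpus, [2, 3])
# The unseen token 9 is dropped, so this falls back to the suffix [2, 3].
backoff_probs = predict_next(corpus, [9, 2, 3])
```

In this corpus `[2, 3]` is followed by 4 twice and by 5 once, so the unsmoothed estimate is 2/3 for token 4 and 1/3 for token 5.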
LLM probability mixing
The practical application I care about most: grounding LLM outputs without retraining.
# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram
This gives you:
- Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
- Hallucination reduction by anchoring to actual corpus content.
- Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.
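The mixing formula above can be sketched over plain dictionaries. This assumes both distributions are already available as token-to-probability maps over the same token IDs; the function name `mix` and the example numbers are invented for illustration, and how the LLM distribution is obtained is out of scope here.

```python
from typing import Dict


def mix(p_llm: Dict[int, float], p_corpus: Dict[int, float],
        alpha: float) -> Dict[int, float]:
    """Linear interpolation: alpha * P_llm + (1 - alpha) * P_corpus."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # A token missing from one distribution contributes probability 0 there.
    tokens = set(p_llm) | set(p_corpus)
    return {
        t: alpha * p_llm.get(t, 0.0) + (1.0 - alpha) * p_corpus.get(t, 0.0)
        for t in tokens
    }


# Hypothetical distributions over token IDs.
p_llm = {4: 0.5, 5: 0.25, 7: 0.25}
p_infinigram = {4: 0.75, 5: 0.25}
p_final = mix(p_llm, p_infinigram, alpha=0.5)
```

If both inputs sum to 1, so does the mixture, and `alpha` controls how far the corpus evidence can pull the LLM's distribution.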
Projections as inductive biases
I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.
- Runtime transforms: lowercase normalization, stemming, synonym expansion
- Corpus augmentations: data augmentation, paraphrasing
This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as “the same.”
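One way to picture a projection is as a function on token sequences that is applied before matching. The sketch below is my own illustration of the idea, not the framework itself: `lowercase`, `strip_plural` (a crude stand-in for stemming), `compose`, and `occurs` are hypothetical helpers, and the projection is applied to both the corpus and the query for simplicity.

```python
from typing import Callable, List

Projection = Callable[[List[str]], List[str]]


def lowercase(tokens: List[str]) -> List[str]:
    """Treat tokens that differ only in case as the same."""
    return [t.lower() for t in tokens]


def strip_plural(tokens: List[str]) -> List[str]:
    """Crude stand-in for stemming: drop a trailing 's' from longer words."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def compose(*projections: Projection) -> Projection:
    """Chain projections left to right into a single projection."""
    def apply(tokens: List[str]) -> List[str]:
        for projection in projections:
            tokens = projection(tokens)
        return tokens
    return apply


def occurs(corpus: List[str], query: List[str]) -> bool:
    """True if `query` appears as a contiguous run in `corpus`."""
    m = len(query)
    return any(corpus[i:i + m] == query for i in range(len(corpus) - m + 1))


corpus = "The Cat sat on the mat".split()
query = "the cats sat".split()
project = compose(lowercase, strip_plural)

raw_match = occurs(corpus, query)
projected_match = occurs(project(corpus), project(query))
```

The raw query finds nothing, while the projected query matches at the start of the corpus, because the projection has declared "The Cat" and "the cats" to be the same thing.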
Interactive REPL
Infinigram includes an interactive REPL for exploration:
infinigram-repl
infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20
Future: LangCalc integration
Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:
Everything is a File: Virtual Filesystems for CLI Data Tools
October 20, 2025
I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:
btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234
ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"
This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.
Traditional CRUD commands become unwieldy:
btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01
Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.
The insight: everything is a file
When I have thousands of source files organized in directories, I don’t run:
list-files --path /src/components/auth --extension .tsx
I run:
cd src/components/auth
ls *.tsx
The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).
What if my bookmarks, ebooks, and chat histories were filesystems?
The pattern
Over the past year, I built six Python tools that all follow the same architecture:
| Tool | Domain | VFS Root Structure |
|---|---|---|
| btk | Bookmarks | /bookmarks/, /tags/, /recent/, /domains/, /unread/, /popular/ |
| ebk | Ebook library | /books/, /authors/, /series/, /subjects/, /recent/, /unread/ |
| ctk | Chat conversations | /conversations/, /sources/, /topics/, /starred/, /recent/ |
| ghops | Git repositories | /repos/, /languages/, /topics/, /stars/, /recent/ |
| infinigram | N-gram models | /datasets/, /models/, /corpora/ |
| AlgoTree | Tree structures | /nodes/, /paths/, /subtrees/ |
Each tool provides:
- A stateless CLI for scripting: btk bookmark add URL, ebk import book.pdf
- An interactive shell with a virtual filesystem: btk shell, ebk shell, ctk chat
- POSIX-like commands: cd, ls, pwd, cat, mv, cp, rm, find, grep
- Unix pipeline support: most commands output JSONL by default for piping
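JSONL output is what makes the piping work: one JSON object per line, so a downstream script can process records as they arrive. Here is a rough sketch of a consumer. The field names `title` and `tags` are assumptions (the actual record schema is not shown here), and an in-memory stream stands in for piped stdin.

```python
import io
import json
from typing import Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for piped CLI output; the record fields are hypothetical.
sample = io.StringIO(
    '{"title": "Asyncio guide", "tags": ["programming/python/async"]}\n'
    '\n'
    '{"title": "Attention notes", "tags": ["research/ml/transformers"]}\n'
)

titles = [
    record["title"]
    for record in read_jsonl(sample)
    if any(tag.startswith("programming/") for tag in record["tags"])
]
```

In a real pipeline the same function would read from `sys.stdin` instead of the sample stream.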
The interesting part is the shell.
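To give a feel for the idea, here is a toy sketch of how hierarchical tags can be exposed as directories. It is not the code from btk or any of the other tools: the class `TagShell`, its methods, and the sample data are invented for illustration, and it only covers a /tags view with `cd`, `pwd`, and `ls`.

```python
from typing import Dict, List


class TagShell:
    """Toy virtual filesystem: slash-separated tags become directories under /tags."""

    def __init__(self, bookmarks: Dict[str, List[str]]):
        # Maps a bookmark title to its list of hierarchical tags.
        self.bookmarks = bookmarks
        self.cwd: List[str] = []  # path components below /tags

    def cd(self, path: str) -> None:
        """Change directory; handles '.', '..', and nested relative paths.

        For brevity this does not check that the target directory exists.
        """
        for part in path.split("/"):
            if part in ("", "."):
                continue
            if part == "..":
                if self.cwd:
                    self.cwd.pop()
            else:
                self.cwd.append(part)

    def pwd(self) -> str:
        return "/tags" + "".join("/" + part for part in self.cwd)

    def ls(self) -> List[str]:
        """List subdirectories (with a trailing '/') and bookmarks tagged exactly here."""
        depth = len(self.cwd)
        entries = set()
        for title, tags in self.bookmarks.items():
            for tag in tags:
                parts = tag.split("/")
                if parts[:depth] != self.cwd:
                    continue  # tag lives elsewhere in the tree
                if len(parts) == depth:
                    entries.add(title)  # tagged with exactly this directory
                else:
                    entries.add(parts[depth] + "/")  # next level down
        return sorted(entries)


bookmarks = {
    "Asyncio guide": ["programming/python/async"],
    "Flask tutorial": ["programming/python"],
    "Attention notes": ["research/ml/transformers"],
}
shell = TagShell(bookmarks)
top_level = shell.ls()
shell.cd("programming/python")
here = shell.pwd()
listing = shell.ls()
```

Nothing is stored as a tree. Directories are derived from the tags at listing time, so adding a tag to a bookmark makes it appear under a new path without any extra bookkeeping.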
Navigating 10,000 bookmarks
Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.