The experiment: a tiny SmolLM2 running in your browser. A token-level n-gram trained on every word I have published. Mix the two distributions in probability space at every generation step. Sample from the mix.
You can try it at /ask. There is a slider for how strongly the n-gram bleeds in.
I expected a small chatbot that sounds like me. I got a 135M-parameter model that uses my words to produce paranoid lorem ipsum, and a 1.7B model that mostly behaves like a competent chatbot decorated with my function words. Sometimes a phrase comes out that I might actually write. More often the output is grammatically OK but conceptually empty.
The result is weak. The architecture is interesting anyway, and I want to write up why.
The Pieces
Three things, all running locally:
- SmolLM2-Instruct in 135M, 360M, or 1.7B sizes. Q4_K_M GGUF, served from HuggingFace, run via Wllama. 90 MB to 1 GB on disk. CPU only.
- A token-level n-gram over my blog corpus: every post tokenized with SmolLM2’s BPE, indexed with a suffix array. 1.6 MB of source text, 470,000 tokens, 1.9 MB suffix array.
- A token-by-token sampling loop that mixes the LLM’s output distribution with the n-gram’s, in probability space.
The third piece is the part worth thinking about.
The Math
At each generation step the LLM produces p_llm(t) over its 49,152-token vocabulary, exposed by Wllama’s getLogits(-1). The n-gram, given the longest suffix of the current context that occurs in the corpus, produces a sparse p_ngram(t): nonzero on tokens it has seen following that context, zero elsewhere.
Linear combination:
p_mix(t) = α · p_ngram(t) + (1 − α) · p_llm(t)
That is the whole algorithm. The inner loop is small enough to fit on screen:
for (let step = 0; step < N; step++) {
  // Dense distribution from the LLM over the full vocabulary.
  const llm = await wllama.getLogits(-1);
  // Sparse distribution from the n-gram, keyed on the longest in-corpus suffix.
  const m = ig.longestSuffixMatch(context);
  const ngram = m.suffixLen > 0
    ? new Map(ig.continuations(m.matchedTokens).map(c => [c.token, c.prob]))
    : null;
  // Linear mixture: p_mix = α · p_ngram + (1 − α) · p_llm.
  const mix = new Map();
  for (const { token, p } of llm) {
    const pn = ngram?.get(token) ?? 0;
    mix.set(token, alpha * pn + (1 - alpha) * p);
  }
  const next = sample(mix, temperature);
  if (await wllama.isTokenEOG(next)) break;
  context.push(next);
  await wllama.decode([next]);
}
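The sample helper is not shown above. A minimal sketch, assuming temperature is applied in probability space as p^(1/T) followed by renormalization, which is equivalent to dividing logits by T:

function sample(dist, temperature) {
  // dist is a Map of token id -> mixed probability, as built in the loop above.
  const entries = [...dist.entries()].filter(([, p]) => p > 0);
  // Temperature in probability space: p^(1/T), then renormalize by the new total.
  const scaled = entries.map(([t, p]) => [t, Math.pow(p, 1 / temperature)]);
  const total = scaled.reduce((sum, [, p]) => sum + p, 0);
  // Inverse-CDF sampling over the rescaled mass.
  let r = Math.random() * total;
  for (const [t, p] of scaled) {
    r -= p;
    if (r <= 0) return t;
  }
  return scaled[scaled.length - 1][0]; // guard against floating-point underflow
}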
Tokens unseen by the n-gram have p_ngram = 0 and retain (1 − α) · p_llm in the mixture. They are not zeroed out, just unboosted.
α = 0 is the LLM. α = 1 is the n-gram, which loops as soon as the generated context drifts off-corpus. In between is a model running on LLM grammar with n-gram register.
What It Actually Produces
The 135M, prompt “Hello”, α = 0.1:
Hi! I’m working on a project with you! I’m trying to hide some of your work. I’ll have to wait for your response, but I’ll give you a brief summary to let you know what I’m working on. Hope you’re doing well!
The vocabulary is mine. I write about hiding data, encrypted search, comparing approaches. The 135M has no idea what those concepts mean. It glues my words into paragraphs that read like a paranoid academic argument with the visitor.
The 1.7B is better but the failure mode is more subtle. With α = 0.1 it produces grammatical paragraphs that drop in occasional Alex-shaped phrases (“compositional depth”, “rules as data”, project names from my corpus). At α = 0 it sounds like a generic chatbot. At α = 1 it sounds like me, in fragments, looping.
What mixing-at-sample-time actually does, I think, is pull register without pulling competence. The n-gram makes the surface me-shaped. It does not give the model any of the conceptual machinery that lets me mean something when I use those words. The output is texture, not thought.
Why The Architecture Is Interesting Anyway
The result is weak, but the structural properties are real:
The corpus stays uncompressed. Conventional fine-tuning is a lossy gradient encoding of your training distribution. Here, the corpus ships in full. The “fine-tune” is a static asset you can read.
Adding a document is a 30 ms rebuild. Not a training run.
Removing a document is the same operation reversed. No question of whether the model “forgot” it. There is no gradient that could remember.
Auditing memorization is a binary search. O(log n) to find every occurrence of a span in the corpus. With gradient-descent fine-tuning there is no equivalent query to run.
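Concretely, that audit is a standard suffix-array range query. A minimal sketch, assuming corpus is a flat array of BPE token ids and sa is its suffix array; the names are illustrative, not the actual implementation:

// Lexicographic comparison of the query against the suffix starting at corpus[start].
function compareAt(corpus, start, query) {
  for (let i = 0; i < query.length; i++) {
    const c = corpus[start + i];
    if (c === undefined || c < query[i]) return -1; // suffix sorts before the query
    if (c > query[i]) return 1;                     // suffix sorts after the query
  }
  return 0; // the query is a prefix of this suffix
}

// First suffix-array slot whose suffix is >= the query (> when upper is true).
function bound(corpus, sa, query, upper) {
  let lo = 0, hi = sa.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    const cmp = compareAt(corpus, sa[mid], query);
    if (cmp < 0 || (upper && cmp === 0)) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Every start offset of span in the corpus, in O(|span| · log n) comparisons.
function occurrences(corpus, sa, span) {
  const lo = bound(corpus, sa, span, false);
  const hi = bound(corpus, sa, span, true);
  return Array.from(sa.slice(lo, hi));
}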
Tokenizer parity makes model swaps free. All three SmolLM2 sizes share the BPE, so the same suffix array works against any of them.
These properties are why I think the idea has legs even though the demo doesn’t.
What’s Wrong With It
The failure modes I observed, mostly so I know what to fix:
Linear mixing is a soft prior, not a guide. It lifts tokens. It does not constrain meaning. The model still hallucinates; the prior just makes it hallucinate in my register.
Token-level matching is too local. The n-gram looks at the longest in-corpus suffix, which is usually a handful of tokens. Enough for register, not enough for thematic coherence. Sequential sentences can pull from completely different parts of the corpus.
The corpus is too small. 1.6 MB has only so many patterns. With α near 1 the model loops within seconds.
Tiny models cannot be steered into competence. The 135M under any prior is a syntax engine. With the prior it is a syntax engine with my vocabulary. Useful for a curio, not for a tool.
Temperature interacts unpredictably. Applying temperature to the mixed distribution is mathematically clean, but small changes to T produce large changes in output character that I cannot tune intuitively.
What Would Make It Work
Roughly in order of how much I think each would help:
Passage-level retrieval, also mixed in. A second prior at the level of full passages, retrieved via embedding search, with passage-level distributions mixed alongside the token-level n-gram. Token-level for register, passage-level for theme.
Logit-space mixing instead of probability-space. log p_mix = log p_llm + λ · log(p_ngram + ε) actively penalizes tokens with low corpus support, rather than only adding lift to high-support ones. Sharper steering.
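A sketch of how the mixing step inside the loop might change; lambda and EPS are knobs to tune, and llm and ngram are the same variables as above (not implemented):

const EPS = 1e-6;
const mix = new Map();
for (const { token, p } of llm) {
  const pn = ngram?.get(token) ?? 0;
  // Unsupported tokens pick up lambda * log(EPS), an explicit penalty,
  // instead of merely missing the boost as in the linear mixture.
  mix.set(token, Math.log(Math.max(p, EPS)) + lambda * Math.log(pn + EPS));
}
// mix now holds unnormalized log-probabilities; softmax (with temperature) before sampling.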
A different LLM family. SmolLM2 caps at 1.7B. Qwen 2.5 1.5B or Llama 3.2 3B are similar size and stronger on instruction-following. Requires retokenizing the corpus; the JS doesn’t change.
Context-dependent α. Right now α is fixed. When the longest match is short (the LLM is off-corpus), trust the LLM more. When the match is long (the LLM is producing in-corpus text), trust the prior more. Of the four, probably the change with the most upside.
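One possible schedule, driven by m.suffixLen from the loop above; the shape and constants are guesses, not measurements:

// Hypothetical α schedule: no trust in the prior when nothing matches,
// ramping toward alphaMax as the in-corpus suffix gets longer.
function alphaFor(suffixLen, alphaMax = 0.5, saturation = 8) {
  return alphaMax * Math.min(suffixLen, saturation) / saturation;
}
// In the loop: const alpha = alphaFor(m.suffixLen);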
A Note On Local-First
The reason the whole pipeline runs in the browser, even though the result is weak, is that this is the use case fine-tuning has historically been worst at: people who want a personal model from their own writing, without running a GPU cluster, without uploading their writing anywhere, and without ending up with an opaque-weights artifact they cannot inspect.
Sample-time mixing with a public LLM is one possible answer. The model is a public utility, the corpus is a static asset, the mixing is in user space. Anyone curious about what is being applied to the model can read the code.
This particular implementation is mediocre. The shape of the answer is right.
What This Is
Not a fine-tune. Not a chatbot that sounds like me. A research artifact. A thing that shows LLM and n-gram can be composed at sample time, with all the structural properties that follow, and that the architecture is worth more work even though this particular composition does not yet land.
The toy is at /ask, /alex, and /infinigram. Have at it.