
KL-Threshold Routing Between LLMs: What Speculative Decoding Already Solved

In late 2023 I started a paper called Mixture-of-Experts: KL-Divergence Threshold. The setup: run the small LLM by default, periodically check its next-token distribution against a larger reference model by computing KL divergence, and fall back to the large model when the small one drifts too far. I never finished the experiments. Going back to the draft now, I can see the kernel of the idea was right and the framing was wrong.

Here is what I had. Initialize generation with the large model so the prefix is high quality. Switch to the small model for the bulk of generation. Every k tokens, score the same prefix with the large model and compute

D_KL(P_large || Q_small) = sum_i P_large(i) * log(P_large(i) / Q_small(i))

over the shared vocabulary. If D_KL stays below the threshold, keep using the small model. If it crosses, fall back to the large model. The threshold is a hyperparameter you tune.
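In code, the loop looked roughly like this. This is a reconstruction, not the draft's actual implementation: the `small` and `large` callables returning next-token logits, the greedy decoding, and the one-way fallback are placeholders for illustration.

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """D_KL(P || Q) over a shared vocabulary, from raw logits."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    eps = 1e-12  # avoid log(0) where Q assigns ~zero mass
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def generate_with_fallback(prompt_ids, small, large, max_new=256, k=16, tau=0.5):
    """Hypothetical routing loop: small model by default, recheck every k tokens.

    `small(ids)` and `large(ids)` are assumed to return next-token logits
    for the given prefix; `tau` is the tuned KL threshold from the draft.
    """
    ids = list(prompt_ids)
    use_small = True
    for step in range(max_new):
        model = small if use_small else large
        logits = model(ids)
        ids.append(int(np.argmax(logits)))  # greedy decoding for simplicity
        if use_small and (step + 1) % k == 0:
            # Periodic check: score the same prefix with both models.
            if kl_divergence(large(ids), small(ids)) > tau:
                use_small = False  # fall back to the large model for good
    return ids
```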

Two problems, both fatal in retrospect.

First, the formal version already existed. Leviathan, Kalman, and Matias published Fast Inference from Transformers via Speculative Decoding on arXiv in November 2022, with the ICML paper appearing in 2023. Speculative decoding does the same idea properly. A small “draft” model proposes k tokens, the large model verifies by computing its own probabilities, and a rejection-sampling step on the ratio P_large(t) / Q_small(t) guarantees that the output distribution exactly matches the large model’s. No threshold. No periodic check. No quality drift to worry about. A cleaner statement of the same idea, with a guarantee instead of a hyperparameter.
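The verification step is worth writing out, because it is where the guarantee comes from. A minimal sketch, assuming both models expose full next-token probability vectors at the same position:

```python
import numpy as np

def accept_or_resample(p_large, q_small, token, rng):
    """One verification step of speculative decoding (Leviathan et al., 2022).

    `p_large` and `q_small` are the two models' next-token probability
    vectors at the same position; `token` was drafted by sampling q_small.
    Returns the token to emit, preserving the large model's distribution.
    """
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p_large[token] / q_small[token]):
        return token
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = np.maximum(p_large - q_small, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual))
```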

I should have known about it. It was already on arXiv when I wrote the draft. That is the first lesson, and the embarrassing one: search the literature before you write the methodology.

Second, I did not think carefully about KV cache. Switching models mid-generation is not a switch, it is a re-prefill. The “other” model has no KV state for the prefix you just generated, so falling back means recomputing attention over the whole context. Once you put real numbers in, the cost equation changes completely. Speculative decoding sidesteps this by keeping both models’ KV caches alive throughout, which is part of why it is the right primitive and a periodic-recheck scheme is not.
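A back-of-envelope comparison makes the point. The parameter counts, prefix length, check interval, and the usual ~2 * params FLOPs-per-token heuristic are all assumptions picked for illustration, and this is the naive version of my scheme where the large model's cache is not kept alive between checks:

```python
# Illustrative FLOP comparison using the ~2 * params FLOPs-per-token heuristic
# for a dense transformer forward pass. All numbers are assumptions.
SMALL, LARGE = 1e9, 70e9        # assumed 1B draft and 70B reference model
flops = lambda params, tokens: 2 * params * tokens

prefix_len = 2000               # tokens already in context when a check runs
k = 16                          # tokens between periodic checks

# Periodic-recheck scheme: the large model holds no KV cache for tokens the
# small model produced, so each check is a fresh prefill over the prefix.
recheck_cost = flops(LARGE, prefix_len)

# Speculative decoding: the large model's cache stays warm, so verifying k
# drafted tokens is one forward over k new positions plus k small-model drafts.
spec_cost = flops(LARGE, k) + flops(SMALL, k)

print(f"one recheck: {recheck_cost:.2e} FLOPs, one spec-decode round: {spec_cost:.2e}")
# The ratio grows roughly like prefix_len / k, i.e. >100x with these numbers.
```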

What is still alive in the draft, after stripping the speculative-decoding-shaped chunk:

  1. KL over a concept space, not the token vocabulary. Two LLMs that disagree on tokens may agree on concepts. If you map distributions over {king, queen, ruler, monarch, ...} to a coarser distribution over a latent concept like ruler, divergence at that level is a different signal. Routing on concept-level agreement is not a special case of speculative decoding because the guarantee is about the token distribution, not the meaning. Section 8.3 of the original draft, the only piece I would actually want to write up; a rough sketch follows this list.

  2. The shared-vocabulary requirement. KL is undefined when two LLMs tokenize differently, and the draft waved at this. The interesting question is what to do when they do not share a vocab: subword realignment, distillation to a common vocab, or computing divergence in embedding space directly with optimal transport. This is closer to a measurement problem than a routing problem, but it is the prerequisite for any cross-family ensemble, and it is real research.

  3. Calibration-aware deferral for distilled small models. When does a fine-tuned 1B model actually agree with its 70B teacher, and how cheap is the agreement test? This is closer to selective prediction and learned-deferral literature than to speculative decoding, and there is room.
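Here is roughly what the concept-level check from item 1 could look like. The concept map, which collapses token ids into a handful of coarse bins, is entirely hypothetical; Section 8.3 never committed to a construction.

```python
import numpy as np

def concept_distribution(token_probs, concept_of):
    """Collapse a token-level distribution into a coarse concept-level one.

    `concept_of[i]` maps token id i to a concept id (a hypothetical mapping,
    e.g. the ids for king, queen, monarch all map to "ruler"); any token
    without a mapping contributes its mass to a catch-all concept.
    """
    n_concepts = max(concept_of.values()) + 2  # last index is the catch-all
    out = np.zeros(n_concepts)
    for tok, p in enumerate(token_probs):
        out[concept_of.get(tok, n_concepts - 1)] += p
    return out

def concept_kl(p_large_tokens, q_small_tokens, concept_of, eps=1e-12):
    """KL divergence between the two models' concept-level distributions."""
    p = concept_distribution(p_large_tokens, concept_of)
    q = concept_distribution(q_small_tokens, concept_of)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Two models can disagree sharply at the token level yet agree at the concept
# level: coarse-graining can only shrink KL (data-processing inequality), so
# the concept-level divergence is a strictly more forgiving routing signal.
```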

The general lesson, for me at least: when you have a routing-or-cascade-or-ensemble idea over LLMs, check what has been formalized already. The 2022 to 2024 window has a lot of “I had an idea like that, but the rigorous version already exists” land mines, and most of them are downstream of speculative decoding. Adding value now means going one layer up: routing over concepts, agreement under tokenizer mismatch, calibration-aware deferral. The token-level version is a solved problem.

The original LaTeX has been pushed to github.com/queelius/kl-foundation-ref and the repo archived, for the record. Moving on.
