Attention Weight Is Not Information Flow

Take the deeper pointer model from the last post and ask two questions. Does it compute the right thing? And can you see how by reading its attention? The answers come apart, and that gap is the whole lesson of mechanistic interpretability, in miniature.

Behaviorally, it is a perfect pointer

There is a test that does not depend on reading any attention map. Flip the bit at the addressed memory cell and check the output: it tracks the new value about 99.5% of the time. Flip any other cell and the output changes about 0.5% of the time. So the model reads the addressed cell and, to within that half a percent, nothing else. That is airtight, and it required looking at no internals at all.
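The flip test is easy to sketch. The code below is a minimal illustration, not the code behind the numbers above: `toy_model` is a hypothetical stand-in (an exact lookup) for the trained network, and the memory size and trial count are arbitrary choices. To test a real model, swap in its forward pass.

```python
import random

def toy_model(address, memory):
    """Hypothetical stand-in for the trained network: an exact lookup."""
    return memory[address]

def flip_test(model, n_cells=16, n_trials=1000, seed=0):
    """Return (rate the output changes when the addressed cell is flipped,
               rate the output changes when one other cell is flipped)."""
    rng = random.Random(seed)
    addressed_changed = 0
    other_changed = 0
    for _ in range(n_trials):
        memory = [rng.randint(0, 1) for _ in range(n_cells)]
        address = rng.randrange(n_cells)
        baseline = model(address, memory)

        # Flip the addressed cell: a true pointer's output follows it.
        flipped = list(memory)
        flipped[address] ^= 1
        addressed_changed += int(model(address, flipped) != baseline)

        # Flip one randomly chosen other cell: the output should not move.
        other = rng.choice([i for i in range(n_cells) if i != address])
        flipped = list(memory)
        flipped[other] ^= 1
        other_changed += int(model(address, flipped) != baseline)
    return addressed_changed / n_trials, other_changed / n_trials

addressed_rate, other_rate = flip_test(toy_model)
print(addressed_rate, other_rate)
```

An exact lookup scores 1.0 and 0.0. A trained model lands near those values rather than on them, and the distance from each is the thing to report.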

Mechanically, the clean circuit is gone

At small memory the attention was legible. One layer gathered the address, the next spiked on the addressed cell, and you could read the dereference straight off the maps. Scale the memory up and that legibility evaporates: the final query barely concentrates on the correct cell for most addresses. Stare at the attention and you mostly do not see the lookup happening. But the flip test says it is happening. So which is it?

The resolution is a fact worth internalizing: attention weight is not information flow. A head’s output is the attention-weighted sum of values, and the value projection can quietly suppress a cell you attend to hard, or carry real information through a cell you barely attend to at all. On top of that, the computation spreads across heads and layers. The attention map is a weak instrument the moment the work distributes.
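A tiny numeric example makes the point concrete. The scores and values below are hand-picked for illustration and are not taken from the model. Values are scalars for simplicity; with vector values the same effect shows up in the norm and direction of each projected value.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One head, one query, three key positions (illustrative numbers only).
scores = [4.0, 0.0, 0.0]   # the query attends hard to position 0
values = [0.01, 0.0, 5.0]  # but position 0's projected value is nearly zero

weights = softmax(scores)
# Each position's share of the head output is weight * value.
contributions = [w * v for w, v in zip(weights, values)]
output = sum(contributions)

for pos, (w, c) in enumerate(zip(weights, contributions)):
    print(f"position {pos}: weight={w:.3f} contribution={c:.4f}")
```

Position 0 receives more than 90% of the attention weight, yet position 2, with under 2%, contributes several times more to the output. Reading the weights alone would name the wrong cell.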

The causal probe that recovers it

The right tool is causal. Take a clean run and a run with the addressed bit flipped, then patch the clean residual stream back in at one (layer, position) at a time and see where restoring it restores the answer. Do that and a clean picture appears that the attention maps hid: the bit’s value is transported from the addressed cell toward the readout position, across the layers. The structure was there the whole time, in the residual stream, not in the attention weights.
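The patching loop itself is short. The sketch below runs it on a hand-written toy rather than the trained model: the two "layers" are hard-coded copies (addressed cell to a relay position, relay to the readout), and the residual stream is one scalar per position. All of that is an assumption made for illustration. In the real model the transport is learned and distributed, and the patch would be applied to the network's activations instead.

```python
def forward(memory, address, patch=None):
    """Hand-written toy, not the trained model.

    Positions 0..n-1 hold the memory cells, position n is a relay and
    position n+1 is the readout. `patch` is an optional
    (layer, position, value) that overwrites the residual at that site.
    Returns (output, list of residual states, one per layer).
    """
    n = len(memory)
    relay, readout = n, n + 1
    states = []

    def record(layer, resid):
        if patch is not None and patch[0] == layer:
            resid[patch[1]] = patch[2]
        states.append(list(resid))
        return resid

    resid = record(0, list(memory) + [0, 0])  # layer 0: embeddings
    nxt = list(resid)
    nxt[relay] = resid[address]               # layer 1: addressed cell -> relay
    resid = record(1, nxt)
    nxt = list(resid)
    nxt[readout] = resid[relay]               # layer 2: relay -> readout
    resid = record(2, nxt)
    return resid[readout], states

def patching_trace(memory, address):
    """Sites (layer, position) where patching in the clean residual
    restores the clean answer on the run with the addressed bit flipped."""
    corrupted = list(memory)
    corrupted[address] ^= 1
    clean_out, clean_states = forward(memory, address)
    restored = []
    for layer, state in enumerate(clean_states):
        for pos, value in enumerate(state):
            out, _ = forward(corrupted, address, patch=(layer, pos, value))
            if out == clean_out:
                restored.append((layer, pos))
    return restored

trace = patching_trace([0, 1, 1, 0], address=2)
print(trace)  # [(0, 2), (1, 4), (2, 5)]
```

The restoring site moves from the addressed cell (position 2) to the relay (position 4) to the readout (position 5) as the layer index increases. That diagonal is the transport signature, and it is recovered without consulting a single attention weight.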

What actually happened, including the wrong turns

I want to be honest about how this went, because the tidy version is a lie. I started with a clean hypothesis: that a particular layer “refines” the lookup the earlier layers begin. The data refuted it. I formed a second clean hypothesis, that the heads partition the address space among themselves. The data refuted that too. What survived is narrower and less satisfying: the model computes the right cell (the flip test is not negotiable), the extra depth buys a multi-stage assembly of the query, and the dereference itself is distributed in a way the attention maps will not show you. Two nice stories died to get there.

This is the induction-head story, small

None of this is exotic. It is the induction-head and in-context-learning picture compressed into a toy. In real language models the same primitive (find the relevant earlier position and copy what is useful from it) is spread across many heads, and ablating any single one barely moves the behavior because the others cover for it. The clean mechanistic accounts in that literature come from causal patching, not from looking at attention. The small-memory-to-large-memory step here is that whole arc in fast-forward: a legible circuit when the problem is tiny, illegible attention when it grows, and a causal probe that puts the mechanism back in view.

The lesson generalizes past this toy. Behavioral and causal legibility can stay airtight while the read-it-off-the-weights kind degrades with scale. That is why the field reaches for causal interventions, and why “head 7 is the lookup head” is a sentence you should not believe without a causal test behind it.

The flip test, the per-layer causal trace, the comparison to induction heads, and the two refuted hypotheses written up in full are in Chapter 7 of the book, Inductive Biases in Neural Networks.
