Attention Is a Learned Pointer Dereference

An attention head is a learned content-addressable lookup. A query at one position is compared against keys at every position, a softmax turns the match scores into weights, and the head returns the weighted average of the values. When one key matches far better than the rest, the softmax spikes on it and the head returns, in effect, that position’s value. Read it as a pointer dereference: the query says “I want the thing at the position that looks like this,” the keys advertise what each position is, and the softmax spikes on the one that matches. Out comes its value.
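A minimal sketch of that lookup for a single query, in NumPy. The function name `attend`, the `sharpness` multiplier, and the toy keys and values are illustrative assumptions, not code from the book; in a trained head the queries, keys, and values come from learned projections.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values, sharpness=1.0):
    """One query against n keys: returns (weighted value, attention weights)."""
    scores = sharpness * (keys @ query)   # one match score per position
    weights = softmax(scores)             # scores -> weights that sum to 1
    return weights @ values, weights      # weighted average of the values

# Toy setup (hypothetical): four positions with one-hot keys.
keys = np.eye(4)
values = np.array([[10.0], [20.0], [30.0], [40.0]])
query = keys[2]                           # "I want the thing that looks like key 2"

out, weights = attend(query, keys, values, sharpness=20.0)
print(weights.round(3), out)
```

With a large `sharpness` nearly all the weight lands on position 2 and the output is essentially that position's value. With `sharpness` near zero the same code returns a blur of all four values, which is the sense in which the dereference is soft.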

That is the entire structural prior of a transformer. Not locality, not recurrence, but a learned mechanism for deciding which positions matter, computed from the data, position by position, on the fly. The earlier architectures committed to a fixed notion of relevance (nearby pixels, recent timesteps). Attention commits to a way of figuring out relevance instead.

A task that forces the issue

A claim about a mechanism is worth more if you can build something that requires it. So here is a deliberately small task. Feed the model a string of memory bits, then an address, and ask it to output the memory bit at that address. To get it right the model has to read the address and fetch the bit it points to. That is content-addressable memory, stripped to the bone, and a model with no lookup mechanism cannot do better than guessing.
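One way to generate examples for such a task is sketched below. The layout is an assumption on my part: `2**n_addr_bits` memory bits first, then the address as `n_addr_bits` binary digits with the most significant bit first. The function name `make_example` is hypothetical, and the chapter's actual encoding may differ.

```python
import numpy as np

def make_example(n_addr_bits, rng):
    """Return (tokens, target): memory bits, then address bits; target is the addressed bit."""
    n_mem = 2 ** n_addr_bits
    memory = rng.integers(0, 2, size=n_mem)      # random memory contents
    address = int(rng.integers(0, n_mem))        # which cell to fetch
    # Address written in binary, most significant bit first (assumed convention).
    addr_bits = [(address >> i) & 1 for i in reversed(range(n_addr_bits))]
    tokens = np.concatenate([memory, addr_bits])
    return tokens, int(memory[address])

rng = np.random.default_rng(0)
tokens, target = make_example(3, rng)
print(tokens, target)
```

Because the memory bits are random, nothing in the sequence predicts the target except the cell the address points to, so a model that cannot route on the address has nothing to go on.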

A transformer learns it, and watching attention while it does is the point: the relevant head, at the output position, puts almost all its weight on the addressed memory cell. The pointer dereference is visible.

Depth is the number of lookups you can compose

One attention layer is not enough, for a reason that is structural rather than about capacity. The query at the final position is built only from that position’s own token, so it does not yet know the full address (the address bits live at other positions). It cannot route to a cell whose identity depends on bits it has not seen. Two layers fix this: the first layer gathers the address bits into one place, and the second layer uses the assembled address to do the fetch. Transformer depth is, quite literally, how many lookups you can chain. Each layer is one more stage of dynamic routing.
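The two-stage structure can be sketched by hand, without any training. Everything here is a constructed illustration under assumptions of mine, not the weights a trained model finds: stage one averages position-tagged address bits into a single ±1 vector, and stage two uses that vector as a query against keys that encode each memory cell's index as a ±1 binary code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def index_codes(n_addr_bits):
    """±1 binary code for every cell index, most significant bit first; shape (2**n, n)."""
    idx = np.arange(2 ** n_addr_bits)
    bits = (idx[:, None] >> np.arange(n_addr_bits - 1, -1, -1)) & 1
    return 2.0 * bits - 1.0

def two_stage_lookup(memory, addr_bits, sharpness=10.0):
    n = len(addr_bits)
    # Stage 1 (stand-in for layer 1): attend uniformly over the address positions.
    # Each address position's value carries its bit, as ±1, in its own slot.
    slot_values = np.eye(n) * (2.0 * np.asarray(addr_bits) - 1.0)
    uniform = np.full(n, 1.0 / n)
    assembled = n * (uniform @ slot_values)      # the full address, in one place
    # Stage 2 (stand-in for layer 2): the assembled address is the query.
    scores = sharpness * (index_codes(n) @ assembled)
    weights = softmax(scores)
    return weights @ np.asarray(memory, dtype=float), weights

memory = [0, 1, 1, 0, 1, 0, 0, 1]
out, weights = two_stage_lookup(memory, addr_bits=[1, 0, 0])   # address 4
print(int(weights.argmax()), round(float(out), 3))
```

The score for a cell is highest only when every address bit agrees with that cell's code, which is why stage two cannot run until stage one has put all the bits in one vector.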

The experiment that earned its keep

Then I scaled the memory up, expecting it to just keep working, and it did not, which turned out to be the most useful thing in the chapter. At a certain memory size a two-layer model stops solving the task, and here is the part worth sitting with: making it wider does not help, and training it longer does not help. It sits barely above chance no matter what.

Add one layer, holding the width and the training budget fixed, and it transitions from stuck to solved. Depth is the knob. Width is not. That is exactly what the composed-lookup story predicts: a larger memory needs a sharper, more assembled query than two layers can build, and the third layer is where the extra assembly happens. It is a clean little result about when an architecture’s prior is actually reachable, and it sets up an uncomfortable question for the next post.
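A toy calculation shows why a larger memory could demand a sharper query. It is an illustration under my own assumptions (cells keyed by ±1 binary codes, a query that matches one cell exactly, a fixed `sharpness` multiplier), not the chapter's analysis or a measurement from its experiment. At fixed sharpness, the softmax weight on the matching cell shrinks as the number of address bits grows.

```python
import numpy as np

def match_weight(n_addr_bits, sharpness):
    """Softmax weight on the exactly matching cell among 2**n ±1-coded cells."""
    idx = np.arange(2 ** n_addr_bits)
    bits = (idx[:, None] >> np.arange(n_addr_bits)) & 1
    codes = 2.0 * bits - 1.0                 # one ±1 code per memory cell
    query = codes[0]                         # a query that matches cell 0 exactly
    scores = sharpness * (codes @ query)
    e = np.exp(scores - scores.max())
    return float(e[0] / e.sum())

for n in (2, 4, 6, 8):
    print(n, round(match_weight(n, sharpness=0.5), 4))
```

In this construction the weight has a closed form, `(1 / (1 + exp(-2 * sharpness))) ** n_addr_bits`, so every added address bit multiplies the weight on the right cell by a factor below one unless the query gets sharper to compensate.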

Because here is the thing. The two-layer model’s attention plainly showed the lookup. Does the deeper model’s attention show it too, once the task gets hard? That is where the story stops being tidy.

The attention mechanism in code, the structural argument for why one layer fails, and the controlled depth-versus-width experiment are in Chapter 6 of the book, Inductive Biases in Neural Networks.