Induction-Heads

Browse posts by tag

Attention Weight Is Not Information Flow

The trained pointer model reads exactly the right memory cell, provably. Its attention barely shows where. The gap, and the causal probe that closes it.