Lost in Latent Space

Induction heads: the cleanest mechanistic story we have, and why the field overrated it

The induction-head story (Olsson et al., 2022) is canon now: in transformer LMs, you reliably find heads that implement the pattern “if I saw [A][B] earlier in the sequence and now I see [A], predict [B].” It’s a two-head circuit: a previous-token head writes each token’s predecessor into that position’s residual stream, then a downstream induction head, looking out from the current [A], attends to the position the prev-token head has tagged as “[A] just happened” (i.e., the earlier [B]) and copies that token as its prediction.
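
To make the rule concrete, here it is as pure Python over token IDs, stripped of all the attention machinery. The function name and the “most recent match wins” tie-break are my own choices for the sketch, not anything the paper commits to:

```python
def induction_prediction(tokens: list[int]) -> int | None:
    """Prefix-match-and-copy: if the current token appeared earlier,
    predict whatever followed its most recent earlier occurrence."""
    current = tokens[-1]
    # Scan backwards for an earlier occurrence of the current token [A] ...
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # ... and predict the [B] that followed it.
            return tokens[i + 1]
    return None  # no earlier occurrence: the heuristic stays silent

# "The cat sat . The cat" -> predicts "sat"
print(induction_prediction([5, 17, 42, 9, 5, 17]))  # -> 42
```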

This is genuinely beautiful. It’s mechanistic interpretability’s first reproducible “we found a circuit” moment. The phase change in the loss curve right when induction heads form is real. The story explains in-context learning at small scale.

But the field, in 2022–2023, took this as evidence that transformers are basically running a stack of legible algorithms you can read off the weights with enough patience. That turned out to be an overreach. The refined picture five years later:

  1. Most attention heads are not “doing” anything that clean. Wang et al.’s IOI circuit and Conmy et al.’s automated circuit discovery show that real model behavior involves dozens of heads in superposition — overlapping, redundant, partially correlated. Induction is the rare exception that’s clean enough to read by hand; it’s not representative.

  2. Induction in toy LMs ≠ induction in frontier LMs. The Olsson result is sharpest in 2L attention-only models. By 70B+, the “induction head” is more like a distributed function across many heads, MLPs, and stuff that doesn’t factor neatly into the attention/MLP split.

  3. The phase change wasn’t the real discovery; the toolkit was. The lasting technical export was logit attribution plus activation patching, not the circuit itself. That toolkit is what unlocked everything after — IOI (indirect object identification), refusal directions, SAE features. (A minimal sketch of the patching mechanics follows this list.)
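
Since item 3 leans so hard on that toolkit, here is a minimal sketch of the activation-patching mechanics in PyTorch: cache an intermediate activation from a clean run, splice it into a run on a corrupted prompt via a forward hook, and see how much of the clean behavior comes back. Everything here (the ToyLM stand-in, the layer choice, the prompts, the scoring) is an illustrative assumption, not the setup from any of the papers; with a real model you would patch specific heads or residual-stream positions rather than a whole block.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Deliberately tiny stand-in model: embedding -> two residual blocks -> unembedding.
class ToyLM(nn.Module):
    def __init__(self, vocab=50, d=32, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
             for _ in range(n_layers)]
        )
        self.unembed = nn.Linear(d, vocab)

    def forward(self, tokens):
        x = self.embed(tokens)
        for block in self.blocks:
            x = x + block(x)       # residual stream
        return self.unembed(x)     # [batch, seq, vocab] logits

model = ToyLM().eval()
clean = torch.tensor([[3, 7, 11, 7]])     # prompt where induction would predict 11
corrupt = torch.tensor([[3, 7, 11, 19]])  # minimally different prompt

# 1) Run the clean prompt and cache the chosen activation.
cache = {}
layer = model.blocks[1]
handle = layer.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
with torch.no_grad():
    clean_logits = model(clean)
handle.remove()

# 2) Run the corrupted prompt, but splice the cached clean activation back in
#    (a forward hook that returns a value replaces the module's output).
handle = layer.register_forward_hook(lambda m, i, o: cache["act"])
with torch.no_grad():
    patched_logits = model(corrupt)
handle.remove()

with torch.no_grad():
    corrupt_logits = model(corrupt)

# 3) Compare a behavioral metric across the three runs. With this random,
#    untrained toy the numbers are meaningless; the point is the mechanics.
target = 11  # the token an induction-style completion would give after "... 7"
score = lambda logits: logits[0, -1, target].item()
print(f"clean    : {score(clean_logits):.3f}")
print(f"corrupted: {score(corrupt_logits):.3f}")
print(f"patched  : {score(patched_logits):.3f}")  # how much clean behavior returns?
```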

What’s actually robust: structured prefix-matching matters at every scale, models clearly improve at it with training, and it’s tightly coupled to in-context learning. Whether it’s two heads or two thousand neurons in superposition is an implementation detail.
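
One way to pin down “matters at every scale” is the repeated-random-tokens probe from the paper: feed a sequence of random tokens repeated once, and check how much attention each head puts on the token that followed the earlier occurrence of the current token. The sketch below assumes you have already pulled attention patterns out of whatever model you are probing; the function name and the synthetic example are mine.

```python
import torch

def prefix_matching_score(attn: torch.Tensor, period: int) -> torch.Tensor:
    """Per-head prefix-matching score on a sequence of random tokens
    repeated with the given period.

    attn: [n_heads, seq, seq] attention weights from the layer you are
    inspecting. For a position t in the second repeat, an induction head
    should put its weight on t - period + 1: the token *after* the
    earlier copy of the current token.
    """
    n_heads, seq, _ = attn.shape
    positions = torch.arange(period, seq)            # second repeat onwards
    sources = positions - period + 1                 # where induction should look
    return attn[:, positions, sources].mean(dim=-1)  # one score per head

# Toy usage with synthetic attention: head 0 is a perfect induction head,
# head 1 attends uniformly, so their scores separate cleanly.
seq, period = 16, 8
attn = torch.zeros(2, seq, seq)
for t in range(period, seq):
    attn[0, t, t - period + 1] = 1.0   # exact prefix-matching pattern
attn[1] = torch.ones(seq, seq) / seq   # uninformative baseline
print(prefix_matching_score(attn, period))  # -> tensor([1.0000, 0.0625])
```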

The lesson for new readers: read the paper, but read it for the methods. The induction head itself is a clean instance of a fuzzier phenomenon — and the fuzzy version is what the frontier models are actually doing.

What’s next

Anthropic’s “A Mathematical Framework for Transformer Circuits” (Elhage et al., 2021) for the residual-stream framing this all sits on. ARENA’s interpretability curriculum if you want hands-on practice. Olsson et al., “In-context Learning and Induction Heads” (2022) for the canonical story. Wang et al., “Interpretability in the Wild” (2022) for the IOI circuit, which gets you off the toy-model island and into the messier reality.