def transformer_block(residual_stream, attn_heads, mlp):
    # residual_stream: shape [seq_len, d_model]
    # This is the only thing that flows through the network.
    # (The LayerNorms applied before each sub-layer are omitted here to keep
    # the additive structure front and center.)
    attn_out = sum(head(residual_stream) for head in attn_heads)
    residual_stream = residual_stream + attn_out  # additive update
    mlp_out = mlp(residual_stream)
    residual_stream = residual_stream + mlp_out   # another additive update
    return residual_stream
First thing to notice: every component reads from the residual stream and writes to the residual stream. Nothing replaces it. Every layer is purely additive.
That means the residual stream at layer L is literally:
residual_stream_L = embedding + sum(component_outputs_so_far)
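A toy check of that claim; the components below are hypothetical random linear maps standing in for heads and MLPs, but nothing depends on what they compute, only on the additive updates:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 16

def make_component():
    # Hypothetical stand-in for an attention head or MLP: any function of the
    # stream qualifies, as long as its output is *added* back to the stream.
    W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    return lambda x: x @ W

embedding = rng.normal(size=(seq_len, d_model))
components = [make_component() for _ in range(6)]

residual_stream = embedding.copy()
contributions = []
for component in components:
    out = component(residual_stream)          # read from the stream
    contributions.append(out)
    residual_stream = residual_stream + out   # additive write back

# The stream at the end really is the embedding plus every write made so far.
assert np.allclose(residual_stream, embedding + sum(contributions))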
Each token position has a vector of dimension d_model (typically 768, 4096, 12288). Different subspaces of that vector carry different information. The embedding writes into position-and-token-related directions; an attention head can write into a “previous-token-was-a-noun” direction; an MLP neuron can write into a “this-is-medical-text” direction.
# The residual stream is a shared scratchpad with many more usable
# directions than it has dimensions. Different writers and readers don't
# have to coordinate explicitly; they just need to use almost-orthogonal
# directions.
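Concretely, writing a feature is adding a scaled direction, and reading it back is a dot product. A toy sketch with hypothetical, randomly chosen feature directions (in a real model they are learned, not random):

import numpy as np

rng = np.random.default_rng(0)
d_model = 4096

def random_direction():
    v = rng.normal(size=d_model)
    return v / np.linalg.norm(v)

prev_token_was_noun = random_direction()    # hypothetical feature direction
this_is_medical_text = random_direction()   # hypothetical feature direction

residual_stream = np.zeros(d_model)
residual_stream += 2.0 * prev_token_was_noun    # an attention head writes
residual_stream += 0.5 * this_is_medical_text   # an MLP neuron writes

# Later components read a feature with a dot product; the other write only
# leaks in through the (tiny) overlap between the two directions.
print(residual_stream @ prev_token_was_noun)    # ~2.0
print(residual_stream @ this_is_medical_text)   # ~0.5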
The geometry trick: in a 4096-dimensional space you can pack only 4096 mutually orthogonal unit vectors, but you can pack exponentially many almost-orthogonal ones (a consequence of the Johnson–Lindenstrauss lemma). That's why a 4096-dim residual stream can carry hundreds of thousands of distinct features with only slight interference. This is the phenomenon Anthropic called superposition (Elhage et al., 2022).
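A quick numerical sanity check of that claim (illustrative numbers only): sample far more random unit vectors than the space has dimensions and measure how much they overlap.

import numpy as np

rng = np.random.default_rng(0)
d, n = 4096, 20_000              # 20k candidate "features" in a 4096-dim stream

V = rng.normal(size=(n, d)).astype(np.float32)
V /= np.linalg.norm(V, axis=1, keepdims=True)    # n random unit vectors

# Interference between two features is the |dot product| of their directions.
# Checking all ~2e8 pairs is overkill, so sample 50k random pairs.
i, j = rng.integers(0, n, size=(2, 50_000))
mask = i != j                                    # drop accidental self-pairs
overlap = np.abs(np.einsum("kd,kd->k", V[i[mask]], V[j[mask]]))
print(f"mean |dot| = {overlap.mean():.3f}, max |dot| = {overlap.max():.3f}")
# Expect a mean around 0.01 and a max well under 0.1: far more directions than
# the dimension count coexist with only slight pairwise interference.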
# Practical consequence for interpretability:
# When you do PCA or a linear probe on activations, you almost
# never recover ONE clean feature per direction. You recover linear
# combinations. Sparse autoencoders (SAEs) try to undo this by
# finding an overcomplete basis where features ARE one-per-direction.
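A minimal sketch of that idea in PyTorch; this is not any particular published SAE (real ones add details like decoder-norm constraints and resampling of dead features), just the core shape of the method:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Learn an overcomplete dictionary: d_dict candidate directions for a
    # d_model-wide stream, with each residual vector reconstructed from only
    # a few active features.
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, resid):
        features = torch.relu(self.encoder(resid))   # non-negative feature activations
        reconstruction = self.decoder(features)      # sum_i features_i * direction_i
        return reconstruction, features

def sae_loss(reconstruction, resid, features, l1_coeff=1e-3):
    # Reconstruct the stream faithfully while keeping few features active per
    # token; the L1 penalty is what pushes toward one-feature-per-direction.
    mse = ((reconstruction - resid) ** 2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

sae = SparseAutoencoder(d_model=768, d_dict=16 * 768)   # 16x overcomplete
resid_batch = torch.randn(32, 768)                      # stand-in stream vectors
recon, feats = sae(resid_batch)
sae_loss(recon, resid_batch, feats).backward()

Trained on real residual-stream activations instead of random noise, the columns of the decoder become the candidate feature directions.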
To probe the residual stream (a combined sketch of all three follows):
# Activation patching: swap in the residual stream from a contrasting prompt
# at layer L, see what changes downstream. Tells you "this information was
# needed here."
# Logit lens: project the residual stream at layer L through the final
# LayerNorm and the unembedding. Tells you "what would the model predict
# if we stopped here."
# Direct logit attribution: the unembedding is linear, so the final logits
# split into one term per component that wrote into the stream; see which
# components push toward the eventual prediction. Lets you say "head X.Y
# contributes 15% of the answer's logit."
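Here is a combined sketch of all three probes. It assumes the TransformerLens API (HookedTransformer, run_with_cache, hook names like blocks.{L}.hook_resid_pre); the prompt pair, layer choice, and the crude LayerNorm handling in the attribution step are illustrative assumptions, so treat it as a sketch to adapt rather than a recipe.

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer = 6

# Prompt pair chosen so both tokenize to the same length (patching needs that).
clean = model.to_tokens("The capital of France is")
corrupt = model.to_tokens("The capital of Italy is")
clean_logits, clean_cache = model.run_with_cache(clean)
_, corrupt_cache = model.run_with_cache(corrupt)
paris = model.to_single_token(" Paris")

# Activation patching: overwrite the residual stream at one layer with the
# corrupted run's stream and see how the clean prediction changes downstream.
def patch_resid(resid, hook):
    return corrupt_cache[hook.name]

patched_logits = model.run_with_hooks(
    clean, fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", patch_resid)]
)
print("clean ' Paris' logit:  ", clean_logits[0, -1, paris].item())
print("patched ' Paris' logit:", patched_logits[0, -1, paris].item())

# Logit lens: apply the final LayerNorm and the unembedding to an intermediate
# residual stream, as if the model stopped at `layer`.
resid = clean_cache["resid_post", layer][0, -1]              # last position
early_logits = model.ln_final(resid) @ model.W_U + model.b_U
print(f"layer {layer} guess:", model.tokenizer.decode(early_logits.argmax().item()))

# Direct logit attribution: the final stream is a sum of component outputs and
# the unembedding is linear, so the answer logit splits into per-component terms.
contributions = [clean_cache["embed"], clean_cache["pos_embed"]]
labels = ["embed", "pos_embed"]
for l in range(model.cfg.n_layers):
    contributions += [clean_cache["attn_out", l], clean_cache["mlp_out", l]]
    labels += [f"attn_{l}", f"mlp_{l}"]
stack = torch.stack([c[0, -1] for c in contributions])       # [component, d_model]
# Crude approximation: ignore LayerNorm centering and treat its scale (computed
# once from the full final-layer stream) as a constant.
scale = clean_cache["resid_post", model.cfg.n_layers - 1][0, -1].std()
per_component = (stack / scale) @ model.W_U[:, paris]
for label, value in sorted(zip(labels, per_component.tolist()), key=lambda kv: -kv[1])[:5]:
    print(f"{label}: {value:+.2f}")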
The mental model worth carrying: the residual stream is not an “activation” — it’s a channel. The transformer is many components reading from and writing to the same channel, like an old party-line phone system. Once you see it that way, things like induction circuits and feature directions stop being mysterious — they’re just protocols on the channel.
See Anthropic’s “A Mathematical Framework for Transformer Circuits” (Elhage et al., 2021) for the canonical formalization. Neel Nanda’s interpretability tutorials walk through the code path concretely. For superposition, “Toy Models of Superposition” (Elhage et al., 2022) is the must-read. For the SAE thread that followed, see Cunningham et al. (2023) and Anthropic’s “Towards Monosemanticity” (2023).