LLM Architecture Series – Lesson 12 of 20. The attention heads produce outputs that must be merged and projected back into the model's hidden space.
This is done by a learned linear projection that mixes information from all heads into a single vector per position.

Visualization from bbycroft.net/llm, augmented by Nano Banana.

After Attention: The Output Projection
After multi-head attention computes its outputs, we need to combine the results from all heads back into a single representation. This is done by the projection layer.
Concatenate and Project
The multi-head attention output is formed in two steps:
- Concatenate all head outputs along the feature dimension
- Project the concatenated vector through a learned linear layer W_O
Output = Concat(head_1, …, head_h) · W_O
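As a rough PyTorch sketch of this step (the batch size, variable names, and random head outputs are illustrative, not taken from the nano-gpt source):

```python
import torch
import torch.nn as nn

# Illustrative dimensions: 3 heads of width 16, giving a model width of 48.
batch, seq_len, n_heads, head_dim = 1, 11, 3, 16
d_model = n_heads * head_dim  # 48

# Hypothetical per-head attention results (random stand-ins for real outputs).
head_outputs = [torch.randn(batch, seq_len, head_dim) for _ in range(n_heads)]

# Step 1: concatenate along the feature dimension -> (batch, seq_len, d_model).
concat = torch.cat(head_outputs, dim=-1)

# Step 2: the learned output projection W_O mixes information across heads.
w_o = nn.Linear(d_model, d_model)
output = w_o(concat)

print(concat.shape, output.shape)  # torch.Size([1, 11, 48]) torch.Size([1, 11, 48])
```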
Dimension Flow
In nano-gpt with 3 heads and d=48:
- Each head output: 16 dimensions
- Concatenated: 48 dimensions
- After projection: 48 dimensions (same as input)
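A shape walk-through of that flow, assuming the per-head results are kept in a single tensor as most implementations do (the batch size is again just a placeholder):

```python
import torch

# nano-gpt-style dimensions: 3 heads of width 16, model width 48.
batch, seq_len, n_heads, head_dim = 1, 11, 3, 16
d_model = n_heads * head_dim  # 48, same as the block's input width

# Per-head attention results in one tensor: (batch, n_heads, seq_len, head_dim).
per_head = torch.randn(batch, n_heads, seq_len, head_dim)

# Move the head axis next to the feature axis and merge them: (batch, seq_len, 48).
merged = per_head.transpose(1, 2).reshape(batch, seq_len, d_model)

assert merged.shape == (batch, seq_len, d_model)
```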
The Residual Connection
The projection output is added to the original input (residual connection):
output = input + Attention(LayerNorm(input))
This allows gradients to flow directly through the network and helps train very deep models.
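A minimal sketch of this pre-norm residual wiring; here `nn.MultiheadAttention` stands in for the attention block described above and is not the actual nano-gpt code:

```python
import torch
import torch.nn as nn

d_model = 48
layer_norm = nn.LayerNorm(d_model)
attention = nn.MultiheadAttention(d_model, num_heads=3, batch_first=True)

x = torch.randn(1, 11, d_model)  # (batch, seq_len, d_model)

# Pre-norm: normalize first, then run attention (which ends in the W_O projection).
normed = layer_norm(x)
attn_out, _ = attention(normed, normed, normed, need_weights=False)

# Residual connection: the block's output is the input plus the attention update.
output = x + attn_out
print(output.shape)  # torch.Size([1, 11, 48])
```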
In the Visualization
The green arrows show how attention output flows through the projection layer and combines with the residual path before entering the next component.
Series Navigation
Previous: Attention Softmax
Next: Feed-Forward Networks (MLP)
This article is part of the LLM Architecture Series. Interactive visualizations from bbycroft.net/llm.
Analogy and Intuition
Think of the projection layer as a mixing desk that takes many specialist channels and blends them into a single track that the next stage can understand.
Although this step is mathematically simple, it matters: the projection determines how information from the different attention heads is combined.
Looking Ahead
Next we will explore the feed-forward network, a small multilayer perceptron that adds non-linear processing on top of attention.
