LLM Architecture Series – Lesson 15 of 20. Now we bring all the familiar components together into the standard transformer block.
Each block contains attention, an MLP, layer norms, and residual paths, wired in a specific order.

Visualization from bbycroft.net/llm augmented by Nano Banana.

Putting It All Together
A transformer block combines all the components we have discussed into a single unit. GPT models stack multiple identical blocks to create deep networks.
The Block Architecture
Each transformer block contains (in order):
- Layer Norm (Pre-normalization)
- Multi-Head Self-Attention
- Residual Connection
- Layer Norm
- Feed-Forward Network (MLP)
- Residual Connection
The Complete Formula
h = x + MultiHeadAttention(LayerNorm(x))
output = h + FFN(LayerNorm(h))
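As a rough sketch of this wiring in PyTorch (an assumption; the series does not ship code), here is a minimal pre-LN block using nn.MultiheadAttention for the attention sublayer and a GELU MLP, with the dimensions of nano-gpt. The causal mask a GPT block would apply is omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LN transformer block: h = x + Attn(LN(x)); output = h + FFN(LN(h))."""

    def __init__(self, d_model: int = 48, n_heads: int = 3, expansion: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),
            nn.GELU(),
            nn.Linear(expansion * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sublayer: normalize, attend, add back onto the residual path.
        a = self.ln1(x)
        attn_out, _ = self.attn(a, a, a, need_weights=False)
        h = x + attn_out
        # Feed-forward sublayer: normalize, apply the MLP, add back onto the residual path.
        return h + self.ffn(self.ln2(h))

x = torch.randn(1, 10, 48)          # (batch, sequence, d_model)
print(TransformerBlock()(x).shape)  # torch.Size([1, 10, 48])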
Pre-LN vs Post-LN
GPT uses Pre-LN (LayerNorm before each sublayer), which:
- Makes training more stable
- Allows larger learning rates
- Enables training very deep networks
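The only difference between the two variants is where the normalization sits relative to the residual addition. A minimal sketch of the two orderings, assuming sublayer and norm are arbitrary callables (e.g. an attention module and a LayerNorm):

```python
# Pre-LN (GPT style): normalize the input to the sublayer. The residual path
# from x to the output is a clean identity, which keeps gradients well behaved
# and lets very deep stacks train with larger learning rates.
def pre_ln_step(x, sublayer, norm):
    return x + sublayer(norm(x))

# Post-LN (original Transformer): normalize after the residual addition, so
# the normalization sits on the main path; deep stacks are harder to train
# and typically need careful learning-rate warmup.
def post_ln_step(x, sublayer, norm):
    return norm(x + sublayer(x))
```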
Parameters Per Block
For nano-gpt (d=48, 3 heads, 4x expansion):
- Attention: 4 × 48 × 48 = 9,216 parameters
- FFN: 2 × 48 × 192 = 18,432 parameters
- LayerNorm: 2 × 2 × 48 = 192 parameters
- Total per block: ~28,000 parameters
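These counts ignore bias terms in the attention and FFN linear layers (the LayerNorm scale and shift vectors are counted). A quick sanity check of the arithmetic:

```python
d, expansion = 48, 4

attn = 4 * d * d               # Q, K, V, and output projections (weights only);
                               # the head count does not change this total
ffn = 2 * d * (expansion * d)  # two linear layers: d -> 4d and 4d -> d (weights only)
ln = 2 * 2 * d                 # two LayerNorms, each with a scale and a shift vector
total = attn + ffn + ln

print(attn, ffn, ln, total)    # 9216 18432 192 27840  (~28,000 per block)
```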
Series Navigation
Previous: Residual Connections
Next: Stacking Transformer Layers
This article is part of the LLM Architecture Series. Interactive visualizations from bbycroft.net/llm.
Analogy and Intuition
You can think of a transformer block as one full reasoning step. Information flows in, attends to other positions, is processed by the MLP, and is carried forward along the residual paths.
Stacking many such blocks lets the model refine its understanding over multiple passes, similar to revisiting the same text with deeper insight each time.
Looking Ahead
Next we will see what happens when we stack many transformer blocks on top of each other.
