LLM Architecture Series – Lesson 6 of 20. At this point each position has a combined embedding. Before the attention and feed-forward layers, the model typically applies layer normalisation.
Layer normalisation rescales activations so that the following layers behave in a stable and predictable way.

Visualisation from bbycroft.net/llm augmented by Nano Banana.

What is Layer Normalisation?
Layer Normalisation (LayerNorm) is a technique that stabilises training by normalising the inputs to each layer. It ensures that the values flowing through the network stay in a reasonable range.
The LayerNorm Formula
For an input vector x with d dimensions:
LayerNorm(x) = γ · (x – μ) / √(σ² + ε) + β
Where:
- μ = mean of x across dimensions
- σ² = variance of x across dimensions
- γ, β = learned scale and shift parameters
- ε = small constant for numerical stability
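The formula above translates almost directly into code. Here is a minimal numpy sketch (not a production implementation) that normalises a vector across its feature dimension and then applies the learned scale and shift:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Apply LayerNorm(x) = γ · (x – μ) / √(σ² + ε) + β across the last axis."""
    mu = x.mean(axis=-1, keepdims=True)        # μ: mean across dimensions
    var = x.var(axis=-1, keepdims=True)        # σ²: variance across dimensions
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalised activations
    return gamma * x_hat + beta                # learned scale γ and shift β

x = np.array([2.0, 4.0, 6.0, 8.0])
gamma = np.ones(4)   # γ is typically initialised to 1
beta = np.zeros(4)   # β is typically initialised to 0
y = layer_norm(x, gamma, beta)
print(y.mean(), y.std())  # approximately 0 and 1
```

With γ = 1 and β = 0, the output has roughly zero mean and unit variance regardless of the scale of the input; the learned parameters then let the network restore any scale or offset that is actually useful.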
Where LayerNorm Appears
In GPT-style transformers, LayerNorm appears:
- Before self-attention (Pre-LN architecture)
- Before the feed-forward network
- After the final transformer block
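The placements above can be sketched as a Pre-LN transformer block. This is a simplified illustration, not a full implementation: the `attention` and `feed_forward` functions here are hypothetical placeholders for the real sub-layers, and γ/β are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise across the feature dimension (learned γ, β omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def pre_ln_block(x, attention, feed_forward):
    # Pre-LN: normalise *before* each sub-layer, then add a residual connection.
    x = x + attention(layer_norm(x))       # LayerNorm before self-attention
    x = x + feed_forward(layer_norm(x))    # LayerNorm before the feed-forward network
    return x

# Hypothetical stand-ins for the real sub-layers:
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
attention = lambda h: h @ W                     # placeholder for self-attention
feed_forward = lambda h: np.maximum(h @ W, 0)   # placeholder for the FFN

x = rng.normal(size=(4, 8))                # 4 positions, 8-dimensional embeddings
out = pre_ln_block(x, attention, feed_forward)
final = layer_norm(out)                    # final LayerNorm after the last block
```

Note that the residual path itself is never normalised in Pre-LN; each sub-layer only sees a normalised copy of its input, which is part of why this arrangement trains stably in deep stacks.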
Why It Matters
Without normalisation:
- Activations can grow unboundedly large
- Gradients can vanish or explode
- Training becomes unstable
LayerNorm keeps everything in check, enabling training of very deep networks.
LayerNorm vs BatchNorm
Unlike Batch Normalisation, Layer Normalisation:
- Normalises across features, not batch
- Works identically at training and inference
- Is independent of batch size
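The difference between the two comes down to which axis the statistics are computed over. A small numpy sketch makes this concrete (the example values are illustrative):

```python
import numpy as np

x = np.array([[ 1.0,  2.0,  3.0],
              [10.0, 20.0, 30.0]])      # batch of 2 examples, 3 features each

# LayerNorm: statistics per example, computed across the feature axis.
ln_mu = x.mean(axis=1, keepdims=True)   # shape (2, 1): one mean per example
# BatchNorm: statistics per feature, computed across the batch axis.
bn_mu = x.mean(axis=0, keepdims=True)   # shape (1, 3): one mean per feature

print(ln_mu.ravel())  # [ 2. 20.]      — each example's mean ignores the rest of the batch
print(bn_mu.ravel())  # [ 5.5 11. 16.5] — each feature's mean depends on every example
```

Because LayerNorm's statistics depend only on the example itself, it behaves the same whether the batch contains one sequence or a thousand, which is why it works identically at training and inference time.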
Series Navigation
Previous: Input Embedding
Next: Understanding Self-Attention – Part 1
This article is part of the LLM Architecture Series. Interactive visualisations from bbycroft.net/llm.
Analogy and intuition
Layer normalisation is like a sound engineer who constantly adjusts volume levels so that the music is neither too loud nor too quiet for the speakers.
Without this control, some neurons could saturate while others never activate, which would make learning and inference unstable.
Looking ahead
With stable activations we are ready to introduce the key idea of transformers: the self-attention mechanism.
