LLM Architecture Series – Lesson 6 of 20. At this point each position has a combined embedding. Before the attention and feed-forward layers, the model typically applies layer normalisation.
Layer normalisation rescales activations so that the following layers behave in a stable and predictable way.

Visualisation from bbycroft.net/llm augmented by Nano Banana.

What is Layer Normalisation?
Layer Normalisation (LayerNorm) is a technique that stabilises training by normalising the inputs to each layer. It ensures that the values flowing through the network stay in a reasonable range.
The LayerNorm Formula
For an input vector x with d dimensions:
LayerNorm(x) = γ · (x – μ) / √(σ² + ε) + β
Where:
- μ = mean of x across dimensions
- σ² = variance of x across dimensions
- γ, β = learned scale and shift parameters
- ε = small constant for numerical stability
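The formula above translates almost directly into code. Here is a minimal numpy sketch (not a production implementation) that normalises a vector across its feature dimension and then applies the learned scale and shift:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Apply LayerNorm(x) = γ · (x – μ) / √(σ² + ε) + β across the last axis."""
    mu = x.mean(axis=-1, keepdims=True)        # μ: mean across dimensions
    var = x.var(axis=-1, keepdims=True)        # σ²: variance across dimensions
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalised activations
    return gamma * x_hat + beta                # learned scale γ and shift β

x = np.array([2.0, 4.0, 6.0, 8.0])
gamma = np.ones(4)   # γ is typically initialised to 1
beta = np.zeros(4)   # β is typically initialised to 0
y = layer_norm(x, gamma, beta)
print(y.mean(), y.std())  # approximately 0 and 1
```

With γ = 1 and β = 0, the output has roughly zero mean and unit variance regardless of the scale of the input; the learned parameters then let the network restore any scale or offset that is actually useful.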
Where LayerNorm Appears
In GPT-style transformers, LayerNorm appears:
- Before self-attention (Pre-LN architecture)
- Before the feed-forward network
- After the final transformer block
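The placements above can be sketched as a Pre-LN transformer block. This is a simplified illustration, not a full implementation: the `attention` and `feed_forward` functions here are hypothetical placeholders for the real sub-layers, and γ/β are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise across the feature dimension (learned γ, β omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def pre_ln_block(x, attention, feed_forward):
    # Pre-LN: normalise *before* each sub-layer, then add a residual connection.
    x = x + attention(layer_norm(x))       # LayerNorm before self-attention
    x = x + feed_forward(layer_norm(x))    # LayerNorm before the feed-forward network
    return x

# Hypothetical stand-ins for the real sub-layers:
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
attention = lambda h: h @ W                     # placeholder for self-attention
feed_forward = lambda h: np.maximum(h @ W, 0)   # placeholder for the FFN

x = rng.normal(size=(4, 8))                # 4 positions, 8-dimensional embeddings
out = pre_ln_block(x, attention, feed_forward)
final = layer_norm(out)                    # final LayerNorm after the last block
```

Note that the residual path itself is never normalised in Pre-LN; each sub-layer only sees a normalised copy of its input, which is part of why this arrangement trains stably in deep stacks.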
Why It Matters
Without normalisation:
- Activations can grow unboundedly large
- Gradients can vanish or explode
- Training becomes unstable
LayerNorm keeps everything in check, enabling training of very deep networks.
LayerNorm vs BatchNorm
Unlike Batch Normalisation, Layer Normalisation:
- Normalises across features, not batch
- Works identically at training and inference
- Is independent of batch size
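The difference between the two comes down to which axis the statistics are computed over. A small numpy sketch makes this concrete (the example values are illustrative):

```python
import numpy as np

x = np.array([[ 1.0,  2.0,  3.0],
              [10.0, 20.0, 30.0]])      # batch of 2 examples, 3 features each

# LayerNorm: statistics per example, computed across the feature axis.
ln_mu = x.mean(axis=1, keepdims=True)   # shape (2, 1): one mean per example
# BatchNorm: statistics per feature, computed across the batch axis.
bn_mu = x.mean(axis=0, keepdims=True)   # shape (1, 3): one mean per feature

print(ln_mu.ravel())  # [ 2. 20.]      — each example's mean ignores the rest of the batch
print(bn_mu.ravel())  # [ 5.5 11. 16.5] — each feature's mean depends on every example
```

Because LayerNorm's statistics depend only on the example itself, it behaves the same whether the batch contains one sequence or a thousand, which is why it works identically at training and inference time.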
Series Navigation
Previous: Input Embedding
Next: Understanding Self-Attention – Part 1
This article is part of the LLM Architecture Series. Interactive visualisations from bbycroft.net/llm.
Analogy and intuition
Layer normalisation is like a sound engineer who constantly adjusts volume levels so that the music is neither too loud nor too quiet for the speakers.
Without this control, some neurons could saturate while others never activate, which would make learning and inference unstable.
Looking ahead
With stable activations we are ready to introduce the key idea of transformers: the self-attention mechanism.
