Layer Normalization in Transformers

LLM Architecture Series – Lesson 6 of 20. At this point, each position carries a combined token-and-position embedding. Before the attention and feed-forward layers, the model applies layer normalization.

Layer normalization rescales activations so that the following layers behave in a stable and predictable way.

Figure: Layer normalization, annotated. Visualization from bbycroft.net/llm, augmented with Nano Banana.

What is Layer Normalization?

Layer Normalization (LayerNorm) is a technique that stabilizes training by normalizing the inputs to each layer. It ensures that the values flowing through the network stay in a reasonable range.

The LayerNorm Formula

For an input vector x with d dimensions:

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β

Where:

  • μ = mean of x across dimensions
  • σ² = variance of x across dimensions
  • γ, β = learned scale and shift parameters
  • ε = small constant for numerical stability
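
As a quick sanity check of the formula, here is a minimal NumPy sketch; the function name and the toy vector are illustrative, not taken from any particular library:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Apply the LayerNorm formula to a single d-dimensional activation vector."""
    mu = x.mean()                        # μ: mean across the d dimensions
    var = x.var()                        # σ²: variance across the d dimensions
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # γ scales, β shifts (both learned)

# Toy 4-dimensional activation vector (made-up numbers)
x = np.array([2.0, -1.0, 0.5, 3.5])
gamma = np.ones(4)    # γ is typically initialized to 1
beta = np.zeros(4)    # β is typically initialized to 0

y = layer_norm(x, gamma, beta)
print(y.mean(), y.std())  # ≈ 0 and ≈ 1 after normalization
```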

Where LayerNorm Appears

In GPT-style transformers, LayerNorm appears in three places (sketched in code after this list):

  1. Before self-attention (Pre-LN architecture)
  2. Before the feed-forward network
  3. After the final transformer block
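
To make this placement concrete, the sketch below shows a simplified Pre-LN block in PyTorch. The class name, dimensions, and layer choices are illustrative assumptions rather than any specific model's code, and the causal attention mask is omitted for brevity:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Simplified GPT-style Pre-LN block: normalize *before* each sub-layer."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)   # 1. before self-attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)   # 2. before the feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                   # residual connection around attention
        x = x + self.ffn(self.ln2(x))      # residual connection around the FFN
        return x

# 3. A final LayerNorm is applied once, after the last block:
# final_ln = nn.LayerNorm(768)
```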

Why It Matters

Without normalization:

  • Activations can grow unboundedly large
  • Gradients can vanish or explode
  • Training becomes unstable

LayerNorm keeps activations within a predictable range, which is what makes training very deep networks feasible.

LayerNorm vs BatchNorm

Unlike Batch Normalization, Layer Normalization:

  • Normalizes across features, not batch
  • Works identically at training and inference
  • Is independent of batch size
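
A small tensor sketch (toy numbers of our own) makes the axis difference explicit:

```python
import torch

# Toy batch: 3 examples, 4 features each (made-up numbers)
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0],
                  [0.0, 1.0, 0.0, 1.0]])

# LayerNorm: one mean/variance per example, computed across the features
ln_mean = x.mean(dim=-1, keepdim=True)   # shape (3, 1)

# BatchNorm: one mean/variance per feature, computed across the batch
bn_mean = x.mean(dim=0, keepdim=True)    # shape (1, 4)

print(ln_mean.shape, bn_mean.shape)      # torch.Size([3, 1]) torch.Size([1, 4])
```

Because LayerNorm's statistics depend only on each individual example, the result is the same whether the batch holds one sequence or a thousand.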

Series Navigation

Previous: Input Embedding

Next: Understanding Self-Attention – Part 1


This article is part of the LLM Architecture Series. Interactive visualizations from bbycroft.net/llm.

Analogy and intuition

Layer normalization is like a sound engineer who constantly adjusts volume levels so that the music is neither too loud nor too quiet for the speakers.

Without this control, some neurons could saturate while others never activate, which would make learning and inference unstable.

Looking ahead

With stable activations, we are ready to introduce the key idea of transformers: the self-attention mechanism.
