LLM Architecture Series – Lesson 14 of 20. At this point the attention and MLP layers are doing the heavy lifting; residual connections make sure information can still flow easily through many layers.
By adding a block's input back to its output, residual paths help gradients propagate during training and preserve useful signals.

Visualization from bbycroft.net/llm augmented by Nano Banana.

The Skip Connection
Residual connections (or skip connections) add the input of a layer directly to its output:
output = input + SubLayer(input)
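A minimal PyTorch sketch of this pattern (the ResidualBlock wrapper and the layer sizes are illustrative choices, not code from the series):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps any sub-layer with a skip connection: output = input + sublayer(input)."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The sub-layer must return the same shape it receives,
        # so the element-wise addition is valid.
        return x + self.sublayer(x)

# Usage: wrap a small MLP in a residual connection.
block = ResidualBlock(nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)))
x = torch.randn(2, 10, 64)      # (batch, sequence, hidden)
print(block(x).shape)           # torch.Size([2, 10, 64])
```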
Why Residuals Matter
In deep networks, gradients must flow back through many layers. Without residuals (see the sketch after this list):
- Gradients can vanish (become too small)
- Gradients can explode (become too large)
- Deep networks become untrainable
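To make the failure mode concrete, here is a small experiment. The layer count, width, and tanh activation are arbitrary choices for illustration, not details from the series; with default initialization, the gradient at the input of a deep plain stack typically collapses toward zero, while the residual version keeps it at a usable scale.

```python
import torch
import torch.nn as nn

def input_grad_norm(use_residual: bool, depth: int = 50, dim: int = 64) -> float:
    """Gradient norm at the input of a deep stack of small tanh layers."""
    torch.manual_seed(0)
    layers = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if use_residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

print("without residuals:", input_grad_norm(False))  # typically vanishingly small
print("with residuals:   ", input_grad_norm(True))   # stays at a healthy scale
```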
The Gradient Highway
Residual connections create a “gradient highway” – a direct path for gradients to flow from output to input:
∂output/∂input = 1 + ∂SubLayer/∂input
The “1” ensures gradients never completely vanish.
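A quick autograd check of this identity (the tanh function below is just a stand-in for attention or an MLP sub-layer):

```python
import torch

def sublayer(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for any differentiable sub-layer (attention, MLP, ...).
    return torch.tanh(2.0 * x)

x = torch.randn(4)

# Jacobian of the residual output with respect to the input ...
J_res = torch.autograd.functional.jacobian(lambda v: v + sublayer(v), x)
# ... equals the identity plus the Jacobian of the sub-layer alone.
J_sub = torch.autograd.functional.jacobian(sublayer, x)
print(torch.allclose(J_res, torch.eye(4) + J_sub))  # True
```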
Residuals in Transformers
Each transformer block has two residual connections:
x = x + Attention(LayerNorm(x)) // After attention
x = x + FFN(LayerNorm(x)) // After feed-forward
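Putting those two residual paths into code, here is a pre-norm block sketch. PreNormBlock, the head count, and the 4x FFN width are common conventions assumed for illustration, not details from the series; the complete block is assembled in the next lesson.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm transformer block: one residual path around attention, one around the FFN."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1: after attention
        x = x + self.ffn(self.norm2(x))                    # residual 2: after feed-forward
        return x

x = torch.randn(2, 10, 64)
print(PreNormBlock()(x).shape)  # torch.Size([2, 10, 64])
```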
Enabling Depth
Residual connections are a key reason we can train transformers with 96 or more layers. Each layer only has to learn a small refinement on top of the signal it receives, because the original information is preserved along the skip path.
Series Navigation
Previous: Feed-Forward MLP
Next: The Complete Transformer Block
This article is part of the LLM Architecture Series. Interactive visualizations from bbycroft.net/llm.
Analogy and intuition
Residual connections are like a fast highway that runs in parallel with smaller local roads. Even if some side roads are slow or noisy, the highway keeps traffic moving.
They allow the network to learn corrections on top of an identity mapping, which is much easier than learning a good transformation from scratch.
Looking ahead
Next we will assemble attention, MLP, normalization, and residual paths into a full transformer block.
