Stacking Transformer Layers

LLM Architecture Series – Lesson 16 of 20. A single transformer block is powerful, but modern LLMs use many of them in sequence.

Each additional layer can capture longer-range patterns and refine the representations produced by earlier layers.
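A minimal sketch of the idea: stacking means applying the same kind of block repeatedly, each one refining the previous output. Here `toy_block` is a hypothetical stand-in for a real transformer block, not the actual computation.

```python
# Toy sketch of layer stacking. `toy_block` is a made-up placeholder:
# a real block would apply attention and an MLP.
def toy_block(x):
    # Nudge each value toward the running average of the vector.
    mean = sum(x) / len(x)
    return [v + 0.1 * mean for v in x]

def stack(x, n_layers):
    # Deeper models simply chain more blocks in sequence.
    for _ in range(n_layers):
        x = toy_block(x)
    return x

hidden = [1.0, -1.0, 0.5]
out = stack(hidden, n_layers=4)
```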

(more…)

The Complete Transformer Block

LLM Architecture Series – Lesson 15 of 20. Now we bring all the familiar components together into the standard transformer block.

Each block contains attention, MLP, layer norms, and residual paths wired in a specific order.
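The wiring can be sketched in a few lines. This assumes the common pre-norm ordering (normalise, transform, add back); `attention` and `mlp` are toy placeholders, not real implementations.

```python
# Pre-norm transformer block sketch with placeholder sublayers.
def layer_norm(x):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + 1e-5) ** 0.5 for v in x]

def attention(x):
    # Placeholder: a real attention layer mixes across positions.
    return [0.5 * v for v in x]

def mlp(x):
    # Placeholder: a real MLP is two linear layers with a nonlinearity.
    return [max(0.0, v) for v in x]

def transformer_block(x):
    # Sublayer 1: attention, wrapped in layer norm and a residual add.
    x = [a + b for a, b in zip(x, attention(layer_norm(x)))]
    # Sublayer 2: MLP, wrapped the same way.
    x = [a + b for a, b in zip(x, mlp(layer_norm(x)))]
    return x

out = transformer_block([1.0, 2.0, 3.0])
```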

(more…)

Residual Connections

LLM Architecture Series – Lesson 14 of 20. At this point, the attention and MLP layers are doing the heavy lifting. Residual connections make sure information can flow easily through many layers.

By adding the input of a block back to its output, residual paths help gradients move during training and preserve useful signals.
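In code, the pattern is just an addition. The sketch below uses a deliberately weak transformation `f` (made-up values) to show that the input survives almost unchanged through the residual path.

```python
# Residual connection: output = input + transformation(input).
def f(x):
    # A weak transformation: contributes very little on its own.
    return [0.01 * v for v in x]

def residual_block(x):
    # The identity path carries x through even if f learns nothing useful.
    return [a + b for a, b in zip(x, f(x))]

x = [1.0, 2.0]
y = residual_block(x)
```

Because the output is `x + f(x)`, the gradient of the output with respect to the input always contains an identity term, which is what keeps gradients flowing through deep stacks.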

(more…)

Feed-Forward Networks (MLP)

LLM Architecture Series – Lesson 13 of 20. After attention and projection, we pass each position through a feed-forward network.

This multilayer perceptron applies the same small neural network to every position independently, adding powerful non-linear transformations.
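A sketch of the position-wise idea, with made-up toy weights: the same two-layer network (expand, nonlinearity, contract) runs on every position's vector separately.

```python
# Position-wise feed-forward sketch. W1 expands 2 -> 3, W2 contracts 3 -> 2.
# Weights are toy values, not learned parameters.
def relu(v):
    return [max(0.0, u) for u in v]

def matvec(W, v):
    return [sum(w * u for w, u in zip(row, v)) for row in W]

W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

def ffn(v):
    return matvec(W2, relu(matvec(W1, v)))

sequence = [[1.0, -1.0], [0.5, 0.5]]
# The SAME ffn is applied to each position independently.
out = [ffn(v) for v in sequence]
```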

(more…)

The Projection Layer

LLM Architecture Series – Lesson 12 of 20. The attention heads produce outputs that must be merged and projected back into the model's hidden space.

This is done by a learned linear projection that mixes information from all heads into a single vector per position.
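A sketch of that merge for a single position, with toy numbers: the per-head outputs are concatenated, then a projection matrix (often written W_O) mixes them into one hidden-size vector.

```python
# Output projection sketch: concatenate head outputs, apply a linear map.
# All values here are made up for illustration.
def matvec(W, v):
    return [sum(w * u for w, u in zip(row, v)) for row in W]

head_outputs = [[1.0, 0.0], [0.0, 2.0]]      # two heads, head_dim = 2
concat = head_outputs[0] + head_outputs[1]   # length = n_heads * head_dim = 4

W_O = [[0.5, 0.0, 0.0, 0.5],                 # 4 -> 4 projection that mixes
       [0.0, 0.5, 0.5, 0.0],                 # information across heads
       [0.25, 0.25, 0.25, 0.25],
       [1.0, 0.0, 0.0, -1.0]]

projected = matvec(W_O, concat)              # one vector per position
```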

(more…)

Attention Scores and Softmax

LLM Architecture Series – Lesson 11 of 20. We now have raw attention scores between tokens. These scores must be turned into normalised weights.

The softmax over attention scores produces a probability distribution that says how much each token should influence the current position.
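The softmax itself is short enough to show directly; the max-subtraction is the standard trick to avoid overflow and does not change the result.

```python
import math

# Softmax over one row of raw attention scores.
def softmax(scores):
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2.0, 1.0, 0.1])
# The weights are positive, sum to 1, and preserve the score ordering.
```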

(more…)

Causal Masking – Preventing Future Leakage

LLM Architecture Series – Lesson 10 of 20. The attention mechanism is powerful, but for language modeling we must prevent the model from peeking at future tokens.

Causal masking enforces this rule so that the model is always predicting the next token only from tokens that came before.
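A minimal sketch of the mask: before the softmax, every score where the attended position comes after the current one is set to negative infinity, so it receives zero weight.

```python
# Causal mask sketch: position i may only attend to positions j <= i.
NEG_INF = float("-inf")

def causal_mask(scores):
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
            for i in range(n)]

raw = [[1.0, 2.0, 3.0],
       [1.0, 2.0, 3.0],
       [1.0, 2.0, 3.0]]
masked = causal_mask(raw)
# Row 0 keeps only its own score; each later row sees one more position.
```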

(more…)

Query, Key, Value in Attention

LLM Architecture Series – Lesson 9 of 20. Multi-head attention relies on three sets of vectors called queries, keys, and values.

These vectors control how positions compare to each other and how information flows across the sequence.
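As a sketch with toy vectors: the raw score between two positions is the dot product of one position's query with the other's key, usually scaled by the square root of the key dimension.

```python
import math

# Scaled dot-product scores from toy query and key vectors.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

queries = [[1.0, 0.0], [0.0, 1.0]]   # made-up values, one per position
keys    = [[1.0, 1.0], [0.0, 2.0]]
d_k = 2                              # key dimension

# scores[i][j]: how strongly position i's query matches position j's key
scores = [[dot(q, k) / math.sqrt(d_k) for k in keys] for q in queries]
```

The values are not used until after the softmax: the normalised scores decide how much of each position's value vector flows into the output.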

(more…)

Understanding Self-Attention – Part 1

LLM Architecture Series – Lesson 7 of 20. So far each position has a clean, normalised embedding. Now the model needs a way for each position to look at other positions.

Self-attention lets every token decide which other tokens are important for predicting the next output.
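The core effect can be illustrated in a few lines, with made-up weights and values: a position's new representation is a weighted mix of every position's value vector, where the weights come from the attention mechanism.

```python
# Illustration of attention as a weighted mix. The weights and value
# vectors are toy numbers; in a real model the weights come from
# the query/key comparison covered in the next lessons.
values  = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # one vector per position
weights = [0.6, 0.3, 0.1]                        # attention weights, sum to 1

mixed = [sum(w * v[d] for w, v in zip(weights, values))
         for d in range(2)]
# `mixed` blends information from all three positions into one vector.
```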

(more…)