The Projection Layer

LLM Architecture Series – Lesson 12 of 20. The attention heads produce outputs that must be merged and projected back into the model's hidden space.

This is done by a learned linear projection that mixes information from all heads into a single vector per position.

(more…)
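In NumPy, the merge-and-project step looks something like the sketch below. The shapes and the weight matrix `W_O` are illustrative, not any real model's parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_heads, d_head = 4, 2, 3
d_model = n_heads * d_head

# Per-head attention outputs: (seq_len, n_heads, d_head)
head_outputs = rng.normal(size=(seq_len, n_heads, d_head))

# Concatenate the heads back into one vector per position
concat = head_outputs.reshape(seq_len, d_model)

# A learned output projection W_O mixes information across heads
W_O = rng.normal(size=(d_model, d_model))
projected = concat @ W_O  # (seq_len, d_model)
```

The reshape is just a concatenation of the heads; all of the actual mixing between heads happens in the single matrix multiply by `W_O`.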

Attention Scores and Softmax

LLM Architecture Series – Lesson 11 of 20. We now have raw attention scores between tokens. These scores must be turned into normalized weights.

The softmax over attention scores produces a probability distribution that says how much each token should influence the current position.

(more…)
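A minimal softmax in NumPy, using the standard max-subtraction trick for numerical stability (the input scores here are made up):

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max before exponentiating so large scores don't overflow
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    # Each row now sums to 1: a distribution over the attended tokens
    return exp / exp.sum(axis=axis, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
weights = softmax(scores)
```

Note that the second row, where all scores are equal, becomes a uniform distribution: softmax only cares about differences between scores, not their absolute values.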

Causal Masking – Preventing Future Leakage

LLM Architecture Series – Lesson 10 of 20. The attention mechanism is powerful, but for language modeling we must forbid peeking at future tokens.

Causal masking enforces this rule so that the model is always predicting the next token only from tokens that came before.

(more…)
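The usual implementation is to set future positions to negative infinity before the softmax, which makes their weights exactly zero. A small NumPy sketch with illustrative scores:

```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))

# Boolean upper-triangular mask: position i must not attend to any j > i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked = np.where(mask, -np.inf, scores)

# Softmax over the masked scores: exp(-inf) = 0, so future tokens get zero weight
exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
```

The first row ends up attending only to itself, and every row still sums to one over the positions it is allowed to see.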

Query, Key, Value in Attention

LLM Architecture Series – Lesson 9 of 20. Multi-head attention relies on three sets of vectors called queries, keys, and values.

These vectors control how positions compare to each other and how information flows across the sequence.

(more…)
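In code, queries, keys, and values are just three learned linear projections of the same input. A sketch for a single head, with made-up dimensions and random weights:

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model, d_head = 4, 8, 8

x = rng.normal(size=(seq_len, d_model))

# Three learned projections give each position a query, a key, and a value
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product scores: how strongly each query matches each key
scores = Q @ K.T / np.sqrt(d_head)
```

Intuitively, the query says what a position is looking for, the key says what it offers, and the value is the information that actually gets passed along once the match is scored.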

Understanding Self-Attention – Part 1

LLM Architecture Series – Lesson 7 of 20. So far each position has a clean, normalized embedding. Now the model needs a way for each position to look at other positions.

Self-attention lets every token decide which other tokens are important for predicting the next output.

(more…)
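Here is the whole idea in one toy function: a single unmasked self-attention head in NumPy, with random illustrative weights. Each position's output is a weighted mix of every position's value vector:

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    # Each position's query is compared against every position's key
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns the scores into attention weights per position
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    # Output: a weighted average of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(3)
seq_len, d = 5, 4
x = rng.normal(size=(seq_len, d))
out, weights = self_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
```

Every row of `weights` is a distribution over the sequence: that row is literally the answer to "which other tokens does this position consider important?"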

Layer Normalization in Transformers

LLM Architecture Series – Lesson 6 of 20. At this point each position has a combined embedding. Before the attention or feed-forward layers, the model often applies layer normalization.

Layer normalization rescales activations so that the following layers behave in a stable and predictable way.

(more…)
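Layer normalization is small enough to write from scratch. A NumPy sketch, where `gamma` and `beta` stand in for the learned scale and shift:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's activations to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)
    # Learned scale and shift let the layer undo the normalization if useful
    return gamma * normed + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

The key detail is the axis: statistics are computed per position across the feature dimension, not across the batch, which is what makes it well behaved for variable-length sequences.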

The Combined Input Embedding

LLM Architecture Series – Lesson 5 of 20. You have seen token embeddings and position embeddings. Now we put them together into the actual input that flows into the transformer.

This combined embedding is what every later layer sees for each position in the sequence.

(more…)
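In GPT-style models the combination is a simple element-wise sum of two lookup tables. A sketch with random tables and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab_size, max_len, d_model = 100, 16, 8

tok_emb = rng.normal(size=(vocab_size, d_model))  # one row per token id
pos_emb = rng.normal(size=(max_len, d_model))     # one row per position

token_ids = np.array([12, 45, 7, 45])  # note token 45 appears twice

# The input to the first transformer block: token meaning + position
x = tok_emb[token_ids] + pos_emb[np.arange(len(token_ids))]
```

Because the position rows differ, the two occurrences of token 45 get different combined vectors even though their token embeddings are identical.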

Position Embeddings – Encoding Word Order

LLM Architecture Series – Lesson 4 of 20. So far we know the meaning of each token. Now we teach the model where each token sits in the sequence.

Position embeddings make it possible for the model to tell the difference between subject and object, start and end, and repeated words at different places.

(more…)
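One classic scheme is the sinusoidal encoding from the original Transformer; GPT-style models instead learn a position table. A NumPy sketch of the sinusoidal version:

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    # Each position gets a unique pattern of sines and cosines
    # at geometrically spaced frequencies
    pos = np.arange(max_len)[:, None]     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]  # (1, d_model // 2)
    freqs = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((max_len, d_model))
    enc[:, 0::2] = np.sin(freqs)  # even dimensions: sine
    enc[:, 1::2] = np.cos(freqs)  # odd dimensions: cosine
    return enc

pe = sinusoidal_positions(max_len=10, d_model=8)
```

No two positions share the same vector, which is exactly what lets the model distinguish repeated words at different places in the sequence.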