Attention Scores and Softmax

LLM Architecture Series – Lesson 11 of 20. We now have raw attention scores between tokens. These scores must be turned into normalized weights.

The softmax over attention scores produces a probability distribution that says how much each token should influence the current position.
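As a small sketch of that idea (using NumPy, with made-up scores for illustration), the softmax exponentiates each score and divides by the total, yielding non-negative weights that sum to 1:

```python
import numpy as np

def softmax(scores):
    # Subtract the max score first: this keeps exp() numerically stable
    # without changing the resulting distribution.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw attention scores for one query over four key positions.
scores = np.array([2.0, 1.0, 0.1, -1.0])
weights = softmax(scores)
# weights is a probability distribution: non-negative, summing to 1,
# with the largest weight on the highest-scoring position.
```

The max-subtraction trick is standard practice: it prevents overflow when scores are large while leaving the output unchanged.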

(more…)

Causal Masking – Preventing Future Leakage

LLM Architecture Series – Lesson 10 of 20. The attention mechanism is powerful, but for language modeling we must prevent the model from peeking at future tokens.

Causal masking enforces this rule so that the model is always predicting the next token only from tokens that came before.
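A minimal sketch of how this is commonly done (NumPy, random scores for illustration): future positions get a score of negative infinity before the softmax, so they receive exactly zero weight.

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks the future positions each row must ignore.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    # -inf scores become exp(-inf) = 0, i.e. zero attention weight.
    scores = np.where(mask, -np.inf, scores)
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))       # illustrative raw scores
weights = masked_softmax(scores, causal_mask(4))
# Row i now puts zero weight on every position j > i.
```

Each row still sums to 1, so position i redistributes all of its attention over positions 0..i only.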

(more…)

Query, Key, Value in Attention

LLM Architecture Series – Lesson 9 of 20. Multi-head attention relies on three sets of vectors called queries, keys, and values.

These vectors control how positions compare to each other and how information flows across the sequence.
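To make that concrete (a NumPy sketch with random stand-in weights; in a real model the projection matrices are learned), each position's embedding is projected three ways, and queries are compared against keys via a scaled dot product:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 5   # small illustrative sizes

# Stand-ins for the learned projection matrices.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

x = rng.normal(size=(seq_len, d_model))   # one embedding per position

Q = x @ W_q   # what each position is looking for
K = x @ W_k   # what each position offers for matching
V = x @ W_v   # the content that actually flows between positions

# Scaled dot-product: entry (i, j) scores how much position i
# should attend to position j.
scores = Q @ K.T / np.sqrt(d_head)
```

Dividing by the square root of the head dimension keeps the scores in a reasonable range as the vectors get longer.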

(more…)

Understanding Self-Attention – Part 2 (Multi-Head)

LLM Architecture Series – Lesson 8 of 20. You have seen single-head self-attention. In practice, transformers use many attention heads in parallel.

Each head can focus on a different kind of pattern, such as subject-verb agreement, long-range dependencies, or punctuation structure.
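A bare-bones NumPy sketch of the reshaping involved (projections omitted for brevity, and `split_heads` is an illustrative helper name): the model vector is split into per-head slices, each head attends independently, and the head outputs are concatenated back together.

```python
import numpy as np

def split_heads(x, n_heads):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    return x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, n_heads):
    Qh, Kh, Vh = (split_heads(t, n_heads) for t in (Q, K, V))
    d_head = Qh.shape[-1]
    # Each head computes its own scaled dot-product attention.
    weights = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head))
    out = weights @ Vh                     # (n_heads, seq_len, d_head)
    # Concatenate the heads back into a single (seq_len, d_model) output.
    return out.transpose(1, 0, 2).reshape(Q.shape)

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 8))               # 6 positions, d_model = 8
out = multi_head_attention(x, x, x, n_heads=2)
```

In a full transformer the heads' concatenated output is also passed through one more learned projection, left out here to keep the shape story clear.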

(more…)

Understanding Self-Attention – Part 1

LLM Architecture Series – Lesson 7 of 20. So far each position has a clean, normalized embedding. Now the model needs a way for positions to look at one another.

Self-attention lets every token decide which other tokens are important for predicting the next output.
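The whole single-head mechanism fits in a few lines. This NumPy sketch uses random stand-in weights; a trained model would have learned them:

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    # Every position builds a query, compares it against every key,
    # and takes a weighted average of the values.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                        # 5 positions, d_model = 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)             # same shape as x
```

The output has the same shape as the input, which is what lets transformers stack many of these layers on top of each other.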

(more…)

Layer Normalization in Transformers

LLM Architecture Series – Lesson 6 of 20. At this point each position has a combined embedding. Before the attention and feed-forward layers, the model often applies layer normalization.

Layer normalization rescales activations so that the following layers behave in a stable and predictable way.
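A minimal NumPy version of the operation (the learned scale `gamma` and shift `beta` are set to fixed values here for illustration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance
    # across its features, then apply a learned scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))            # 3 positions, 8 features each
y = layer_norm(x, gamma=1.0, beta=0.0)
# Every row of y now has mean ~0 and variance ~1.
```

Note the statistics are taken per position across the feature dimension, not across the batch, which is what distinguishes layer norm from batch norm.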

(more…)

The Combined Input Embedding

LLM Architecture Series – Lesson 5 of 20. You have seen token embeddings and position embeddings. Now we put them together into the actual input that flows into the transformer.

This combined embedding is what every later layer sees for each position in the sequence.
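The combination is just an element-wise sum. A NumPy sketch with random stand-in tables (both would be learned in a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 8   # small illustrative sizes
token_emb = rng.normal(size=(vocab_size, d_model))   # one row per token id
pos_emb = rng.normal(size=(max_len, d_model))        # one row per position

token_ids = np.array([5, 42, 7, 42])
# Token embedding lookup plus the position embedding for each slot.
x = token_emb[token_ids] + pos_emb[: len(token_ids)]
# Token 42 appears twice but gets two different input vectors,
# because its position embeddings differ.
```

That last point is the payoff: the sum lets identical tokens at different positions enter the transformer as distinct vectors.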

(more…)

Position Embeddings – Encoding Word Order

LLM Architecture Series – Lesson 4 of 20. So far we know the meaning of each token. Now we teach the model where each token sits in the sequence.

Position embeddings make it possible for the model to tell the difference between subject and object, start and end, and repeated words at different places.
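One classic choice, the fixed sinusoidal encoding from the original transformer paper, can be sketched in a few lines of NumPy (learned position embeddings are simply a trainable table instead):

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine, at
    # geometrically spaced frequencies, so every position gets a
    # unique, smoothly varying fingerprint.
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((max_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

enc = sinusoidal_positions(16, 8)   # one 8-dim vector per position
```

Position 0 comes out as alternating 0s and 1s (sin 0 and cos 0), and every later position differs from it, which is exactly what the model needs to tell positions apart.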

(more…)

Token Embeddings – Converting Words to Vectors

LLM Architecture Series – Lesson 3 of 20. Previously you learned how text becomes tokens. Now we look at how each token becomes a vector.

This step gives the model a continuous representation of meaning so that similar tokens end up near each other in vector space.
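Mechanically, this step is a table lookup. A tiny NumPy sketch with a made-up three-word vocabulary and random stand-in vectors (a trained model learns these rows):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "dog": 2}   # toy vocabulary
rng = np.random.default_rng(0)
d_model = 4
# The embedding table: one learned row of d_model floats per token id.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    # Embedding lookup is just row indexing into the table.
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]

vectors = embed(["the", "cat"])   # shape: (2, d_model)
```

During training, gradients flow back into the looked-up rows, which is how "similar tokens end up near each other" emerges rather than being hand-designed.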

(more…)

How LLMs Process Text – Tokenization Basics

LLM Architecture Series – Lesson 2 of 20. In the previous lesson you saw the full architecture. Now we zoom into the very first step: tokenization.

We turn text into a sequence of tokens so that the model can work with discrete pieces instead of raw characters.
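A deliberately toy word-level tokenizer shows the interface. Real LLMs use subword schemes such as BPE, but the contract is the same: text in, a sequence of integer ids out. The vocabulary here is made up for illustration:

```python
# Toy vocabulary; unknown words map to a dedicated <unk> id.
vocab = {"hello": 0, "world": 1, "<unk>": 2}

def tokenize(text):
    # Lowercase, split on whitespace, and look each word up in the vocab.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

ids = tokenize("Hello world")   # -> [0, 1]
```

Subword tokenizers exist precisely to avoid that `<unk>` fallback: instead of discarding an unseen word, they break it into known pieces.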

(more…)