The Projection Layer

LLM Architecture Series – Lesson 12 of 20. The attention heads produce outputs that must be merged and projected back into the model's hidden space.

This is done by a learned linear projection that mixes information from all heads into a single vector per position.
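
The merging step can be sketched in a few lines of pure Python. The sizes (2 heads of dimension 2, hidden size 4) and the identity weight matrix are toy stand-ins for the learned projection, not values from the series:

```python
# Toy setup: 2 heads, each producing a 2-dim output for one position,
# hidden size 4. Real models learn W_O; the identity here is a stand-in.
head_outputs = [[0.5, -1.0], [2.0, 0.25]]

# Concatenate the per-head outputs into one vector.
concat = [x for h in head_outputs for x in h]  # length 4

# Learned output projection W_O (4x4); identity used for illustration.
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# out = concat @ W_O: every output dimension can mix all heads.
out = [sum(concat[i] * W_O[i][j] for i in range(4)) for j in range(4)]
```

With a real, learned W_O every output coordinate is a weighted blend of all heads, which is what lets information from different heads interact.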

(more…)

Categories: AI

Attention Scores and Softmax

LLM Architecture Series – Lesson 11 of 20. We now have raw attention scores between tokens. These scores must be turned into normalized weights.

The softmax over attention scores produces a probability distribution that says how much each token should influence the current position.
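
As a minimal sketch (toy scores, not from the series), the softmax turns arbitrary real-valued scores into positive weights that sum to one:

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Higher raw score -> larger share of the attention weight.
weights = softmax([2.0, 1.0, 0.1])
```

The max-subtraction trick changes nothing mathematically but prevents overflow when scores are large.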

(more…)

Categories: AI

Causal Masking – Preventing Future Leakage

LLM Architecture Series – Lesson 10 of 20. The attention mechanism is powerful, but for language modeling we must forbid peeking at future tokens.

Causal masking enforces this rule so that the model is always predicting the next token only from tokens that came before.

(more…)

Categories: AI

Query, Key, Value in Attention

LLM Architecture Series – Lesson 9 of 20. Multi-head attention relies on three sets of vectors called queries, keys, and values.

These vectors control how positions compare to each other and how information flows across the sequence.
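
A toy sketch of the three projections for a single position, with made-up 2-dim weights (real Q/K/V matrices are learned and much larger):

```python
# One position's embedding and tiny stand-in projection matrices.
x = [1.0, 2.0]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0]]
W_v = [[2.0, 0.0], [0.0, 2.0]]

def matvec(W, v):
    # Row-vector times matrix: out[j] = sum_i v[i] * W[i][j].
    return [sum(W[i][j] * v[i] for i in range(len(v)))
            for j in range(len(W[0]))]

q = matvec(W_q, x)  # what this position is looking for
k = matvec(W_k, x)  # what this position advertises to others
v = matvec(W_v, x)  # the content it passes along when attended to

# Raw compatibility between a query and a key is their dot product.
score = sum(qi * ki for qi, ki in zip(q, k))
```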

(more…)

Categories: AI

Understanding Self-Attention – Part 2 (Multi-Head)

LLM Architecture Series – Lesson 8 of 20. You have seen single-head self-attention. In practice, transformers use many attention heads in parallel.

Each head can focus on a different kind of pattern, such as subject-verb agreement, long-range dependencies, or punctuation structure.
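
The parallel structure can be sketched as a loop over independent heads whose outputs are concatenated. All q/k/v vectors below are toy values, not learned projections:

```python
import math

def attention(q, k_list, v_list):
    # Scaled dot-product attention for one query over keys/values.
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
              for k in k_list]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    t = sum(exps)
    w = [e / t for e in exps]
    return [sum(wi * v[j] for wi, v in zip(w, v_list))
            for j in range(len(v_list[0]))]

# Two heads, each with its own (toy) query, keys, and values.
heads = [
    attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]),
    attention([0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [4.0, 4.0]]),
]
concat = [x for h in heads for x in h]  # merged before the output projection
```

Because each head has its own projections, each one can compute a completely different weighting over the sequence.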

(more…)

Categories: AI

Understanding Self-Attention – Part 1

LLM Architecture Series – Lesson 7 of 20. So far each position has a clean, normalized embedding. Now the model needs a way for each position to look at other positions.

Self-attention lets every token decide which other tokens are important for predicting the next output.
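
A bare-bones sketch over a toy 3-token sequence. To keep it minimal, queries, keys, and values are the raw embeddings themselves; real models insert learned projections first:

```python
import math

# Toy 2-dim embeddings for a 3-token sequence.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d = 2

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    t = sum(exps)
    return [e / t for e in exps]

# weights[i][j]: how strongly token i attends to token j.
weights = [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                    for k in seq])
           for q in seq]
```

Each row of `weights` is a distribution over the whole sequence, which is exactly the "every token looks at every token" behavior described above.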

(more…)

Categories: AI

Layer Normalization in Transformers

LLM Architecture Series – Lesson 6 of 20. At this point each position has a combined embedding. Before the attention or feed-forward layers, the model often applies layer normalization.

Layer normalization rescales activations so that the following layers behave in a stable and predictable way.
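
A minimal sketch of the computation, with toy activations. The learned scale (`gamma`) and shift (`beta`) default to identity values here:

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    # Normalize one position's activations to zero mean and unit
    # variance, then apply a learned scale and shift.
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    gamma = gamma or [1.0] * n
    beta = beta or [0.0] * n
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

out = layer_norm([2.0, 4.0, 6.0, 8.0])
```

The `eps` term guards against division by zero when all activations are equal; `gamma` and `beta` let the network undo the normalization where that helps.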

(more…)

Categories: AI

The Combined Input Embedding

LLM Architecture Series – Lesson 5 of 20. You have seen token embeddings and position embeddings. Now we put them together into the actual input that flows into the transformer.

This combined embedding is what every later layer sees for each position in the sequence.
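
The combination is a simple element-wise sum, sketched here with made-up 3-dim vectors:

```python
# Toy token and position embeddings for a 2-token sequence.
token_emb = [[0.1, 0.2, 0.3],    # embedding of the token at position 0
             [0.4, 0.5, 0.6]]    # embedding of the token at position 1
pos_emb   = [[0.01, 0.01, 0.01], # embedding of position 0
             [0.02, 0.02, 0.02]] # embedding of position 1

# Element-wise sum: same vector length in, same vector length out.
combined = [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(token_emb, pos_emb)]
```

Because the two embeddings live in the same space, a plain sum is enough; the later layers learn to separate "what" from "where" on their own.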

(more…)

Categories: AI

Position Embeddings – Encoding Word Order

LLM Architecture Series – Lesson 4 of 20. So far we know the meaning of each token. Now we teach the model where each token sits in the sequence.

Position embeddings make it possible for the model to tell the difference between subject and object, start and end, and repeated words at different places.
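
One concrete scheme (the fixed sinusoidal encoding from the original Transformer; many LLMs instead learn a position table, so treat this as one illustrative option, not the series' choice):

```python
import math

def sinusoidal_position(pos, d_model):
    # Even dimensions use sin, odd dimensions use cos, at geometrically
    # spaced frequencies, so every position gets a distinct vector.
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** ((i // 2 * 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

p0 = sinusoidal_position(0, 4)
p1 = sinusoidal_position(1, 4)
```

Distinct vectors per position are what let the model tell a repeated word at position 2 apart from the same word at position 7.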

(more…)

Categories: AI

Token Embeddings – Converting Words to Vectors

LLM Architecture Series – Lesson 3 of 20. Previously you learned how text becomes tokens. Now we look at how each token becomes a vector.

This step gives the model a continuous representation of meaning so that similar tokens end up near each other in vector space.
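
Mechanically this is a table lookup: one learned vector per vocabulary entry, indexed by token id. The tiny vocabulary and random initialization below are illustrative stand-ins; training would adjust the table's values:

```python
import random

random.seed(0)
vocab = {"the": 0, "cat": 1, "sat": 2}  # toy token-to-id mapping
d_model = 4

# Embedding table: one d_model vector per vocabulary entry.
table = [[random.uniform(-1, 1) for _ in range(d_model)]
         for _ in vocab]

def embed(token):
    # Lookup, not computation: the token id selects a row of the table.
    return table[vocab[token]]

vec = embed("cat")
```

Because the table is trained with the rest of the model, tokens used in similar contexts end up with nearby rows, which is the "similar tokens near each other" property described above.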

(more…)

Categories: AI