LLM Architecture Series – Lesson 11 of 20. We now have raw attention scores between tokens, and these scores must be turned into normalised weights.
Applying a softmax over the attention scores produces a probability distribution that says how strongly each token should influence the current position: the weights are non-negative and sum to one.
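The normalisation step can be sketched as follows; this is a minimal NumPy illustration, and the score values are made up for the example (real scores come from the query–key dot products of earlier lessons):

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max before exponentiating for numerical
    # stability; this shift does not change the final weights.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Hypothetical raw attention scores for one query position
# attending over four key positions.
scores = np.array([2.0, 1.0, 0.1, -1.0])
weights = softmax(scores)

print(weights)        # larger scores map to larger weights
print(weights.sum())  # the weights sum to 1.0
```

Note that softmax preserves the ordering of the scores while squashing them into a valid distribution, which is why the largest raw score always receives the largest weight.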
