LLM Architecture Series – Lesson 12 of 20. The attention heads produce outputs that must be merged and projected back into the model's hidden space.
In practice, the per-head outputs are concatenated along the feature dimension and passed through a learned linear projection (often written W_O), which mixes information from all heads into a single d_model-dimensional vector per position; a sketch of this step follows below.
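To make this concrete, here is a minimal PyTorch sketch of the merge-and-project step. The class name MultiHeadOutput, the tensor layout, and the specific dimensions are illustrative assumptions, not code from this series; the essential idea is simply concatenation followed by one learned linear map.

```python
import torch
import torch.nn as nn

class MultiHeadOutput(nn.Module):
    """Merges per-head attention outputs and projects back to the model width."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # Learned output projection (often written W_O). Because every output
        # feature is a weighted sum over all input features, it mixes
        # information from all heads into each position's vector.
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        batch, num_heads, seq_len, head_dim = head_outputs.shape
        # Concatenate the heads along the feature dimension:
        # (batch, seq_len, num_heads * head_dim) == (batch, seq_len, d_model)
        merged = head_outputs.transpose(1, 2).reshape(batch, seq_len, num_heads * head_dim)
        # Project back into the model's hidden space: one vector per position.
        return self.w_o(merged)


# Usage (hypothetical sizes): 8 heads of width 64 give a model width of 512.
heads = torch.randn(2, 8, 16, 64)       # (batch, heads, positions, head_dim)
merge = MultiHeadOutput(d_model=512, num_heads=8)
out = merge(heads)                       # shape: (2, 16, 512)
```

A design note: concatenation alone would keep each head's features in separate slices of the vector; it is the projection W_O that lets the model combine evidence across heads before the result is handed to the next sublayer.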
