Layer Normalisation in Transformers

LLM Architecture Series – Lesson 6 of 20. At this point each position has a combined embedding. Before the attention and feed-forward layers, the model applies layer normalisation.

Layer normalisation rescales activations so that the following layers behave in a stable and predictable way.

The Combined Input Embedding

LLM Architecture Series – Lesson 5 of 20. You have seen token embeddings and position embeddings. Now we put them together into the actual input that flows into the transformer.

This combined embedding is what every later layer sees for each position in the sequence.

Position Embeddings – Encoding Word Order

LLM Architecture Series – Lesson 4 of 20. So far we know the meaning of each token. Now we teach the model where each token sits in the sequence.

Position embeddings make it possible for the model to tell the difference between subject and object, start and end, and repeated words at different places.

Introduction to Large Language Models

LLM Architecture Series – Lesson 1 of 20. This article gives you the big picture of a modern language model before we zoom into each part.

You can think of a large language model as a very advanced auto-complete engine that predicts the next token based on everything it has seen so far.

LLM Architecture Series – Complete Guide

LLM Architecture Overview - Annotated

Visualisation from bbycroft.net/llm – Annotated with Nano Banana

Welcome to the LLM Architecture Series

This comprehensive 20-part series takes you from the fundamentals to advanced concepts in Large Language Model architecture. Using interactive visualisations from Brendan Bycroft’s excellent LLM Visualisation, we explore every component of a GPT-style transformer.

Series Overview

Part 1: Foundations (Articles 1-5)

  1. Introduction to Large Language Models – What LLMs are and how they work
  2. Tokenization Basics – Converting text to tokens
  3. Token Embeddings – Converting tokens to vectors
  4. Position Embeddings – Encoding word order
  5. Combined Input Embedding – Putting it together

Part 2: The Transformer Block (Articles 6-14)

  6. Layer Normalisation – Stabilizing the network
  7. Self-Attention Part 1 – The core innovation
  8. Self-Attention Part 2 – Multi-head attention
  9. Query, Key, Value – The attention framework
  10. Causal Masking – Preventing future leakage
  11. Attention Softmax – Computing attention weights
  12. Projection Layer – Combining attention outputs
  13. Feed-Forward Networks – The MLP component
  14. Residual Connections – Skip connections for depth

Part 3: The Complete Model (Articles 15-20)

  15. Complete Transformer Block – All components together
  16. Stacking Layers – Building depth
  17. Output Layer – The language model head
  18. Output Softmax – From logits to probabilities
  19. Scaling LLMs – From nano-GPT to GPT-3
  20. Complete Pipeline – The full picture

About This Series

Each article includes:

  • Interactive visualisations from bbycroft.net/llm
  • Mathematical equations explaining each component
  • Intuitive explanations of why each part matters
  • Navigation links to previous and next articles

Start Learning

Begin with: Introduction to Large Language Models


Interactive visualisations courtesy of bbycroft.net/llm by Brendan Bycroft. Annotated images created with Nano Banana.

AI Inference Pipeline Explained

AI Inference: From Input to Output

Inference is when a trained model makes predictions on new data. While training happens once, inference runs millions of times in production. Understanding the inference pipeline is crucial for deploying AI systems that are fast, efficient, and reliable.

The pipeline involves preprocessing inputs (tokenizing text, resizing images, normalising values), running the forward pass through model layers, and postprocessing outputs (decoding tokens, formatting responses, applying thresholds). Each stage offers optimisation opportunities.
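As a concrete illustration, the three stages can be sketched end to end. Everything below (the word-level tokenizer, the logits lookup standing in for a real forward pass, and the tiny vocabulary) is a hypothetical stand-in, not any real library's API:

```python
def preprocess(text, vocab):
    # Tokenize: map each whitespace-separated word to an id (0 = unknown).
    return [vocab.get(word, 0) for word in text.lower().split()]

def forward_pass(token_ids, logits_table):
    # Stand-in for the model's forward pass: next-token logits,
    # looked up from the last input token.
    return logits_table[token_ids[-1]]

def postprocess(logits, id_to_word):
    # Decode: pick the highest-scoring token id and map it back to text.
    best_id = max(range(len(logits)), key=lambda i: logits[i])
    return id_to_word[best_id]

vocab = {"hello": 1, "world": 2}
id_to_word = {0: "<unk>", 1: "hello", 2: "world"}
logits_table = {0: [1.0, 0.0, 0.0], 1: [0.1, 0.2, 0.9], 2: [0.1, 0.9, 0.2]}

ids = preprocess("Hello world", vocab)
next_word = postprocess(forward_pass(ids, logits_table), id_to_word)
print(next_word)  # "hello": the highest logit after seeing "world"
```

In a real system each stage is far richer (subword tokenizers, batched GPU forward passes, streaming decoders), but the three-stage shape is the same.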

Common optimisations include quantization (reducing precision from FP32 to INT8), pruning (removing unimportant weights), distillation (training smaller models to mimic larger ones), and batching (processing multiple inputs together for GPU efficiency).
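Quantization, the first of these, can be illustrated with a minimal symmetric INT8 scheme; this is a simplified sketch, not any particular framework's implementation:

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor quantization: scale so the max |w| maps to 127.
    scale = np.abs(weights).max() / 127.0
    return np.round(weights / scale).astype(np.int8), scale

def dequantize(q, scale):
    # Recover an FP32 approximation of the original weights.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 1.19], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
print(np.max(np.abs(w - w_hat)) <= scale / 2)  # True
```

Production schemes typically quantize per channel and calibrate scales on sample data, but the trade-off shown here (4x smaller storage at the cost of bounded rounding error) is the core idea.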

Latency considerations matter for user experience. First token latency determines perceived responsiveness. Tokens per second affects total generation time. For production systems, balance quality against speed based on your specific use case and user expectations.
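To see how the two numbers combine, here is a rough model of total generation time; the figures are purely illustrative:

```python
def generation_time(first_token_s, tokens_per_s, n_tokens):
    # Total wall-clock time: wait for the first token, then stream the rest.
    return first_token_s + (n_tokens - 1) / tokens_per_s

# Hypothetical numbers: 0.5 s to first token, then 40 tokens/s for 201 tokens.
print(generation_time(0.5, 40, 201))  # 5.5 seconds
```

Note the asymmetry: for short replies, first-token latency dominates perceived speed; for long replies, tokens per second dominates total time.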

Fine-Tuning AI Models Explained

Fine-Tuning: Adapting Pre-trained Models

Fine-tuning transfers knowledge from large pre-trained models to specific tasks with limited data. This paradigm revolutionized AI by making powerful models accessible without requiring massive datasets or compute budgets for each new application.

The process involves taking a model pre-trained on massive general data, then continuing training on task-specific data with a lower learning rate. The pre-trained weights provide excellent initialisation, and fine-tuning adapts them to the new domain.

Different strategies trade off performance against efficiency. Full fine-tuning updates all parameters for the best results but requires significant compute. LoRA adds small trainable matrices alongside frozen weights, achieving 90%+ of full fine-tuning performance with a fraction of the parameters.
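A minimal sketch of the LoRA idea for a single linear layer, using NumPy in place of a real training framework; the dimensions and rank are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (in practice r is much smaller than d)

W = rng.normal(size=(d, d))           # frozen pre-trained weight
A = rng.normal(size=(d, r)) * 0.01    # trainable down-projection
B = np.zeros((r, d))                  # trainable up-projection, zero-initialised

def lora_forward(x):
    # Adapted layer: frozen path plus the low-rank trainable update A @ B.
    # A and B hold 2*d*r parameters versus d*d in W; for realistic d
    # (thousands) and small r this is a tiny fraction.
    return x @ W + x @ A @ B

x = rng.normal(size=(1, d))
# Because B starts at zero, fine-tuning begins exactly at the pre-trained model.
print(np.allclose(lora_forward(x), x @ W))  # True
```

Only A and B receive gradients during training; W stays frozen, which is where the memory and compute savings come from.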

Best practices include using lower learning rates than pre-training, employing warmup steps, monitoring validation loss for early stopping, and considering gradual unfreezing where deeper layers train first before unfreezing earlier layers.
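The warmup and lower-learning-rate advice can be expressed as a simple schedule; the step count and rate below are illustrative, not recommendations for any specific model:

```python
def lr_schedule(step, base_lr=2e-5, warmup_steps=100):
    # Linear warmup from near zero up to base_lr, then hold constant.
    # base_lr is deliberately far below typical pre-training rates.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(lr_schedule(0), lr_schedule(100))  # tiny at first, then the full rate
```

Real schedules usually add a decay phase after warmup (cosine or linear), but the warmup ramp is the part that protects the pre-trained weights from large early updates.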

GPT Architecture Explained

GPT: Inside Large Language Models

GPT (Generative Pre-trained Transformer) uses a decoder-only architecture to generate text token by token. Understanding its internals reveals how modern AI achieves remarkable language capabilities through elegant architectural choices.

Each token receives an embedding plus a positional encoding, then passes through stacked Transformer blocks. Each block contains masked self-attention (each position can attend only to earlier tokens) and a feed-forward network. This autoregressive structure enables coherent text generation.
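The masked self-attention described above can be sketched in a toy single-head form; to keep the example short, the queries, keys, and values are all taken to be the raw input rather than learned projections:

```python
import numpy as np

def causal_self_attention(x):
    # Toy single-head attention where Q = K = V = x (no learned projections).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    # Causal mask: position i may only attend to positions j <= i.
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.default_rng(1).normal(size=(4, 8))  # 4 tokens, embedding dim 8
out = causal_self_attention(x)
# The first token can attend only to itself, so its output equals its input.
print(np.allclose(out[0], x[0]))  # True
```

The upper-triangular mask is what makes the block a decoder: no position ever sees information from its future.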

Scale proves critical for emergent capabilities. GPT-2 had 1.5B parameters, GPT-3 scaled to 175B, and GPT-4 reportedly exceeds a trillion. With scale come surprising abilities: few-shot learning, reasoning, and code generation that emerge without explicit training.

Generation works by processing input tokens through all layers, predicting probability distribution over vocabulary for the next token, sampling from that distribution, appending the result, and repeating. Temperature and top-p sampling control creativity versus coherence.
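One generation step with temperature and top-p might look like this simplified sketch; a real model would produce the logits, but here they are given:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_p=0.9, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Temperature scaling: values < 1 sharpen the distribution, > 1 flatten it.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Top-p (nucleus) cut: keep the smallest set of tokens whose mass >= top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    kept = probs[keep] / probs[keep].sum()
    # Sample the next token id from the renormalised nucleus.
    return int(rng.choice(keep, p=kept))

logits = np.array([2.0, 1.0, 0.2, -1.0])  # toy scores over a 4-token vocabulary
token = sample_next(logits, temperature=0.8, top_p=0.9)
print(token)  # 0 or 1: only the two top tokens survive the nucleus cut here
```

Lower temperatures and smaller top-p values both shrink the effective candidate set, trading creativity for coherence as described above; the sampled token would then be appended to the input and the loop repeated.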