Layer Normalisation in Transformers

LLM Architecture Series – Lesson 6 of 20. At this point each position has a combined embedding. Before the attention and feed-forward layers, the model applies layer normalisation.

Layer normalisation rescales activations so that the following layers behave in a stable and predictable way.

The Combined Input Embedding

LLM Architecture Series – Lesson 5 of 20. You have seen token embeddings and position embeddings. Now we put them together into the actual input that flows into the transformer.

This combined embedding is what every later layer sees for each position in the sequence.

Position Embeddings – Encoding Word Order

LLM Architecture Series – Lesson 4 of 20. So far we know the meaning of each token. Now we teach the model where each token sits in the sequence.

Position embeddings make it possible for the model to tell the difference between subject and object, start and end, and repeated words at different places.

Introduction to Large Language Models

LLM Architecture Series – Lesson 1 of 20. This article gives you the big picture of a modern language model before we zoom into each part.

You can think of a large language model as a very advanced auto-complete engine that predicts the next token based on everything it has seen so far.

LLM Architecture Series – Complete Guide

LLM Architecture Overview - Annotated

Visualisation from bbycroft.net/llm – Annotated with Nano Banana

Welcome to the LLM Architecture Series

This comprehensive 20-part series takes you from the fundamentals to advanced concepts in Large Language Model architecture. Using interactive visualisations from Brendan Bycroft’s excellent LLM Visualisation, we explore every component of a GPT-style transformer.

Series Overview

Part 1: Foundations (Articles 1-5)

  1. Introduction to Large Language Models – What LLMs are and how they work
  2. Tokenization Basics – Converting text to tokens
  3. Token Embeddings – Converting tokens to vectors
  4. Position Embeddings – Encoding word order
  5. Combined Input Embedding – Putting it together

Part 2: The Transformer Block (Articles 6-14)

  6. Layer Normalisation – Stabilizing the network
  7. Self-Attention Part 1 – The core innovation
  8. Self-Attention Part 2 – Multi-head attention
  9. Query, Key, Value – The attention framework
  10. Causal Masking – Preventing future leakage
  11. Attention Softmax – Computing attention weights
  12. Projection Layer – Combining attention outputs
  13. Feed-Forward Networks – The MLP component
  14. Residual Connections – Skip connections for depth

Part 3: The Complete Model (Articles 15-20)

  15. Complete Transformer Block – All components together
  16. Stacking Layers – Building depth
  17. Output Layer – The language model head
  18. Output Softmax – From logits to probabilities
  19. Scaling LLMs – From nano-GPT to GPT-3
  20. Complete Pipeline – The full picture

About This Series

Each article includes:

  • Interactive visualisations from bbycroft.net/llm
  • Mathematical equations explaining each component
  • Intuitive explanations of why each part matters
  • Navigation links to previous and next articles

Start Learning

Begin with: Introduction to Large Language Models


Interactive visualisations courtesy of bbycroft.net/llm by Brendan Bycroft. Annotated images created with Nano Banana.

AI Inference Pipeline Explained

AI Inference: From Input to Output

Inference is when a trained model makes predictions on new data. While training happens once, inference runs millions of times in production. Understanding the inference pipeline is crucial for deploying AI systems that are fast, efficient, and reliable.

The pipeline involves preprocessing inputs (tokenizing text, resizing images, normalising values), running the forward pass through model layers, and postprocessing outputs (decoding tokens, formatting responses, applying thresholds). Each stage offers optimisation opportunities.
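As a concrete illustration, the three stages can be sketched end to end. Everything below (the word-level tokenizer, the logits lookup standing in for a real forward pass, and the tiny vocabulary) is a hypothetical stand-in, not any real library's API:

```python
def preprocess(text, vocab):
    # Tokenize: map each whitespace-separated word to an id (0 = unknown).
    return [vocab.get(word, 0) for word in text.lower().split()]

def forward_pass(token_ids, logits_table):
    # Stand-in for the model's forward pass: next-token logits,
    # looked up from the last input token.
    return logits_table[token_ids[-1]]

def postprocess(logits, id_to_word):
    # Decode: pick the highest-scoring token id and map it back to text.
    best_id = max(range(len(logits)), key=lambda i: logits[i])
    return id_to_word[best_id]

vocab = {"hello": 1, "world": 2}
id_to_word = {0: "<unk>", 1: "hello", 2: "world"}
logits_table = {0: [1.0, 0.0, 0.0], 1: [0.1, 0.2, 0.9], 2: [0.1, 0.9, 0.2]}

ids = preprocess("Hello world", vocab)
next_word = postprocess(forward_pass(ids, logits_table), id_to_word)
print(next_word)  # "hello": the highest logit after seeing "world"
```

In a real system each stage is far richer (subword tokenizers, batched GPU forward passes, streaming decoders), but the three-stage shape is the same.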

Common optimisations include quantization (reducing precision from FP32 to INT8), pruning (removing unimportant weights), distillation (training smaller models to mimic larger ones), and batching (processing multiple inputs together for GPU efficiency).
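Quantization, the first of these, can be illustrated with a minimal symmetric INT8 scheme; this is a simplified sketch, not any particular framework's implementation:

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor quantization: scale so the max |w| maps to 127.
    scale = np.abs(weights).max() / 127.0
    return np.round(weights / scale).astype(np.int8), scale

def dequantize(q, scale):
    # Recover an FP32 approximation of the original weights.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 1.19], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
print(np.max(np.abs(w - w_hat)) <= scale / 2)  # True
```

Production schemes typically quantize per channel and calibrate scales on sample data, but the trade-off shown here (4x smaller storage at the cost of bounded rounding error) is the core idea.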

Latency considerations matter for user experience. First token latency determines perceived responsiveness. Tokens per second affects total generation time. For production systems, balance quality against speed based on your specific use case and user expectations.
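To see how the two numbers combine, here is a rough model of total generation time; the figures are purely illustrative:

```python
def generation_time(first_token_s, tokens_per_s, n_tokens):
    # Total wall-clock time: wait for the first token, then stream the rest.
    return first_token_s + (n_tokens - 1) / tokens_per_s

# Hypothetical numbers: 0.5 s to first token, then 40 tokens/s for 201 tokens.
print(generation_time(0.5, 40, 201))  # 5.5 seconds
```

Note the asymmetry: for short replies, first-token latency dominates perceived speed; for long replies, tokens per second dominates total time.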

Fine-Tuning AI Models Explained

Fine-Tuning: Adapting Pre-trained Models

Fine-tuning transfers knowledge from large pre-trained models to specific tasks with limited data. This paradigm revolutionized AI by making powerful models accessible without requiring massive datasets or compute budgets for each new application.

The process involves taking a model pre-trained on massive general data, then continuing training on task-specific data with a lower learning rate. The pre-trained weights provide excellent initialisation, and fine-tuning adapts them to the new domain.

Different strategies trade off performance against efficiency. Full fine-tuning updates all parameters for the best results but requires significant compute. LoRA adds small trainable matrices alongside frozen weights, achieving 90%+ of full fine-tuning performance with a fraction of the parameters.
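A minimal sketch of the LoRA idea for a single linear layer, using NumPy in place of a real training framework; the dimensions and rank are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (in practice r is much smaller than d)

W = rng.normal(size=(d, d))           # frozen pre-trained weight
A = rng.normal(size=(d, r)) * 0.01    # trainable down-projection
B = np.zeros((r, d))                  # trainable up-projection, zero-initialised

def lora_forward(x):
    # Adapted layer: frozen path plus the low-rank trainable update A @ B.
    # A and B hold 2*d*r parameters versus d*d in W; for realistic d
    # (thousands) and small r this is a tiny fraction.
    return x @ W + x @ A @ B

x = rng.normal(size=(1, d))
# Because B starts at zero, fine-tuning begins exactly at the pre-trained model.
print(np.allclose(lora_forward(x), x @ W))  # True
```

Only A and B receive gradients during training; W stays frozen, which is where the memory and compute savings come from.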

Best practices include using lower learning rates than pre-training, employing warmup steps, monitoring validation loss for early stopping, and considering gradual unfreezing where deeper layers train first before unfreezing earlier layers.
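The warmup and lower-learning-rate advice can be expressed as a simple schedule; the step count and rate below are illustrative, not recommendations for any specific model:

```python
def lr_schedule(step, base_lr=2e-5, warmup_steps=100):
    # Linear warmup from near zero up to base_lr, then hold constant.
    # base_lr is deliberately far below typical pre-training rates.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(lr_schedule(0), lr_schedule(100))  # tiny at first, then the full rate
```

Real schedules usually add a decay phase after warmup (cosine or linear), but the warmup ramp is the part that protects the pre-trained weights from large early updates.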

GPT Architecture Explained

GPT: Inside Large Language Models

GPT (Generative Pre-trained Transformer) uses a decoder-only architecture to generate text token by token. Understanding its internals reveals how modern AI achieves remarkable language capabilities through elegant architectural choices.

Each token receives an embedding plus a positional encoding, then passes through stacked Transformer blocks. Each block contains masked self-attention (each position can attend only to earlier tokens) and a feed-forward network. This autoregressive structure enables coherent text generation.
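The masked self-attention described above can be sketched in a toy single-head form; to keep the example short, the queries, keys, and values are all taken to be the raw input rather than learned projections:

```python
import numpy as np

def causal_self_attention(x):
    # Toy single-head attention where Q = K = V = x (no learned projections).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    # Causal mask: position i may only attend to positions j <= i.
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.default_rng(1).normal(size=(4, 8))  # 4 tokens, embedding dim 8
out = causal_self_attention(x)
# The first token can attend only to itself, so its output equals its input.
print(np.allclose(out[0], x[0]))  # True
```

The upper-triangular mask is what makes the block a decoder: no position ever sees information from its future.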

Scale proves critical for emergent capabilities. GPT-2 had 1.5B parameters, GPT-3 scaled to 175B, and GPT-4 reportedly exceeds a trillion. With scale come surprising abilities: few-shot learning, reasoning, and code generation that emerge without explicit training.

Generation works by processing input tokens through all layers, predicting probability distribution over vocabulary for the next token, sampling from that distribution, appending the result, and repeating. Temperature and top-p sampling control creativity versus coherence.
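One generation step with temperature and top-p might look like this simplified sketch; a real model would produce the logits, but here they are given:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_p=0.9, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Temperature scaling: values < 1 sharpen the distribution, > 1 flatten it.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Top-p (nucleus) cut: keep the smallest set of tokens whose mass >= top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    kept = probs[keep] / probs[keep].sum()
    # Sample the next token id from the renormalised nucleus.
    return int(rng.choice(keep, p=kept))

logits = np.array([2.0, 1.0, 0.2, -1.0])  # toy scores over a 4-token vocabulary
token = sample_next(logits, temperature=0.8, top_p=0.9)
print(token)  # 0 or 1: only the two top tokens survive the nucleus cut here
```

Lower temperatures and smaller top-p values both shrink the effective candidate set, trading creativity for coherence as described above; the sampled token would then be appended to the input and the loop repeated.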