Overfitting vs Underfitting Explained

Overfitting

The Generalisation Challenge

The core challenge in machine learning is generalising from training data to new examples. A model that memorises training data perfectly but fails on new data is useless. Understanding overfitting and underfitting is essential for building models that actually work in production.

Underfitting occurs when the model is too simple to capture patterns. High bias leads to poor performance on both training and test data. Solutions include increasing model capacity, adding features, training longer, or reducing regularization.

Overfitting happens when the model memorises training data including noise. High variance means great training performance but poor test performance. Solutions include more training data, regularization techniques, dropout, early stopping, and data augmentation.

The bias-variance tradeoff underlies this challenge. Simple models have high bias but low variance. Complex models have low bias but high variance. The goal is finding the sweet spot where total error is minimised for your specific application.

Gradient Descent Optimization Explained

Gradient Descent

Gradient Descent: Finding Optimal Solutions

Gradient descent is the optimisation engine powering machine learning. It finds minimum error by iteratively moving in the direction of steepest descent on the loss surface. Understanding this algorithm is fundamental to training any neural network effectively.

Three main variants exist. Batch gradient descent uses the entire dataset per update – stable but slow. Stochastic gradient descent uses one sample – noisy but fast with implicit regularization. Mini-batch balances both using small batches of 32-512 samples.

Advanced optimisers improve upon vanilla gradient descent. Momentum accumulates velocity to escape local minima. RMSprop adapts learning rates per parameter. Adam combines both approaches and has become the default choice for most deep learning applications.

Learning rate scheduling further enhances training. Start with larger rates for fast progress, decay over time for fine convergence. Popular schedules include step decay, cosine annealing, and warmup followed by decay.

Tokenization in NLP Explained

Tokenization

Tokenization: Breaking Text for AI

Before AI can process text, it must be split into tokens – the fundamental units the model works with. Tokenization strategy significantly impacts model performance, vocabulary size, and ability to handle rare or novel words. Getting it right matters more than many realise.

Word-level tokenization splits on whitespace, creating intuitive tokens but massive vocabularies. Unknown words become impossible to handle. Character-level uses each character as a token, handling any text but losing word-level semantics and creating very long sequences.

Subword tokenization like BPE and WordPiece offers the best of both worlds. Common words remain whole while rare words split into meaningful subwords. The vocabulary stays manageable at 32K-100K tokens while handling novel words by decomposition.

Modern tokenizers also handle special tokens for model control: beginning and end of sequence markers, padding tokens, and special separators. Understanding tokenization helps debug model behaviour – sometimes strange outputs trace back to unexpected token boundaries.

Word Embeddings Explained

Embeddings

Word Embeddings: Meaning in Vectors

Word embeddings convert discrete tokens into dense vectors where semantic similarity corresponds to geometric proximity. This representation enables neural networks to understand that king and queen are related, or that Paris relates to France as Rome relates to Italy.

Early methods like Word2Vec learn embeddings by predicting context from words or vice versa. GloVe factors word co-occurrence matrices for global statistics. These static embeddings assign one vector per word regardless of context, missing nuances like bank meaning riverbank versus financial institution.

Contextual embeddings from BERT and GPT solved this limitation. The same word gets different representations based on surrounding context. This dynamic understanding dramatically improved performance on tasks requiring disambiguation and nuanced comprehension.

Embeddings reveal learned structure through vector arithmetic. The classic example: king minus man plus woman equals queen. These relationships emerge unsupervised from training data, demonstrating that neural networks discover meaningful semantic structure without explicit guidance.

Self-Attention Mechanism Explained

Attention

Self-Attention: The Core of Transformers

Self-attention allows each element in a sequence to attend to all other elements, capturing relationships regardless of distance. This mechanism revolutionized NLP by replacing recurrence with parallelisable attention operations that model long-range dependencies more effectively.

For each token, three vectors are computed: Query (what am I looking for), Key (what do I contain), and Value (what information do I provide). Attention weights come from the dot product of Query and Key, normalised by softmax. These weights determine how much each token contributes to the output.

Multi-head attention runs multiple attention mechanisms in parallel, each learning to focus on different relationship types. One head might capture syntactic dependencies while another focuses on semantic similarity. Concatenating and projecting results combines these perspectives.

The attention mechanism enables remarkable capabilities in language models. A model can understand that a pronoun refers to a noun mentioned paragraphs earlier, or that a technical term definition from the context applies throughout. This contextual understanding underpins modern LLM performance.

LSTM Networks Explained

LSTM Cell Architecture Diagram

Introduction: The Power of LSTMs

The Long Short-Term Memory (LSTM) network is a specialised kind of Recurrent Neural Network (RNN) architecture, designed specifically to solve the problem of vanishing gradients that plagues traditional RNNs when dealing with long sequences of data.

While standard RNNs struggle to retain information from many steps ago, LSTMs are engineered with a dedicated Cell State (Ct)—often called the “conveyor belt”—that runs straight through the network. This Cell State is regulated by three distinct, multiplicative gates (Forget, Input, and Output) that learn to selectively remember or forget information, allowing the network to capture and utilise long-term dependencies in sequential data like text, speech, and time series. The mathematical equations below illustrate how these gates precisely control the flow of both long-term memory (Ct) and short-term output (ht).


LSTM Cell Architecture Equations

This breakdown translates the LSTM diagram into its corresponding mathematical equations, showing exactly how the inputs (xt, ht-1, Ct-1) are processed to generate the outputs (ht, Ct).

The σ symbol represents the Sigmoid function, and W and b represent the weight matrices and bias vectors learned during training.

1. The Gates (Control)

The first step is calculating the three gates, each using the current input (xt) and the previous hidden state (ht-1) and applying a sigmoid function (σ):

A. Forget Gate (ft):

ft = σ(Wf · [ht-1, xt] + bf)

Purpose: Decides which information to forget from the old cell state (Ct-1).

B. Input Gate (it):

it = σ(Wi · [ht-1, xt] + bi)

Purpose: Decides which values to update in the cell state.

C. Candidate Cell State (ᶜt):

t = tanh(WC · [ht-1, xt] + bC)

Purpose: Creates a vector of potential new values that could be added to the cell state.

2. Cell State Update (The Memory)

The Cell State (Ct) is the core memory of the LSTM, updated by combining the old memory and the new candidate memory:

Ct = ft ∗ Ct-1 + it ∗ ᶜt

  • The term ft * Ct-1 implements the forgetting mechanism: the old memory Ct-1 is scaled down by the Forget Gate ft.
  • The term it * ᶜt implements the input mechanism: the new candidate information ᶜt is scaled by the Input Gate it.
  • These two parts are then added to create the new long-term memory, Ct.

3. Hidden State Output (The Prediction)

The Hidden State (ht) is the final output of the cell at this time step. It is based on the new Cell State, filtered by the Output Gate:

A. Output Gate (ot):

ot = σ(Wo · [ht-1, xt] + bo)

Purpose: Decides which parts of the (squashed) Cell State will be exposed as the Hidden State.

B. Final Hidden State (ht):

ht = ot ∗ tanh(Ct)

  • The new Cell State Ct is passed through tanh to bound the values between -1 and 1.
  • The result is then element-wise multiplied by the Output Gate ot to produce the final short-term memory and output vector, ht.

Conclusion: The Importance of Selective Memory

The LSTM architecture, as described by these equations, fundamentally improved the capability of recurrent neural networks to model complex dependencies over long sequences. By using three learned, multiplicative gates to regulate the flow into and out of the Cell State, the LSTM is able to maintain a stable, uncorrupted memory path, overcoming the practical limitations of standard RNNs.

This innovation has made LSTMs essential tools in areas requiring deep contextual understanding, leading to breakthroughs in speech recognition, machine translation, and text generation, before the wider adoption of the Transformer architecture.

Next Steps

Interested in a simpler alternative? Check out the GRU (Gated Recurrent Unit), which combines the forget and input gates into a single update gate—achieving similar performance with fewer parameters. For cutting-edge sequence modeling, explore how Transformers use attention mechanisms to process entire sequences in parallel, bypassing recurrence altogether.

RNNs: Processing Sequential Data

RNN

Recurrent Neural Networks: Processing Sequences

RNNs were designed for sequential data – text, time series, audio, and video. Unlike feedforward networks that process fixed-size inputs, RNNs maintain a hidden state that acts as memory, allowing information to persist across the sequence and enabling context-aware processing.

At each timestep, the hidden state combines the previous state with new input through learned transformations. This recurrence creates a computational graph that unfolds through time, theoretically allowing information from early inputs to influence later processing indefinitely.

In practice, vanilla RNNs struggle with long sequences due to vanishing gradients. When backpropagating through many timesteps, gradients shrink exponentially, causing the network to forget early information. Exploding gradients present the opposite problem, causing training instability.

While largely superseded by Transformers for most applications, understanding RNNs remains valuable. They introduced key concepts like sequence modeling and temporal dependencies. LSTM and GRU variants solved the gradient problems, and some real-time applications still benefit from RNNs streaming nature.

CNNs: How AI Sees Images

CNN

Convolutional Neural Networks: How AI Sees

CNNs revolutionized computer vision by mimicking how the visual cortex processes images. Small learnable filters slide across the image detecting features, with early layers finding edges and later layers identifying complex objects. This hierarchical feature learning made accurate image recognition possible.

Key components include convolutional layers where filters detect local patterns, pooling layers that reduce spatial dimensions while preserving important features, and fully connected layers that combine features for final classification. The architecture dramatically reduces parameters through weight sharing.

CNNs achieve translation invariance – they detect features regardless of position in the image. A cat in the corner is recognised the same as one in the centre. This property emerges naturally from the sliding filter approach and makes CNNs robust to object placement.

Famous architectures include LeNet (1998), AlexNet (2012) which sparked the deep learning revolution, VGG demonstrating depth matters, ResNet enabling 100+ layer networks with skip connections, and modern EfficientNets balancing accuracy and efficiency. Each advanced our understanding of what makes CNNs effective.

Activation Functions Explained

Activation

Activation Functions: The Key to Non-Linearity

Without activation functions, neural networks would be limited to linear transformations no matter how many layers they have. Activations introduce non-linearity, enabling networks to learn complex patterns like image recognition and language understanding that linear models cannot capture.

ReLU (Rectified Linear Unit) outputs max(0,x) – simple, fast, and surprisingly effective. It has become the default for hidden layers. Sigmoid squashes output to 0-1, useful for binary classification but prone to vanishing gradients. Tanh outputs -1 to 1, zero-centred which sometimes helps training.

For output layers, the choice depends on your task. Sigmoid for binary classification, softmax for multi-class (outputs sum to 1 as probabilities), and linear for regression. Modern variants like GELU and Swish offer slight improvements in specific contexts.

Understanding activations helps diagnose training issues. Dead ReLU neurons that never activate, saturated sigmoids causing vanishing gradients, and numerical instability all relate to activation choice. Experimentation within established guidelines usually yields good results.

Backpropagation Algorithm Explained

Backprop

Backpropagation: The Learning Algorithm

Backpropagation is the algorithm that makes deep learning possible. It efficiently computes how much each weight in a neural network contributed to the prediction error, enabling targeted updates that improve performance. Without backprop, training deep networks would be computationally infeasible.

The algorithm applies the calculus chain rule to propagate error gradients backward through the network. Starting from the output layer, it calculates local gradients then multiplies by downstream gradients to determine each weights contribution to the loss. This recursive computation handles arbitrarily deep networks.

The beauty of backpropagation lies in its efficiency. It computes all gradients in a single backward pass through the network, achieving O(n) complexity. Computing each gradient independently would require O(n squared) forward passes, making training prohibitively slow for modern architectures.

Understanding backprop illuminates common training issues. Vanishing gradients occur when gradients shrink exponentially through layers. Exploding gradients cause instability. Techniques like gradient clipping, proper initialisation, and batch normalisation address these issues while preserving backprops fundamental efficiency.