When you type a prompt and hit enter, you’re not “asking a brain” — you’re running a gigantic math function. That function is defined by hundreds of millions or billions of parameters (the “weights”) stored in a model file. Understanding that one idea clears up a lot of confusion about why LLMs need so much memory, why they’re deterministic at temperature zero (greedy decoding always picks the same next token), and why “thinking” is just more matrix multiplications.
Think of the weights as a fixed recipe. The recipe never changes during one run. The only thing that changes is the input you feed in: the sequence of token IDs representing your text. Each token is a small integer (e.g. 0 to 50,256 in GPT-2, whose vocabulary has 50,257 tokens). The model looks up an “embedding” vector for each token, runs those vectors through dozens of transformer layers (attention and feed-forward steps), and at the end it has a vector of scores — one per possible next token. Sampling from those scores gives the next token; that token gets appended and the process repeats. So “generation” is a loop: embed → transform → score → sample → append → repeat.
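That loop can be sketched in a few lines. This is a toy, not a real model: `forward` here is a stand-in for the transformer (it just scores tokens with a made-up rule), and I use greedy decoding instead of sampling to keep it deterministic. The shape of the loop — one forward pass per new token, append, repeat — is the part that matches the real thing.

```python
VOCAB_SIZE = 8  # toy vocabulary; real models use tens of thousands of tokens

def forward(token_ids):
    """Stand-in for the transformer: maps the token sequence so far to one
    score per vocabulary entry. A real model computes these scores with
    attention and feed-forward layers over learned weights."""
    # Toy rule: give the highest score to the token after the last one.
    last = token_ids[-1]
    return [1.0 if t == (last + 1) % VOCAB_SIZE else 0.0
            for t in range(VOCAB_SIZE)]

def generate(prompt_ids, n_new):
    ids = list(prompt_ids)
    for _ in range(n_new):
        scores = forward(ids)  # one full forward pass per new token
        next_id = max(range(VOCAB_SIZE), key=lambda t: scores[t])  # greedy pick
        ids.append(next_id)    # append and repeat
    return ids

print(generate([3], 4))  # → [3, 4, 5, 6, 7]
```

Swapping the greedy `max` for a temperature-weighted random choice over the scores is all it takes to get the familiar sampling behavior.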
Under the hood, each layer is doing the same kind of linear algebra: matrix multiplies, softmax for attention, layer norms. The weights are just big matrices. Inference is a single forward pass per new token, with no gradient computation. That’s why it’s cheaper than training, but still compute-heavy: roughly every weight participates in one multiply-add per generated token, so a 7B model costs on the order of billions of multiply-adds per token.
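The per-token cost follows from a common rule of thumb (an approximation, not an exact count): each parameter contributes about one multiply and one add per generated token, so FLOPs per token ≈ 2 × parameter count.

```python
# Rule-of-thumb estimate: ~2 FLOPs (one multiply, one add) per parameter
# per generated token. Ignores attention's sequence-length-dependent cost.
params = 7e9  # a "7B" model
flops_per_token = 2 * params
print(f"{flops_per_token:.1e} FLOPs per token")

# Sustained compute needed at a given generation speed:
tokens_per_sec = 50
print(f"{flops_per_token * tokens_per_sec / 1e12:.2f} TFLOP/s at {tokens_per_sec} tok/s")
```

Even at a modest 50 tokens/s, that’s a sustained compute rate no CPU core comes close to — which is why inference lives on GPUs.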
The practical implications are everywhere. Bigger models need more VRAM because the weight matrices must fit in memory. Quantization helps by storing weights in lower precision (e.g. 4-bit instead of 16-bit), which trades a bit of quality for a lot less memory and faster math. Serving at scale is mostly about batching these forward passes and keeping the GPU busy.
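The memory arithmetic behind those VRAM requirements is simple enough to do inline. A minimal sketch, counting only the weights (activations and the KV cache add more on top):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weight matrices alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # a 7B-parameter model
print(f"fp16:  {weight_memory_gb(n, 16):.1f} GB")  # 14.0 GB
print(f"4-bit: {weight_memory_gb(n, 4):.1f} GB")   # 3.5 GB
```

That 4x drop is exactly why a quantized 7B model fits on a consumer GPU while the fp16 version doesn’t.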
We’re still in the early phase of making this pipeline efficient: better kernels, smarter batching, and distillation will keep pushing what’s possible on a single machine or a small cluster.
nJoy 😉
