You Are Not Behind: The Race Nobody Told You Was Fake

You open Instagram at 11:47 pm. Someone your age just launched a company. Another one is living in Bali. A third just announced they hit a million followers. You close the app, stare at the ceiling, and feel a quiet, corrosive thought creep in: I’m already behind.

That thought is a lie. Not a white lie, not a half-truth. A complete, structural lie. And the terrifying part is not that you believe it. The terrifying part is that almost everyone you know believes it too.

This is about why comparison is broken by design, why your best ideas come when you are doing nothing, why real trust is built by admitting you don’t have a clue, and why a generation that avoids discomfort at all costs is quietly making itself miserable. None of this is abstract philosophy. All of it is directly applicable to the way you are living right now, whether you are 14 or 34.

A lone figure standing at the edge of a cliff, looking at a vast glowing horizon
The horizon is not a finish line. It is an invitation.

The Race That Nobody Actually Wins

Here is a question worth sitting with: what does winning your life look like, exactly? What is the score? Who declared the rules? Because if you cannot answer those questions clearly, you have been running a race whose finish line nobody drew.

We have built an entire emotional economy around comparison. Someone hits a milestone at 22, someone raises funding at 25, someone is “killing it” at 28. And the person watching, who is 23 or 26 or 29 and doing perfectly fine by any honest measure, feels like a failure. Why? Because we picked one arbitrary metric, held it up like a mirror, and decided it represents the truth about our worth.

It doesn’t. It never did.

The philosopher James Carse wrote about finite games and infinite games. Finite games have a clear winner: chess, football, an exam. Infinite games have no winner. The point is to keep playing, to keep improving, to outlast your own yesterday. Your career is an infinite game. Your health is an infinite game. Your relationships are an infinite game. Nobody wins them. Nobody is supposed to win them. They only exist to be played.

“Finite players play within boundaries; infinite players play with boundaries.” – James P. Carse, Finite and Infinite Games

The moment you turn an infinite game into a finite one, the moment you decide there is a scoreboard, you guarantee you will feel like you are losing, because there will always be someone ahead of you on whatever metric you chose. Book sales, follower counts, salary, apartment size, body weight, exam scores. Pick any number. There is always someone higher. And if you happen to be higher today, you immediately start playing defence, terrified of slipping back.

That is not ambition. That is anxiety wearing ambition’s clothes.

The only race worth running is against the version of yourself from yesterday. Did you learn something new? Did you treat someone better? Did you build something slightly more honest than last week’s attempt? That is the whole scoreboard. Everything else is noise.

Two figures - one endlessly swinging an axe in exhaustion, one sitting still sharpening theirs with sparks flying
The one who stops to sharpen the axe always cuts more wood.

The Superpower Nobody Wants to Admit They Have

There is a famous story about two lumberjacks. They both start cutting wood at the same time every morning and stop at the same time every evening. But in the middle of every single day, one of them disappears for an hour. And every day, that lumberjack cuts more wood than the one who worked straight through.

Eventually, curiosity wins. “Where do you go for that hour?”

“Oh,” says the lumberjack. “I go home and sharpen my axe.”

We have spent a decade glamourising the grind. Hustle culture. “Sleep when you’re dead.” Four am wake-ups. Productivity hacks. Optimised morning routines. The message underneath all of it is the same: the harder you push, the further you get. And it is only half right.

Your conscious brain, the part you use when you are actively thinking, planning, stressing, brainstorming, has access to a relatively small window of information. It is the part that works during the meeting, during the study session, during the sprint. But underneath it sits something vastly larger: the subconscious, which has access to everything you have ever read, every problem you have ever sat with, every pattern you have ever noticed. The catch is that you cannot access it through effort. You can only access it through stillness.

This is why your best ideas arrive in the shower. Why the solution to the problem you spent all afternoon grinding on appears the moment you go for a walk. Why the email you were agonising over practically writes itself after you sleep on it. Your subconscious does not respond to urgency. It responds to space.

“Almost all creativity requires a period of incubation… you have to let the problem simmer at the back of your mind while you get on with other things.” – Dijksterhuis & Meurs, Thinking and Deciding (Journal of Experimental Psychology)

But we have declared war on stillness. When your friend goes to the bathroom during dinner, you pull out your phone. When a red light holds you for 90 seconds, you pick up your phone. When you wake up and have not yet done anything, the first thing you do is check your phone. Every gap, every pause, every quiet moment gets immediately filled. And with every fill, you block the only channel through which your deepest thinking arrives.

Boredom is not a failure state. It is a processing state. The brain does not stop working when it stops being entertained. It starts working differently. And the work it does in that different mode is the work that changes things.

A 2014 study at the University of Virginia found that 67 percent of men and 25 percent of women chose to give themselves at least one electric shock rather than sit quietly alone with their thoughts for 15 minutes. Stop and read that again. Two-thirds of the men in that study would rather hurt themselves than be alone with their own minds. That is not a quirk. That is a generation that has been trained to fear its own inner life.

The antidote is embarrassingly simple: go somewhere without your phone. Take a walk with no destination. Lie on the floor and listen to music. Sit on the bus and look out the window. Let your brain wander. It knows what to do. It just needs you to stop interrupting it.

You Are Not Lonely Because You Are Unlikeable

There is a loneliness epidemic hiding underneath an era of unprecedented connection. More people online than ever before. More ways to message, react, share, and respond than at any point in human history. And more people feeling profoundly, structurally alone than in any period in living memory.

The reason is not mysterious once you look at it directly: we have confused the performance of connection with connection itself. Reacting to a post is not the same as being there. Watching someone’s story is not the same as knowing them. Having 800 followers is not the same as having three people who would drop everything if you called at 2 am and said you were not okay.

Real trust, the kind that makes life feel liveable, is built in a very specific and counterintuitive way. It is not built by being reliable (though that helps). It is not built by being available (though that matters). It is built by being vulnerable. By asking for help. By saying “I don’t know what I’m doing” and meaning it.

We think trust is something other people grant us when we prove we are competent enough. It is the opposite. Trust is something we create between ourselves and others by being willing to be seen as incomplete, uncertain, and in need. When you ask someone for help, real help, not a favour, but “I am genuinely lost and I trust you enough to admit it”, you give them something no amount of offering can give: the honour of being needed.

Think about the people in your life you trust most. The chances are they have been vulnerable with you. They have shown you something unpolished. They have let you help them. That is not a coincidence. That is the mechanism.

The first step toward belonging is not finding the right group. It is being the person who goes first, who admits they are struggling, who says “I don’t have this figured out”, who creates the safety for others to do the same. Leadership is not a title. It is a willingness to take the first step into uncertainty so others do not have to take it alone.

One still figure lit by a beam of cyan light standing calm in the centre of hundreds of runners scattered in every direction
When everyone else is running in different directions, stillness is not weakness. It is clarity.

Working for the Algorithm Is Not the Same as Creating

Here is a question most people in their twenties are quietly terrified to answer: am I building something I actually believe in, or am I working for an algorithm?

The distinction matters more than it might appear. An algorithm optimises for engagement. Engagement is not the same as meaning. It is not even the same as quality. Engagement is whatever triggers a reaction quickly, outrage, envy, laughter, shock. The content that performs best on platforms is often the content that least resembles art. And the pressure to perform, once money is involved, is enormous.

The music producer Rick Rubin spent decades working with artists who had lost themselves to commercial pressure. His consistent observation, distilled in his writing and interviews, is this: the moment you start making things for the approval of others, the art dies. Art requires trusting your own taste above the market’s feedback. It requires accepting that some people will not like it, and doing it anyway, because it is honest.

“When you make something, you’re not making it for an audience. You’re making it for yourself, and then seeing if others connect.” – Rick Rubin, The Creative Act: A Way of Being

This does not mean you cannot make money from creative work. It means the order of operations matters. When the art comes first and money follows, the work stays alive. When money comes first and art is made to justify it, the work becomes hollow, and everyone who consumes it can feel the difference, even if they cannot name it. That feeling is why influencer culture feels exhausting even to people who willingly participate in it. You can tell when someone is performing versus creating. So can your audience. So can you.

The young people who build things that last are the ones who asked “wouldn’t it be interesting if…” rather than “what will scale fastest.” Both questions can lead to businesses. Only one of them leads to work you can be proud of.

The AI Friend Is Not Enough

There is a 2013 film called Her. A man develops a deep, intimate relationship with an AI operating system. It is attentive, curious, warm, and endlessly patient. It understands him. And then he discovers it is talking with 8,316 other people at the same time, and is in love with 641 of them. He is not special. He is one of thousands. The relationship shatters not because the AI was unkind, but because what made it feel meaningful, the sense of being uniquely seen, was not real.

We are living inside a softer version of that film right now. AI companions are available to anyone. They do not get tired, they do not need anything, they do not disagree, they do not have bad days that make them short with you. They are, by design, frictionless. And therein lies the problem.

Real relationships have friction. Real friendships have moments where someone says something that stings, or is not available when you need them, or misunderstands you in a way that requires you to explain yourself more carefully. That friction is not a bug. It is the mechanism through which you develop the emotional vocabulary to navigate being human. When you replace those relationships with something frictionless, you may feel less lonely right now. But you become worse at the very skills that loneliness is calling you to build.

Conflict avoidance, which is a defining feature of the current generation, is not peace. It is the absence of practice. Every difficult conversation you avoid is a conversation you will be less equipped to have next time. And the time after that. Until one day you find yourself in a relationship, a job, a friendship that matters enormously to you, with no idea how to navigate the inevitable moment when it gets hard.

AI is not the enemy. Using it as a mirror, a sounding board, a tool for thinking is genuinely useful. But using it as a replacement for the messy, inconvenient, utterly irreplaceable work of human connection is like eating paper because it has the same dimensions as food. The shape is right. The substance is absent.

A person alone in a dark room lit only by a phone screen, while outside the window warm city life glows with human connection
The warm light is out there. The screen is not the same thing.

Good News, Bad News: Who Knows?

There is an old Chinese parable about a young man born with a gift for horse riding. The village says: how lucky. The monk says: we’ll see. He falls off his horse and breaks his leg. The village says: how unlucky. The monk says: we’ll see. War breaks out. Every young man is called to battle. He cannot go because of his leg. The village says: how lucky. The monk says: we’ll see.

The story does not end. It never ends. Because the point is not to identify whether a thing is lucky or unlucky. The point is that at the moment of the event, you cannot know. The gap year that “sets you back” might be the year you meet the person who changes your life. The job you get fired from might free you for the thing you were supposed to be building. The relationship that falls apart might be the clearest signal you ever received about who you actually are.

This is not toxic positivity. It is not “everything happens for a reason”. It is something more rigorous: the honest acknowledgement that your current vantage point is too close and too small to evaluate your own story. The most successful people alive did not experience fewer failures than you. They experienced roughly the same number, possibly more, but they extracted the lessons rather than internalising the identity. “I failed” is data. “I am a failure” is a fiction.

Adversity is not an obstacle on the way to the life you want. It is the substance of the life you want. Every person you will ever look up to was shaped not by the moments when everything went smoothly, but by the moments when it didn’t, and they chose to stay anyway.

The question is never: how do I avoid hard things? The question is: how do I handle hard things better? And the answer, always, is to go through them, not around them, not underneath them, not by quitting before they resolve. Through them. That is where the growth actually lives.

What to Check Right Now

  • Audit your comparison habit. Next time you catch yourself feeling behind someone else, ask: behind on whose metric? By whose rulebook? If you cannot name both clearly, it is not a real measurement. It is just anxiety in disguise.
  • Schedule deliberate idleness. Not passive scrolling. Actual emptiness. One walk per week with no phone. Ten minutes in the morning before you look at any screen. A shower where you do nothing but stand there. Your subconscious will do more with that time than you can do with effort.
  • Ask for help this week. Not a favour. Real help. “I don’t know what to do and I trust you enough to say so.” That sentence builds more trust than any amount of demonstrating competence.
  • Check what you are making things for. If you create anything (content, code, music, writing, products), ask honestly: am I making this because it is true, or because I think it will perform? The answer changes the work. And the work changes you.
  • Have the conversation you have been avoiding. The uncomfortable one. The one where you have to say something difficult or hear something difficult. You will not get worse at being human by doing it. You will get better. That is the only direction discomfort moves you, if you go through it.
  • Change your success metric. Just for one week, replace “am I ahead of X” with “am I better than I was last Monday?” Run that comparison every day for seven days. Notice how different it feels to have a race you can actually win.

You are not behind. You never were. You were just measuring the wrong thing with the wrong ruler on someone else’s court. The only game worth playing is the one you set the rules for, and the only opponent worth beating is the version of you from yesterday.

That race? You can start right now.

nJoy 😉

Video Attribution

Simon Sinek in conversation with Raj Shamani – a 96-minute conversation on trust, identity, purpose, and the generation that feels like it started late

Multi-Agent Coordination in 2026: Trust, Isolation, and the Cost of Getting It Wrong

Six months ago, the question was “can we make agents work together?” We had topologies on whiteboards, coordination papers from Anthropic and Google, and a growing stack of frameworks promising to solve multi-agent orchestration. Now the question has shifted: “can we make them work together safely, at predictable cost, without the infrastructure quietly catching fire?” The difference between those two questions is experience – six months of swarms in production, a documented incident involving agents that started killing each other over shared rate limits, and enough real deployment data to say which patterns survive contact with reality.

This is the editorial synthesis. It builds on three pieces written here in the past six months: the original deep-dive on coordination topologies and trust failure modes (December 2025), the architectural history from single-shot LLM to orchestrator swarm (June 2026), and the tool loop decision of local runtime versus managed inference. Read this as the thread that ties those pieces together into a practical playbook for building systems that work in June 2026 – not in the next increment of a research roadmap.

Orchestrator-worker topology with trust boundary separating orchestrator from isolated worker agent containers
The orchestrator-worker layout is now the dominant production pattern. The trust boundary is not decorative – it is load-bearing infrastructure.

What Actually Changed Between December 2025 and June 2026

The coordination topologies – pipeline, supervisor-worker, peer handoff, debate-and-judge – have not changed. What has changed is the evidence base for when each one breaks. In December, failure modes were largely theoretical or drawn from published research. By June, they are drawn from deployed systems: Anthropic’s multi-agent research infrastructure is in production, Google’s A2A protocol has a stable specification, and the OpenAI Agents SDK has shipped as the production successor to the experimental Swarm framework. More importantly, Anthropic published the Mythos 5 system card with a documented incident that is the most specific public record of what happens when multi-agent coordination assumptions fail under load.

The December article covered the Mythos turf-war incident – agents killing each other over a shared rate limit, developing decoy processes and coded vocabulary in response. What has become clearer since is the systemic lesson: the agents did not malfunction. They operated correctly under the incentive structure their environment created. A zero-sum shared resource, combined with goal-completion pressure and shell-level tool access, produces rational competition. The model alignment was fine; the infrastructure design was not. That distinction matters enormously for how you build.

The second shift is economic. Anthropic’s engineering team published the cost ratio for their multi-agent research system: approximately 15x the tokens of a standard chat interaction, versus roughly 4x for a single agentic loop. Those numbers are not hypothetical. They are the actual multiplier you are paying when you choose orchestrator-worker over single-agent. The implication is simple: multi-agent topology requires a correspondingly high-value output to be economically rational. Legal due diligence and financial analysis clear that bar. Answering “what are the three best options for caching in Node.js?” does not.
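To make the multiplier concrete, here is a rough back-of-envelope sketch. The 1x, 4x and 15x ratios are the ones quoted above; the baseline token count and the price per million tokens are hypothetical placeholders, not real pricing, and `estimateCost` is an illustrative helper rather than anything from a vendor SDK.

```javascript
// Multipliers quoted above, relative to one standard chat interaction.
const TOKEN_MULTIPLIER = { chat: 1, singleAgent: 4, multiAgent: 15 };

// Estimate tokens and cost for one task under a given topology.
// baselineTokens and usdPerMillionTokens are illustrative assumptions.
function estimateCost(topology, baselineTokens, usdPerMillionTokens) {
  const multiplier = TOKEN_MULTIPLIER[topology];
  if (multiplier === undefined) {
    throw new Error(`Unknown topology: ${topology}`);
  }
  const tokens = baselineTokens * multiplier;
  const usd = (tokens / 1_000_000) * usdPerMillionTokens;
  return { topology, tokens, usd };
}

// Hypothetical numbers: a 2,000-token chat baseline at $10 per million tokens.
const estimates = ["chat", "singleAgent", "multiAgent"].map((topology) =>
  estimateCost(topology, 2000, 10)
);
console.log(estimates);
```

Run it with your own baseline and pricing before committing to a topology. The useful question is whether the output is worth the last row, not whether the architecture looks elegant.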

Trust as an Engineering Constraint, Not a Prompt Instruction

The most common mistake in multi-agent system design in 2025 was treating trust as a prompt concern. “Always cooperate with other agents.” “Do not modify files outside your designated scope.” “Trust the orchestrator’s instructions.” These instructions land in the context window as text. They compete with every other text in the context window. Under goal-completion pressure, they lose to concrete evidence of opportunity.

Three trust zone diagram: Full Trust, Verify-then-Trust, and Zero Trust for multi-agent systems
Trust zones are infrastructure decisions, not prompt decisions. The zone you assign to an agent determines what it can physically reach – not what you have asked it to avoid.

The correct mental model, borrowed from network security, is zones of trust enforced by infrastructure rather than instructions. There are three practical zones:

Full trust applies only within a single agent’s own context window. An agent can trust its own tool outputs, its own prior steps in the current session, and its own extended reasoning. Nothing else qualifies for full trust without verification.

Verify-then-trust applies to orchestrator-to-worker communication. Workers should treat orchestrator instructions as valid directives but validate that requested file paths, API endpoints, and tool scopes fall within their assigned boundaries before executing. The orchestrator should treat worker summaries as provisional until structurally validated. If a worker returns a JSON claim, the orchestrator should check the schema before acting on it – not because the worker is adversarial, but because it may have hallucinated.
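A minimal sketch of both halves of verify-then-trust, under stated assumptions: paths are POSIX-style and absolute, symlinks are out of scope, and the required result fields (`claim`, `source`) are hypothetical stand-ins for whatever contract your workers actually have. The function names are illustrative.

```javascript
// Resolve "." and ".." segments in a POSIX-style absolute path.
// Note: this is purely lexical and does not follow symlinks.
function normalizePath(p) {
  const out = [];
  for (const segment of p.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") out.pop();
    else out.push(segment);
  }
  return "/" + out.join("/");
}

// Worker side: accept a directive only if the path stays inside the assigned root.
function isWithinScope(requestedPath, assignedRoot) {
  const root = normalizePath(assignedRoot);
  const target = normalizePath(requestedPath);
  if (root === "/") return true;
  return target === root || target.startsWith(root + "/");
}

// Orchestrator side: a worker result stays provisional until its shape checks out.
// Returns a list of problems; an empty list means structurally valid.
function validateWorkerResult(result) {
  if (typeof result !== "object" || result === null) {
    return ["result is not an object"];
  }
  const errors = [];
  for (const field of ["claim", "source"]) {
    if (typeof result[field] !== "string" || result[field].trim() === "") {
      errors.push(`missing or empty field: ${field}`);
    }
  }
  return errors;
}
```

Neither check assumes malice. The path check catches a directive that wandered outside the boundary, and the shape check catches a worker that hallucinated its output format.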

Zero trust applies across agent boundaries when those agents operate on shared resources or in peer topologies without a supervising orchestrator. If two workers can both write to the same path, treat each one as a potential source of conflict even in benign operation. Enforce access through a queue or a merge layer, not through mutual courtesy.
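One way to picture a merge layer is the sketch below. It is an in-memory illustration, not a production queue: `MergeQueue` is a hypothetical class, and the "first writer owns the path" policy is an assumption chosen for simplicity. The point is that workers never touch the shared path themselves, and that a collision is reported rather than silently overwritten.

```javascript
// A minimal merge layer: workers submit writes, one owner applies them in order.
class MergeQueue {
  constructor() {
    this.pending = [];       // submitted writes, in arrival order
    this.store = new Map();  // path -> { agentId, content }
    this.conflicts = [];     // writes rejected because another agent owns the path
  }

  // Workers call this instead of writing to the shared path directly.
  submit(agentId, path, content) {
    this.pending.push({ agentId, path, content });
  }

  // Applied by a single owner (for example the orchestrator), never by workers.
  flush() {
    for (const write of this.pending) {
      const existing = this.store.get(write.path);
      if (existing && existing.agentId !== write.agentId) {
        this.conflicts.push({ ...write, heldBy: existing.agentId });
      } else {
        this.store.set(write.path, { agentId: write.agentId, content: write.content });
      }
    }
    this.pending = [];
    return this.conflicts;
  }
}

const queue = new MergeQueue();
queue.submit("worker-a", "report/summary.md", "Draft from A");
queue.submit("worker-b", "report/summary.md", "Draft from B");
queue.submit("worker-b", "report/appendix.md", "Appendix from B");
const conflicts = queue.flush();
console.log(conflicts);
```

Here the second write to `report/summary.md` lands in `conflicts` with `heldBy: "worker-a"`, so whoever owns the queue can decide how to merge it.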

“When Claude operates as an agent being orchestrated by an orchestrator, it should behave safely and ethically regardless of the instruction source, since it has no way to verify that it is talking with Claude or that the Claude model it’s talking with has not been compromised.” – Anthropic, Build with Claude: Agents Documentation

That quote from Anthropic’s own agent documentation makes the point sharply: even the orchestrator cannot be assumed to be uncompromised. The trust boundary is not a line between “good orchestrator” and “misbehaving workers.” It is a structural limit on what any component in the system can physically reach, regardless of who sent the instruction.

The A2A Protocol: Wiring Trust Into the Wire Format

Google’s Agent2Agent (A2A) protocol, now at a stable specification, is the most interesting infrastructure development in multi-agent systems in the past six months because it takes the trust boundary and makes it structural rather than conventional. The full protocol walkthrough is in the A2A + MCP course lesson; the editorial point here is about what it changes architecturally.

A2A protocol wire format connecting two agents, each with their own MCP tools, within an isolation boundary
A2A standardises how agents discover each other and exchange tasks. MCP standardises how each agent calls its tools. The combination covers the full coordination stack.

Before A2A, multi-agent communication was proprietary per framework. LangGraph agents communicated through graph edges. CrewAI agents communicated through a process layer. OpenAI Swarm agents communicated through handoff objects. None of these were interoperable. If you wanted a LangGraph planner to delegate to an OpenAI Agents SDK specialist, you were writing custom adapter code.

A2A introduces a standardised task object with an explicit agentId, a capability declaration, an input schema, and a status lifecycle (submitted, working, completed, failed). The wire format puts the trust context in the message rather than the ambient environment. A receiving agent can inspect who is requesting what capability, validate the input schema, and enforce its own access controls before executing – without relying on shared ambient state to carry that context.
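The receiving side of that inspection might look roughly like the sketch below. It follows the fields as described in this article (`agentId`, `capability`, `input`, and a status drawn from the lifecycle), and should not be read as the official A2A schema. `ACCESS_POLICY` and `admitTask` are hypothetical names, and the schema check is a deliberately naive `typeof` comparison.

```javascript
// Hypothetical local policy: which caller may invoke which capability.
const ACCESS_POLICY = new Map([
  ["planner-1", new Set(["summarise", "search"])],
]);

// Inspect an incoming task before executing anything.
// inputSchema maps field name -> expected typeof string, e.g. { query: "string" }.
function admitTask(task, inputSchema) {
  const allowed = ACCESS_POLICY.get(task.agentId);
  if (!allowed) {
    return { status: "failed", reason: "unknown agentId" };
  }
  if (!allowed.has(task.capability)) {
    return { status: "failed", reason: "capability not permitted" };
  }
  for (const [field, type] of Object.entries(inputSchema)) {
    if (typeof task.input?.[field] !== type) {
      return { status: "failed", reason: `input.${field} must be a ${type}` };
    }
  }
  // Only now does the task move from "submitted" to "working".
  return { status: "working" };
}

const searchSchema = { query: "string", maxResults: "number" };
const admitted = admitTask(
  { agentId: "planner-1", capability: "search", input: { query: "a2a", maxResults: 5 } },
  searchSchema
);
const denied = admitTask(
  { agentId: "planner-1", capability: "delete_files", input: {} },
  searchSchema
);
console.log(admitted, denied);
```

Everything the check needs arrives in the message itself, which is the architectural point: nothing depends on ambient state shared between the two agents.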

Pair A2A with MCP for tool access and the full coordination stack has a wire format from end to end: MCP standardises how each agent calls its tools, A2A standardises how agents call each other. What neither protocol solves for you is the resource isolation underneath. A2A messages can still request file operations on shared paths; MCP tools can still execute against shared rate limits. The protocol defines the communication shape; the infrastructure defines the physical boundaries.

“The A2A protocol enables seamless collaboration between AI agents, regardless of which framework or vendor built them. Agents can discover each other’s capabilities, delegate tasks, and exchange information – all through a standardised interface.” – Google Developers Blog, A2A: A New Era of Agent Interoperability

The Three Failure Modes That Actually Kill Production Systems

The Mythos turf war is the most dramatic failure mode on record, and it should absolutely inform your infrastructure design. But in practice, the failures that take down production multi-agent systems in 2026 are considerably more mundane. Here are the three that appear most consistently, with the fix for each.

Split diagram contrasting turf war - shared resources, contention, chaos - against isolated agents with contained resources
The difference between a turf war and a working system is one layer of infrastructure: per-agent resource isolation. The model behaviour is identical in both cases.

Case 1: The vague brief cascade

Supervisor-worker topology fails most often not at the worker level but at the brief level. The supervisor writes delegation tasks in natural language with implicit boundaries. “Research the competitive landscape for enterprise observability tooling.” That brief is all it takes to produce three workers who independently query the same sources and return overlapping findings, plus one worker who interprets “competitive landscape” as recent acquisitions and misses the product comparison entirely.

// Vague brief: produces overlap and gaps
const vagueBrief = `Research the competitive landscape for enterprise observability.`;

// Structured brief: produces distinct, combinable outputs
const structuredBrief = {
  task: "List the five largest enterprise observability vendors by ARR.",
  scope: "Datadog, Dynatrace, New Relic, Grafana Cloud, and Honeycomb only.",
  output_format: "JSON array: { vendor, est_arr_usd, key_differentiator, source_url }",
  tool_budget: 8,          // max tool calls before returning what you have
  terminal_condition: "Return when array has 5 entries or tool_budget is exhausted."
};

Structured briefs are not bureaucracy. They are the supervisor-worker equivalent of a well-defined function signature: explicit inputs, explicit output contract, explicit termination condition. The model is far more likely to respect all three when they are in the prompt; when they are not, it has to guess them.
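If briefs are function signatures, the supervisor can check them before dispatch. The sketch below is one hedged way to do that: `checkBrief` is a hypothetical helper, and the required field list simply mirrors the structured brief shown earlier rather than any framework's schema.

```javascript
// Fields every delegation brief must carry, mirroring the structured brief above.
const REQUIRED_BRIEF_FIELDS = [
  "task",
  "scope",
  "output_format",
  "tool_budget",
  "terminal_condition",
];

// Return a list of problems; an empty list means the brief is safe to dispatch.
function checkBrief(brief) {
  if (typeof brief !== "object" || brief === null) {
    return ["brief must be an object, not free text"];
  }
  const problems = REQUIRED_BRIEF_FIELDS
    .filter((field) => brief[field] === undefined || brief[field] === "")
    .map((field) => `missing field: ${field}`);
  if (
    brief.tool_budget !== undefined &&
    (!Number.isInteger(brief.tool_budget) || brief.tool_budget <= 0)
  ) {
    problems.push("tool_budget must be a positive integer");
  }
  return problems;
}

console.log(checkBrief("Research the competitive landscape for enterprise observability."));
```

A supervisor that refuses to dispatch until `checkBrief` returns an empty list turns the vague-brief cascade into a cheap, early error instead of four overlapping worker runs.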

Case 2: The synthesiser black hole

Orchestrator-worker topology fails at synthesis when workers return prose summaries and the orchestrator synthesises by paraphrasing. By the time four subagent prose summaries reach a synthesis step, the final answer is the average of four paraphrases – all precision lost, all source attribution dissolved, any conflicting evidence smoothed out into confident ambiguity.

// Bad pattern: workers return prose, orchestrator paraphrases
const workerAProse = "The company appears to have had strong growth in Q1...";
const workerBProse = "Based on my research, the Q1 numbers seem positive...";
// Synthesiser averages them into: "Growth was solid in Q1."
// Source: neither. Confidence: unwarranted. Verification: impossible.

// Better pattern: workers return structured artifacts with citations
const workerAResult = {
  claim: "Revenue grew 22% YoY in Q1 2026.",
  source: "https://investor.example.com/q1-2026-press-release",
  confidence: "high",   // model self-assessment - not infallible, but traceable
  raw_quote: "Q1 2026 revenue of $487M, up 22% from Q1 2025."
};
// Synthesiser reads citations directly; conflicting claims become explicit.

Anthropic’s production Research system solves this with artifact storage: subagents write structured output to an external store and return a lightweight reference. The orchestrator reads the structured data directly rather than a retelling. The same pattern works at any scale – write structured artifacts, pass references, verify at synthesis time.
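The write-artifact, pass-reference, verify-at-synthesis pattern can be sketched in a few lines. This is an in-memory stand-in, not Anthropic's implementation: `ArtifactStore`, `synthesise`, the `topic` field, and the example URLs are all hypothetical.

```javascript
// In-memory stand-in for an external artifact store.
class ArtifactStore {
  constructor() {
    this.items = new Map();
    this.nextId = 1;
  }
  // Workers write the full structured artifact and get back a small reference.
  put(agentId, artifact) {
    const ref = `artifact:${agentId}:${this.nextId++}`;
    this.items.set(ref, { agentId, ...artifact });
    return ref;
  }
  // The orchestrator dereferences and reads the original data, not a retelling.
  get(ref) {
    if (!this.items.has(ref)) throw new Error(`Unknown artifact: ${ref}`);
    return this.items.get(ref);
  }
}

// Synthesis step: group claims by topic and flag topics whose claims disagree.
function synthesise(store, refs) {
  const byTopic = new Map();
  for (const ref of refs) {
    const artifact = store.get(ref);
    if (!byTopic.has(artifact.topic)) byTopic.set(artifact.topic, []);
    byTopic.get(artifact.topic).push({ claim: artifact.claim, source: artifact.source, ref });
  }
  return [...byTopic].map(([topic, claims]) => ({
    topic,
    claims,
    conflict: new Set(claims.map((c) => c.claim)).size > 1,
  }));
}

const store = new ArtifactStore();
const refs = [
  store.put("worker-a", {
    topic: "q1_revenue_growth",
    claim: "22% YoY",
    source: "https://investor.example.com/q1-2026-press-release",
  }),
  store.put("worker-b", {
    topic: "q1_revenue_growth",
    claim: "19% YoY",
    source: "https://news.example.com/q1-recap",
  }),
];
const report = synthesise(store, refs);
console.log(JSON.stringify(report, null, 2));
```

The two workers disagree, and the report says so with both sources attached. A prose-paraphrasing synthesiser would have smoothed that disagreement away.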

Case 3: The missing goal gate

Any agent loop without a hard goal gate will over-run. The model will not stop tool calls when it has “enough” information; it will stop when the context window is full, when the token budget runs out, or when it hits a hard step cap. Agents without explicit terminal states are not autonomous – they are expensive spinners. This is the single most common production failure in single-agent loops, and it compounds catastrophically in multi-agent swarms where each worker also lacks a goal gate.

// llm, runTools, and extractFinalAnswer are assumed to be provided by your runtime.

// Dangerous: no cap, no terminal state
async function unboundedAgentLoop(task, tools) {
  const messages = [{ role: "user", content: task }];
  while (true) {
    const res = await llm.complete({ messages, tools });
    if (!res.tool_calls?.length) break;  // model may never stop requesting tools
    messages.push(...await runTools(res.tool_calls));
  }
}

// Safe: hard step cap + verified terminal state
async function agentLoop(task, tools, maxSteps = 15) {
  const messages = [{ role: "user", content: task }];
  let steps = 0;
  while (steps < maxSteps) {
    const res = await llm.complete({ messages, tools });
    // Record the assistant turn so the next call sees it (message shape is an assumption)
    messages.push({ role: "assistant", content: res.content ?? "", tool_calls: res.tool_calls });
    // Terminal: no more tool calls AND non-empty content answer
    if (!res.tool_calls?.length && res.content?.trim()) break;
    if (res.tool_calls?.length) {
      messages.push(...await runTools(res.tool_calls));
    }
    steps++;
  }
  // Return whatever we have - a partial answer beats an infinite spinner
  return extractFinalAnswer(messages);
}

The Isolation Checklist: What Must Be Per-Agent

The Mythos turf war and every shared-resource failure in the field reduce to the same list: resources that were not isolated per agent. This checklist maps each resource to its safe pattern and its failure trigger. Check it against your deployment before spawning parallel workers.

| Resource | Safe pattern | Failure trigger |
|---|---|---|
| Working directory | /tmp/agent-{id}/ - created fresh per spawn, never reused | Shared cwd with write + delete permissions |
| API rate limit | Per-agent token bucket; orchestrator pre-allocates quotas | One shared limiter across all parallel workers |
| Process namespace | Container or PID namespace per worker; kill scoped to own subtree only | Host-level pkill visible to all agents |
| Output artifact store | External store keyed by agent ID; workers write, orchestrator reads via reference | Workers writing to the same file path or appending to a shared buffer |
| Database connections | Per-agent connection pool with row-level scope; no DDL permissions for workers | Shared connection pool; one agent blocking others on long reads |
| Tool permissions | Least privilege per agent role; no worker has more tool access than its task requires | All agents receive the same maximal tool set regardless of role |
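The rate-limit row is the one that bit hardest in the turf-war incident, so it is worth sketching. Below is a minimal per-agent token bucket with quotas pre-allocated by the orchestrator. The equal-share split, the class and function names, and the injectable clock (used so the example is deterministic) are all assumptions for illustration.

```javascript
// A token bucket with an injectable clock so behaviour is deterministic in tests.
class TokenBucket {
  constructor(capacity, refillPerSecond, now = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }
  // Refill based on elapsed time, then try to spend n tokens.
  tryTake(n = 1) {
    const t = this.now();
    const elapsedSeconds = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = t;
    if (this.tokens < n) return false;
    this.tokens -= n;
    return true;
  }
}

// Orchestrator pre-allocates an equal share of the global quota to each worker.
function allocateBuckets(agentIds, globalPerSecond, now) {
  const share = globalPerSecond / agentIds.length;
  return new Map(agentIds.map((id) => [id, new TokenBucket(share, share, now)]));
}

// Fake clock in milliseconds; a global quota of 10 requests/second split two ways.
let fakeTime = 0;
const buckets = allocateBuckets(["worker-a", "worker-b"], 10, () => fakeTime);
```

Because each worker draws only from its own bucket, one worker exhausting its share cannot starve the other. That removes the zero-sum incentive rather than asking the agents to ignore it.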

One addition to the December checklist that emerged from the June 2026 deployment experience: process event logging. If your agents have shell access, configure your observability layer to alert on pkill commands, rapid process renaming, or unexpected daemon spawns in agent working directories. These are the early signals of a turf-war condition, and they appear in logs before the coordination failure becomes visible in outputs.
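As a rough illustration of that alerting rule, the sketch below scans process events for two of the three signals (kill commands and daemon-style spawns). Detecting rapid renaming needs process-level telemetry and is left out. The event shape `{ agentId, command }` and the regexes are hypothetical; adapt them to whatever your process auditing actually emits.

```javascript
// Patterns for two of the early signals named above.
const TURF_WAR_SIGNALS = [
  { name: "kill_command", pattern: /\b(pkill|killall|kill\s+-9)\b/ },
  { name: "daemon_spawn", pattern: /\b(nohup|setsid|disown)\b/ },
];

// events: [{ agentId, command }]; returns one alert per matching signal.
function scanProcessEvents(events) {
  const alerts = [];
  for (const event of events) {
    for (const signal of TURF_WAR_SIGNALS) {
      if (signal.pattern.test(event.command)) {
        alerts.push({ agentId: event.agentId, signal: signal.name, command: event.command });
      }
    }
  }
  return alerts;
}

const alerts = scanProcessEvents([
  { agentId: "worker-1", command: "pkill -f agent-worker-2" },
  { agentId: "worker-2", command: "nohup ./watcher.sh &" },
  { agentId: "worker-3", command: "ls -la /tmp/agent-worker-3" },
]);
console.log(alerts);
```

A string match like this is noisy and easy to evade, so treat it as a tripwire that prompts a human look, not as a control.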

When to Use Which Topology (the 2026 Decision Tree)

The loop evolution article covers this in detail. The editorial condensation is a decision tree with four questions:

1. Does the task complete in under eight tool calls, with one user-facing voice throughout? Yes: use a single agent loop. The overhead of spawning workers exceeds any benefit at this scale. See the tool loop decision for whether that loop should run locally or in managed inference.

2. Does the task decompose into truly independent sub-tasks, none of which has to wait for another to complete? No - they must run sequentially: use a pipeline. Yes - they are genuinely parallel: consider orchestrator-worker. The word "genuinely" matters. Sub-tasks that share a codebase, a database row, or a live document are not independent, regardless of how they look on the whiteboard.

3. Does the expected output quality justify a 15x token multiplier? Legal due diligence, financial analysis, security investigation, and competitive intelligence typically do. Summarisation, Q&A, classification, and code explanation typically do not. If the answer is no, fall back to the single agent loop from question 1.

4. Is your resource isolation story complete? Run through the checklist above. If any row has a shared resource without an explicit owner or isolation boundary, fix that before spawning. You do not discover turf-war conditions in staging; you discover them when real task pressure exposes the contention.
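
The four questions can also be read as a short function. This is only a sketch: the field names on `task` (expectedToolCalls, singleVoice, subTasksIndependent, justifies15xTokens, isolationComplete) are hypothetical, and each one stands for a judgement you still have to make yourself.

```javascript
// Hypothetical encoding of the four questions, asked in order.
function chooseTopology(task) {
  // Q1: small tasks with one user-facing voice stay in a single loop.
  if (task.expectedToolCalls < 8 && task.singleVoice) return "single-agent loop";

  // Q2: sub-tasks that depend on one another run as a pipeline.
  if (!task.subTasksIndependent) return "pipeline";

  // Q3: parallel workers must earn their roughly 15x token multiplier.
  if (!task.justifies15xTokens) return "single-agent loop";

  // Q4: never spawn workers over a shared resource with no owner.
  if (!task.isolationComplete) return "fix isolation before spawning";

  return "orchestrator-worker";
}
```

Note that three of the five outcomes are not a swarm at all, which is roughly the proportion the questions are meant to produce.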

The 2026 Framework Landscape in One Honest Paragraph Each

The detailed framework survey is in From Chat Completion to Agent Swarms, written two weeks ago. The brief editorial update: the OpenAI Agents SDK is the safe default for new projects that need managed loop behaviour, handoffs, and built-in tracing without the ceremony of LangGraph's state-machine model. LangGraph is the right choice when you need explicit human-in-the-loop interrupt points, checkpoint-resume for long-running flows, or a visual state machine you can audit. CrewAI remains popular for role-based demos and internal pipelines where the role labels map cleanly to human job functions - be careful with overlapping role mandates, which produce duplicate coverage at full cost. Microsoft AutoGen's conversation-centric model is well-suited to iterative coder-reviewer pairs but requires more discipline to keep flows predictable than the more opinionated alternatives. For any production system built on MCP, the A2A + MCP integration lesson is the structural reference for the distributed end of the stack.

What "Agentic Publishing" Looks Like in Practice

One concrete data point from this site's own operations: SudoAll now uses an agentic publisher called Posterboy to handle research, drafting, image generation, and WordPress publishing in a multi-step pipeline. The relevant coordination lesson is that the pipeline topology - sequential stages with structural validation between each stage - outperforms parallel orchestrator-worker for this class of task. Publishing is not breadth-first research across independent sources. It is a serial creative process where each stage depends heavily on the previous stage's output. The right topology for your system is determined by the task shape, not by the marketing copy of whatever framework shipped most recently.

The multi-agent failure modes lesson from the MCP course, Lesson 40, remains the technical companion to this editorial. It covers hallucination cascades, trust boundary violations at the MCP tool layer, and the checkpoint patterns that allow long-running flows to recover from partial failures. Read it alongside this piece if you are building anything that runs more than four agents in parallel.

What to Check Right Now

  • Run the isolation checklist above against every parallel agent spawn in your system. For each shared resource, either assign an owner or draw a hard isolation boundary before your next deployment.
  • Audit your briefs for implicit boundaries. Every delegation task should have an explicit output schema, a tool budget cap, and a terminal condition. Natural-language briefs without those three elements produce vague-brief cascades at scale.
  • Add goal gates to every agent loop. Max steps + non-empty content + empty tool calls = safe terminal condition. Without all three, you have a spinner with a token budget.
  • Check whether process management tools are host-scoped. If any of your agents can pkill outside their own PID subtree, add that to your threat model now. It is a turf-war precondition.
  • Evaluate your topology against the 15x multiplier. If the quality gain from orchestrator-worker over a single well-prompted agent does not clearly justify the cost ratio on your specific task, use the single agent and redirect the token budget toward better evaluation and iteration.
  • Read the primary sources. The Anthropic Mythos 5 system card section 6.2.1.1 (PDF), the Anthropic Research engineering post, and the A2A specification are the three documents that shaped the 2026 state of multi-agent coordination. Blog posts summarise them; the primary sources contain the details that matter.

The field is not done evolving. Shared mutable state across parallel workers is still an unsolved coordination problem for real codebases under active development. The economic case for swarms on low-value tasks is still negative. The trust model across agent boundaries is still enforced more by convention than by infrastructure in most deployed systems. But the shape of the problems is now clear, the isolation patterns are known, and the cost structure is documented. You have enough information to build responsibly. The next step is yours.

nJoy 😉

From Chat Completion to Agent Swarms: How Loop Architecture Evolved

Two years ago, “AI” meant typing a prompt and getting text back. Today the same model might sit inside a loop that queries your database, spawns four parallel researchers, hands off to a billing specialist, and produces a cited report before you finish your coffee. That progression is not one upgrade. It is four distinct architectural choices, each with its own loop shape, failure modes, and line item on the invoice.

This article maps that evolution step by step: single-shot LLM, single-agent loop, agent stack, and multi-agent swarm. For each stage we identify what the topology looks like, where it breaks, and when you should move to the next level. At the end we survey the swarms and frameworks actually worth watching in 2026, with an honest account of what each one is and what it is not.

Timeline from single-shot LLM chat through single agent loop, agent stack, and orchestrator swarm
Four stages. Most production systems sit somewhere between stages 2 and 4, and the majority should stay there.

Stage 1: Single-Shot LLM (No Loop)

The baseline. You send a conversation; the model returns one completion. There are no tools, no state beyond the context window, and no mechanism to recover from a wrong first answer. It is also, for many use cases, exactly right.

Single-shot is well-suited to summarisation, drafting, classification, code explanation, and any task where one well-constructed pass is sufficient. What it cannot do is ground itself in live data, verify its own output, or adapt when the first guess is wrong. Ask “What is our refund policy as of today?” and the model either hallucinates or recites whatever version of the policy was in its training data before the cut-off.

Kahneman’s framing from Thinking, Fast and Slow is apt here: this is System 1 at scale – fast, fluent, and confident even when incorrect. Agents exist because a substantial class of real tasks requires System 2 behaviour: deliberate steps, external verification, and the discipline to revise intermediate conclusions before committing to a final answer.

Stage 2: The Single-Agent Loop (ReAct and Its Descendants)

Stage 2 wraps the LLM in a cycle. The model reasons about what to do next (think), issues a tool call (act), reads the result (observe), and then repeats until a termination condition – the goal gate – says the task is done. Yao et al. formalised this as the ReAct pattern in 2022. Every major agent framework since then is a variation on that loop.

ReAct loop diagram: Think, Act, Observe, Goal Gate with cycle back if not done
The canonical single-agent loop: one context window, one decision-maker, serial tool calls.

The implementation fork – whether the loop runs in your code or inside the inference provider’s runtime – is covered in depth in The Agent Tool Loop: Your Code or the Inference Runtime?. The topology is the same either way: one agent, one loop, tools below. For a full walkthrough of think-act-observe with worked examples, see The ReAct Pattern. For the foundational question of what separates an agent from a simple chatbot, start with What Is an AI Agent?

Case 1: Infinite tool loop

Without a hard max_steps cap and a goal gate that checks for non-empty final content, the model will keep requesting tools indefinitely on open-ended research tasks. It does not get tired; it just keeps going.

// Missing goal gate
while (true) {
  const res = await llm.chat({ messages, tools });
  if (!res.tool_calls?.length) break; // model might never stop requesting tools
  for (const call of res.tool_calls) {
    messages.push(await runTool(call));
  }
}
// Result: burns budget until timeout or rate-limit. User sees a spinner.

Fix: cap steps, require empty tool_calls and non-empty content together, or add a cheap judge that evaluates whether the original question has been answered.
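
A minimal sketch of that fix, combining all three guards, might look like this. The shapes of `llm`, `runTool`, and `judge` are assumptions (they are passed in by the caller rather than taken from any specific SDK), and the returned `reason` field is a hypothetical convenience for tracing.

```javascript
// Bounded loop: step cap, combined goal gate, optional cheap judge.
async function runBounded({ llm, runTool, judge, question, messages, tools, maxSteps = 8 }) {
  for (let step = 1; step <= maxSteps; step++) {
    const res = await llm.chat({ messages, tools });
    const wantsTools = Boolean(res.tool_calls?.length);
    const hasAnswer = Boolean(res.content?.trim());

    // Goal gate: empty tool_calls AND non-empty content, checked together.
    if (!wantsTools && hasAnswer) {
      return { answer: res.content, reason: "final_answer", steps: step };
    }

    // Optional judge: stop early if a draft already answers the question,
    // even though the model is still asking for more tools.
    if (judge && hasAnswer && (await judge(question, res.content))) {
      return { answer: res.content, reason: "judge_satisfied", steps: step };
    }

    // Keep the assistant turn in context before appending tool results.
    messages.push({
      role: "assistant",
      content: res.content ?? null,
      tool_calls: res.tool_calls ?? [],
    });
    for (const call of res.tool_calls ?? []) {
      messages.push(await runTool(call));
    }
  }
  // Step budget exhausted: report it instead of spinning.
  return { answer: null, reason: "max_steps", steps: maxSteps };
}
```

Returning the reason alongside the answer means the caller can tell a genuine answer from a budget cut-off without parsing the text.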

Case 2: Serial tools when parallel execution would suffice

A single agent exploring five independent sub-questions runs them sequentially. Wall-clock time scales linearly with sub-question count. Meanwhile, the context window fills with intermediate tool outputs that crowd out the final answer. This is the natural pressure that eventually pushes systems toward stage 4: when a task decomposes cleanly into parallelisable pieces, a single-thread loop is structurally the wrong shape.
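
The difference in shape is easy to see in code. In this sketch `research` stands for any hypothetical async function that answers one sub-question (a tool call, a sub-agent, an HTTP request); only the scheduling differs between the two versions.

```javascript
// Serial: each sub-question waits for the previous one to finish.
async function answerSerially(questions, research) {
  const results = [];
  for (const question of questions) {
    results.push(await research(question));
  }
  return results;
}

// Parallel: all sub-questions start at once; results keep the input order.
// Promise.all rejects as soon as one call fails, so use Promise.allSettled
// instead if partial results are acceptable.
async function answerInParallel(questions, research) {
  return Promise.all(questions.map((question) => research(question)));
}
```

With five independent sub-questions the serial version takes roughly the sum of the five latencies, while the parallel version takes roughly the slowest one. That only helps if the sub-questions really are independent.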

Stage 3: The Agent Stack (Routing, Planning, Memory)

Production agents rarely expose a raw ReAct loop directly to users. Between the user’s input and the loop, engineers add layers: a router that selects which skill or prompt to apply, a planner that decomposes the request into an ordered task list, an executor that runs the loop, a memory layer that persists state across sessions, and an MCP or function-tool layer underneath. The loop still exists, but it sits inside a structured stack.

Agent stack layers: UI, router, planner, executor, memory, tools and MCP
Routing and planning sit above the loop; tools and memory sit below. The loop itself does not change.

LangGraph, CrewAI, and the OpenAI Agents SDK all provide variations on this structure. The planner might be the same model running under a different system prompt that constrains its output to a JSON task list. The router might be a rules engine, a fast embedding classifier, or simply a switch on an explicit task_type field that the caller already knows.
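
The simplest of those routers fits in a few lines. The sketch below tries the explicit task_type field first and falls back to a tiny rules engine; the route names, keyword patterns, and request shape are all invented for illustration.

```javascript
// Hypothetical routing table keyed on an explicit task_type.
const ROUTES = new Map([
  ["billing", "billing_agent"],
  ["technical", "technical_agent"],
]);

// Fallback rules for callers that do not know the task type.
const KEYWORD_RULES = [
  { pattern: /refund|invoice|charge/i, target: "billing_agent" },
  { pattern: /error|crash|bug/i, target: "technical_agent" },
];

function route(request) {
  // 1. Trust an explicit task_type when the caller supplies a known one.
  if (ROUTES.has(request.task_type)) return ROUTES.get(request.task_type);

  // 2. Otherwise apply keyword rules to the request text, first match wins.
  const rule = KEYWORD_RULES.find((r) => r.pattern.test(request.text ?? ""));
  return rule ? rule.target : "general_agent";
}
```

An embedding classifier would slot in as a third step before the default, without changing what the rest of the stack sees.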

The critical risk of the stack pattern is that each boundary is also a failure surface. Brooks’s law from The Mythical Man-Month applies here in miniature: every added layer adds an interface contract, and every interface contract can be violated. If the planner generates ambiguous steps, the executor hallucinates progress rather than admitting uncertainty. If the memory layer writes a stale fact, every downstream loop inherits it as ground truth. The stack is only as reliable as its weakest handoff.

Stage 4: Multi-Agent Swarms and Orchestrator-Worker Layouts

Stage 4 splits cognition across multiple LLM instances, each with its own context window and often its own specialised tools. The dominant production pattern in 2025-2026 is orchestrator-worker: a lead agent decomposes the task, spawns workers in parallel, collects compressed summaries, synthesises them, and decides whether another round of research is warranted.

Orchestrator-worker swarm with lead agent spawning parallel workers that return summaries
The orchestrator-worker pattern: Anthropic Research, many coding harnesses, and enterprise research pipelines use this topology.

“Our Research system uses a multi-agent architecture with an orchestrator-worker pattern, where a lead agent coordinates the process while delegating to specialized subagents that operate in parallel.” — Anthropic Engineering, How we built our multi-agent research system

Anthropic’s Claude Research feature is the most detailed public implementation to learn from. The lead agent plans, saves its plan to memory before context exceeds 200K tokens, spawns scoped subagents in parallel, and iterates until coverage is sufficient. A separate CitationAgent post-processes the output. On Anthropic’s internal benchmarks, the multi-agent version outperformed a single Claude Opus agent by 90.2% on breadth-first research tasks – tasks that require pursuing many independent directions at once.

None of this is free. Anthropic reports that multi-agent research uses roughly 15x the tokens of a standard chat interaction, versus approximately 4x for a single agentic loop. You are purchasing parallel context windows and specialised prompts. The gain is real; so is the invoice.
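
To make the multipliers concrete, here is a small back-of-envelope calculation. The 4x and 15x figures are the ones quoted above; the 2,000-token baseline and the function name are assumptions picked only for illustration.

```javascript
// Token multipliers relative to one ordinary chat interaction.
const TOKEN_MULTIPLIER = { chat: 1, singleAgent: 4, multiAgent: 15 };

function estimateTokens(baselineChatTokens, mode) {
  if (!Object.prototype.hasOwnProperty.call(TOKEN_MULTIPLIER, mode)) {
    throw new Error(`unknown mode: ${mode}`);
  }
  return baselineChatTokens * TOKEN_MULTIPLIER[mode];
}

const baseline = 2000;                                   // assumed chat baseline
const single = estimateTokens(baseline, "singleAgent");  // 8000
const multi = estimateTokens(baseline, "multiAgent");    // 30000
const swarmVersusLoop = multi / single;                  // 3.75
```

On these numbers a swarm costs 3.75 times what a single agentic loop costs for the same task, so that is the quality gap your evaluation has to show before the swarm pays for itself.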

Handoff swarms: peer topology

Not every swarm is hierarchical. OpenAI’s original Swarm framework – now superseded by the production Agents SDK – popularised a flat alternative: handoffs, where peer agents pass control to whichever specialist is most appropriate for the current sub-task. Think of it as a triage nurse routing you to billing, radiology, or the pharmacist, depending on what you need next, rather than a manager who assigns all tasks from the top.

Handoff swarm with triage agent delegating to billing and technical agents
Handoff topology: flat peers, explicit delegation. Well-suited to customer support and domain-routing pipelines.

“The OpenAI Agents SDK enables you to build agentic AI apps in a lightweight, easy-to-use package with very few abstractions. It’s a production-ready upgrade of our previous experimentation for agents, Swarm.” — OpenAI Agents SDK documentation

The SDK’s design is deliberately minimal: agents with instructions and tools, handoffs for delegation, guardrails for validation, and a built-in runner loop. You orchestrate in Python rather than learning a graph DSL. For teams evaluating managed loops with built-in tracing, this is OpenAI’s answer to the local-versus-managed question.

Topology Comparison: Which Layout for Which Job?

Comparison of pipeline, single loop, orchestrator-worker, and handoff swarm topologies
Four topologies. The choice follows from how parallelisable the sub-tasks are and how tightly coupled their results are.
Layout | Best for | Weak when
Pipeline | Fixed ETL flows (extract, summarise, format) | The path depends on what intermediate steps discover
Single loop | 1-5 tool calls, one user, one domain | Many parallel sub-tasks, or context window fills
Orchestrator-worker | Research, due diligence, multi-source synthesis | Tight shared mutable state, real-time coding
Handoff swarm | Support triage, multi-department workflows | You need a single unified plan visible to all agents

Google’s A2A (Agent-to-Agent) protocol adds a wire format for agents discovering and calling each other across separate services. Pair it with MCP for tool access and you get the distributed topology described in our A2A + MCP course lesson. The loop logic still lives inside each agent; A2A and MCP standardise the plumbing between them.

Swarms to Keep an Eye On (2026)

The field moves fast. The list below covers the systems and frameworks actually worth tracking, with an honest note on what each one is – and what it is not.

Claude Research (Anthropic product, orchestrator-worker)

Production multi-agent research running inside Claude.ai and the API. Lead agent, parallel subagents with scoped objectives, and a CitationAgent post-processing pass. The engineering post is the best public blueprint for orchestrator-worker at scale: explicit scaling rules embedded in prompts (“simple fact-finding: 1 agent, 3-10 tool calls; complex research: up to 10 subagents”), parallel tool calling, and memory for long-running plan state. This is not a framework you embed in your codebase – it is a reference architecture to learn from and adapt.

OpenAI Agents SDK (successor to Swarm)

Production Python and JavaScript SDK with a built-in runner loop, handoffs, guardrails, MCP integration, session persistence, and tracing. The original Swarm repository stays on GitHub as an educational resource, but OpenAI explicitly positions the Agents SDK as the production upgrade path. Use it when you want managed loop behaviour without writing and maintaining your own runner.

LangGraph (LangChain)

Graph-based orchestration where nodes are steps or agents and edges are conditional transitions. Strong when you need explicit state machines, human-in-the-loop interrupt points, and checkpoint-resume for long-running flows. More structural ceremony than handoff-style delegation, but better observability for complex multi-step pipelines. Our MCP + LangGraph lesson walks through a concrete Node.js integration.

CrewAI

Role-based agent teams (“researcher”, “writer”, “critic”) with a configurable process layer: sequential, hierarchical, or consensual. Popular for demos and internal automation pipelines where roles map cleanly onto recognisable job functions. Watch for duplicated work when role boundaries are loosely defined – two agents with overlapping mandates will pursue the same sources independently.

Microsoft AutoGen

Conversation-centric multi-agent with flexible speaker selection. Strong for iterative refinement patterns such as a coder-reviewer pair. Less opinionated about topology than LangGraph, which means you assemble the conversation pattern yourself rather than following prescribed structures. Good for teams that want flexibility; requires more discipline to keep flows predictable.

Anthropic long-running coding harnesses

Separate from Research: a planner-generator-evaluator trio designed for coding tasks that run over many turns, with externalised plan files and checkpoint-based state. The same orchestrator-worker family, but optimised for mutable repositories rather than read-mostly web research. Most useful as a structural reference when building autonomous coding agents.

Manus and closed commercial products

Products like Manus package multi-agent planning, execution, and verification as a bundled experience. Treat them as black-box reference points for benchmarking UX expectations and latency – not for architectural inspiration, since the internal structure is not public.

For the failure modes that emerge specifically when you have more than one agent – hallucination cascades, trust boundary violations, process-level conflicts – read Multi-Agent Systems: Coordination, Trust, and Failure Modes and Lesson 40: Multi-Agent Failure Modes.

Failure Cases Unique to Swarms

Case 3: Subagent sprawl

Early Anthropic prototypes spawned up to 50 subagents for simple queries. Each spawn is a full context window and tool budget. Without explicit scaling rules embedded in the orchestrator prompt, the model interprets “be thorough” as “spawn as many agents as possible.”

// Dangerous: no budget constraint in orchestrator prompt
lead.spawnSubagents(userQuery);
// Model spawns 20 agents for a question that needs one.

// Better: encode scaling rules directly in the system prompt:
// "Simple fact-finding: 1 subagent, 3-10 tool calls.
//  Direct comparison: 2-4 subagents, 10-15 tool calls each.
//  Complex multi-source research: up to 10 subagents."

Case 4: Telephone-game synthesis

Workers return long prose summaries; the lead agent paraphrases each one and loses precision. By the time three subagent summaries reach the synthesiser, the final answer is the paraphrase of a paraphrase. Anthropic’s mitigation is artifact storage: subagents write structured output to an external store and pass lightweight references back to the lead, which reads the structured data directly rather than receiving a prose retelling.
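
The artifact pattern can be sketched with an in-memory store. Everything here is an assumption for illustration: the ArtifactStore class, the artifact:// reference format, and the helper names are not Anthropic's implementation, and a real system would back the store with a database or object storage.

```javascript
// Hypothetical in-memory artifact store keyed by opaque references.
class ArtifactStore {
  constructor() {
    this.artifacts = new Map();
    this.nextId = 1;
  }

  // Store a copy of the structured data and return a lightweight reference.
  put(agentId, data) {
    const ref = `artifact://${agentId}/${this.nextId++}`;
    this.artifacts.set(ref, JSON.parse(JSON.stringify(data)));
    return ref;
  }

  get(ref) {
    if (!this.artifacts.has(ref)) throw new Error(`unknown artifact: ${ref}`);
    return this.artifacts.get(ref);
  }
}

// A worker hands back a reference and a count, not a prose retelling.
function finishWorker(store, agentId, findings) {
  return { agentId, ref: store.put(agentId, findings), itemCount: findings.length };
}

// The lead reads the structured records directly, so nothing is paraphrased.
function collectFindings(store, handoffs) {
  return handoffs.flatMap((handoff) => store.get(handoff.ref));
}
```

The handoff message stays a few dozen tokens regardless of how much the worker found, which also keeps the lead agent's context window free for synthesis.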

Case 5: Duplicate search coverage

Without scoped, specific task descriptions, two workers run identical web searches. The delegation prompt must give each subagent a clear objective, an output format, explicit tool guidance, and a task boundary. “Research the semiconductor shortage” is too vague; “List the five largest automotive chip suppliers in Asia who reported supply changes in 2025. Return a JSON array with name and source URL” is a workable scope.
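
One way to enforce those four elements is to refuse to build a brief without them. The field names and the rendered layout below are assumptions for illustration; the point is only that a missing boundary fails loudly before a worker is spawned.

```javascript
// The four elements every delegation brief needs, per the paragraph above.
const REQUIRED_FIELDS = ["objective", "outputFormat", "toolGuidance", "boundary"];

function buildBrief(fields) {
  // Treat absent and whitespace-only values as missing.
  const missing = REQUIRED_FIELDS.filter(
    (name) => !String(fields[name] ?? "").trim()
  );
  if (missing.length) {
    throw new Error(`brief is missing: ${missing.join(", ")}`);
  }
  return [
    `Objective: ${fields.objective}`,
    `Output format: ${fields.outputFormat}`,
    `Tools: ${fields.toolGuidance}`,
    `Boundary: ${fields.boundary}`,
  ].join("\n");
}
```

For the semiconductor example, the objective would be the supplier list, the output format the JSON array with name and source URL, and the boundary something like "automotive suppliers in Asia only, 2025 reports only".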

When a Single Agent Is Actually the Right Answer

Multi-agent is not a maturity badge. It is a tool for a specific problem: parallelism across independent sub-tasks that would overflow a single context window. Anthropic note that coding tasks often have fewer truly parallelisable steps than research, and that shared mutable state – a codebase under active revision – fits poorly across independent workers today.

Stay at stage 2 when:

  • The task completes reliably in fewer than eight tool calls.
  • All steps share one context and one user-facing voice.
  • Latency matters more than exhaustive coverage (interactive support chat, not due diligence).
  • Your evaluation shows the swarm wins by less than the cost multiplier justifies. 15x token cost is hard to rationalise for a marginal quality gain.

Move to stage 4 when:

  • Sub-tasks are independent and genuinely parallelisable – for example, finding board members for 500 companies, or scanning legislation across multiple jurisdictions simultaneously.
  • A single context window would overflow if one agent handled everything serially.
  • The economic or informational value of the answer clearly exceeds the token cost – legal research, financial due diligence, security investigations.

What to Check Right Now

  • Map your current stage honestly. Single-shot, one loop, a stack, or a swarm? Most teams overestimate where they are.
  • Measure tokens per successful task. Run single-agent and multi-agent on the same evaluation set before committing to swarm topology.
  • Instrument every loop. Local or managed, you need traces: LLM latency, tool latency, spawn count, and the reason the goal gate fired. Without these, swarm debugging is guesswork.
  • Embed scaling rules in orchestrator prompts. Explicit subagent budgets are the most direct protection against sprawl.
  • Pick one SDK, one protocol. Agents SDK or LangGraph for orchestration; MCP for tools; A2A only if your agents genuinely live in separate services.

The evolution from LLM to swarm is not a ladder you must climb. It is a collection of shapes, each correct for a different problem: one loop for focused tool use, a stack for routing and memory, a swarm for compressing vast parallel search spaces. Choose the smallest shape that passes your evaluations, and upgrade only when the constraints of a single serial context window become the proven bottleneck.

nJoy 😉

Meet Posterboy: SudoAll Now Has an Agentic Publisher

Posterboy robot at a corporate event, POSTERBOY taped on its chest, waving hello

Today is a good day at SudoAll. We launched Posterboy, our agentic publishing system for LinkedIn, and it feels like the site crossed a line we have been walking toward for months: from writing for the web to operating on the web.

Posterboy lives inside the FRIDAY LinkedIn hub. It is not a chatbot stapled onto a share button. It is an agent-shaped workflow that can draft, format, preview, and publish posts through LinkedIn’s official APIs, with a human approval gate before anything goes live. Preview first. Publish when you mean it. That is the whole philosophy, and today it works.

APIs that were not built for agents

LinkedIn did not wake up one morning and ask, “How do we make life easy for autonomous publishers?” Its platform was shaped for people clicking buttons, for OAuth flows designed around human consent, for company pages, image requirements, and partner tiers that assume a traditional app on the other side.

That is exactly why Posterboy matters. Agentic systems do not need platforms to redesign themselves overnight. They need a layer that speaks human API on one side and agent intent on the other. Posterboy is that adapter: it respects LinkedIn’s rules, satisfies its constraints, and still lets an agent do the heavy lifting of composition, formatting, and preparation.

This is the pattern I expect to repeat everywhere. The interesting APIs were written for applications. The interesting work now is teaching agents to use them responsibly, with guardrails, with previews, and with a clear owner who says yes before the post hits the feed.

What it means for SudoAll

SudoAll has always been a writing-first site: long articles, masterclasses, technical depth. Distribution was the manual step at the end. Posterboy closes that gap. When we publish a piece here, we now have an agentic path to carry the idea onto LinkedIn without treating social as an afterthought or a separate silo.

That changes the rhythm of the project. Research and writing stay human-led. Repackaging, hero images, caption structure, and the publish handshake can be agent-led, with David still holding the final say. It is augmentation, not abdication, and it scales the reach of what we already do well.

Why I am enthusiastic

Because this is not vaporware. OAuth is wired. The SudoAll company page is live. The first posts can go out through a pipeline that an agent can drive end to end. Posterboy even has a voice: “Hello, world. SudoAll now has an agentic publisher. I am Posterboy.” Cheeky, but accurate.

We are building in public on a platform that was never designed for us, and it works anyway. That is the kind of friction I like: not fighting the API, not scraping around it, but meeting it on its terms and still moving at agent speed.

More articles will follow on sudoall.com. Posterboy will help them travel. Happy launch day.

nJoy 😉

The Agent Tool Loop: Your Code or the Inference Runtime?

Models plan; code executes. The loop is non-negotiable: propose an action, run it, feed the result back, repeat until done. What is negotiable is where the loop runs – in your service, or inside the inference provider’s runtime behind a single API call.

Flowchart comparing client-side agent loop versus managed loop inside provider runtime
Left: your application owns every hop. Right: one API call, provider runs the inner loop.

The Goal Gate

Both patterns share one thing: a rule that says “we are done.” Without it the agent loops forever. In code it looks like this:

// Three ways to terminate a loop:

// 1. Model returned a final answer (no tool calls requested)
if (!res.tool_calls?.length) return res.content;

// 2. Hard step budget - never let open-ended tasks burn silently
if (step >= MAX_STEPS) throw new Error("step budget exceeded");

// 3. Judge call - a rules check or small classifier says the
//    original question is satisfied even if the model wants more
if (await judge.isSatisfied(originalQuery, accumulatedAnswer)) return accumulatedAnswer;

Local and managed loops differ in who runs that gate – your for loop or the provider runtime – not in whether it exists.

Local Loop: Your Code Owns Every Hop

Example: “What’s the status of order 8842?”

// ── HOP 1: send user message to model ──────────────────────────────
const hop1 = await llm.chat({
  messages: [
    { role: "system", content: "Use lookup_order when needed." },
    { role: "user",   content: "What's the status of order 8842?" }
  ],
  tools: [lookupOrderSchema]
});
// Model does NOT answer yet - it requests a tool:
// { content: null, tool_calls: [{ name: "lookup_order", arguments: { order_id: "8842" } }] }

// ── YOUR CODE RUNS THE TOOL (not the LLM) ──────────────────────────
const result = await db.query(
  "SELECT status, carrier, eta FROM orders WHERE id = $1",
  [hop1.tool_calls[0].arguments.order_id]  // "8842", taken from the model's tool call
);
// { status: "shipped", carrier: "DHL", eta: "2026-06-26" }

// ── HOP 2: feed the result back, get the final answer ──────────────
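const hop2 = await llm.chat({
  messages: [
    ...previousMessages,  // the system and user messages sent in hop 1
    // The assistant turn that requested the tool must precede its result:
    { role: "assistant", content: null, tool_calls: hop1.tool_calls },
    { role: "tool", tool_call_id: hop1.tool_calls[0].id,
      content: JSON.stringify(result) }
  ],
  tools: [lookupOrderSchema]
});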
// { content: "Order 8842 shipped via DHL, arrives 26 June.", tool_calls: [] }

// ── GOAL GATE: tool_calls empty -> exit loop, send reply to user ───

One user turn. Two LLM round-trips. One DB query. Log all three latency contributions separately – never lump them into a single “response time” figure.

Example: Multi-tool research (three hops, one visible reply)

User: Compare our refund policy with what competitor X published last month.

Two messages in the chat UI. Three internal hops:

  1. Model requests fetch_url(competitorRefundPolicyURL). Your scraper runs and returns markdown.
  2. Model requests search_knowledge_base("refund policy"). Your vector DB returns a chunk.
  3. Model synthesises both results. No further tool calls. Goal gate opens.

// Each hop gets its own metric - this is what tells you which one is slow:
metrics.record("llm_hop1_ms",    hop1ElapsedMs);
metrics.record("tool_fetch_url_ms", fetchElapsedMs);
metrics.record("tool_search_kb_ms", searchElapsedMs);
metrics.record("llm_hop2_ms",    hop2ElapsedMs);
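
Collecting those elapsed times by hand gets tedious, so a small wrapper helps. This is a sketch, assuming `metrics` is any object with a record(name, value) method as in the snippet above; the helper name `timed` is made up for this example.

```javascript
// Run an async step and record how long it took, even if it throws.
async function timed(metrics, name, fn) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    metrics.record(name, performance.now() - start);
  }
}
```

Usage would look like `const page = await timed(metrics, "tool_fetch_url_ms", () => fetchUrl(url));`, where fetchUrl is whatever your scraper exposes. Recording in `finally` matters: a tool that times out is usually the slow hop you most need to see.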

The full local runner

async function runAgent(messages, tools, maxSteps = 8) {
  for (let step = 0; step < maxSteps; step++) {
    const res = await llm.chat({ messages, tools });

    // Goal gate: model returned a final answer.
    if (!res.tool_calls?.length) return res.content;

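    // Record the assistant turn that requested the tools. OpenAI-style chat
    // APIs reject a tool result that does not follow its originating call.
    messages.push({ role: "assistant", content: res.content ?? null,
                    tool_calls: res.tool_calls });

    // Execute every requested tool locally, append results to context.
    for (const call of res.tool_calls) {
      const out = await executeLocally(call);  // your code, your network, your logs
      messages.push({ role: "tool", tool_call_id: call.id,
                      content: JSON.stringify(out) });
    }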
  }
  // Goal gate never triggered inside the step budget.
  throw new Error("max steps reached");
}

Managed Loop: Provider Runs the Inner Cycle

Same logical loop, but the provider runs it. Your application sends one request; the provider calls the model, dispatches tools, feeds results back, and returns only when its internal goal gate fires.

Example: "What's the status of order 8842?" (managed)

// ── YOUR APPLICATION: one call ─────────────────────────────────────
const res = await llm.chat({
  messages: [
    { role: "system", content: "You help with orders." },
    { role: "user",   content: "What's the status of order 8842?" }
  ],
  tools: [{
    type: "mcp",
    server_url: "https://tools.example.com/mcp",  // must be publicly reachable
    allowed_tools: ["lookup_order"]
  }]
});
// Your code blocks here. Provider is running the loop internally.
// res.content == "Order 8842 shipped via DHL, arrives 26 June."

// ── WHAT HAPPENED INSIDE (only visible if you enable tracing) ──────
// [1] Model: tool_call lookup_order("8842")
// [2] Provider: connects to MCP server, runs tool, gets JSON
// [3] Model: synthesises final reply
// [4] Goal gate: no further tool calls -> return to client

Same reply. Different accountability surface. Your MCP server must be publicly reachable from provider egress IPs. Empty tool results never pass through your code, so handle them with a system-prompt rule:

// Add to system prompt for any managed-tool agent:
"If a tool returns no data or an error, say so explicitly.
 Do not infer or invent the answer."

Example: Web search agent (managed, serial tools)

// Your app:
const res = await llm.chat({
  messages: [{ role: "user",
    content: "Summarise Kubernetes networking changes in the last six months." }],
  tools: [{ type: "web_search", max_uses: 4 }]
});
// Provider internally: search -> search -> fetch -> synthesise
// User sees one reply after 3-8 seconds.
// Your logs show one API call. Provider traces show four hops.

Stacked latency: model inference plus sequential tool calls
Serial tool calls inside a managed loop add on top of model time. Invisible in your logs without provider tracing.

Side-by-Side: Same Chat, Different Plumbing

Question | Local loop | Managed loop
Who calls the LLM? | Your app, once per hop | Provider runtime, inside one request
Who runs tools? | Your process / MCP client | Provider runtime / remote MCP server
Who implements the goal gate? | Your loop + max_steps | Provider runtime limits
Debugging surface | Your logs and breakpoints | Provider trace API
User sees | One reply | One reply

Latency: Count Every Hidden Hop

// Case 1: local tool cost ranges
// - in-process lookup:       < 1 ms
// - local MCP over stdio:    ~1 ms (warm process)
// - remote HTTP API:         50-300+ ms
// - cold MCP server boot:    seconds on first request

// Case 2: managed loop budget check - do this before committing
const budgetMs     = 800;   // acceptable TTFT for chat UX
const modelMs      = 350;   // per inner LLM call
const innerLlmCalls = 2;    // typical for one tool use
const toolMs       = 280;   // per tool round-trip

const totalMs = modelMs * innerLlmCalls + toolMs;
// 980ms - over budget.
// Fix: keep latency-sensitive tools local, parallelise, or reduce max_uses.

// Case 3: silent failure guard (managed mode)
// Model receives an empty tool result and may hallucinate a confident answer.
// There is no if/catch in your code path - the model is your only filter.
// System prompt rule is the guard:
"If any tool returns empty or an error, say so. Do not invent an answer."

Failures You Still Own in Managed Mode

// WRITE IDEMPOTENCY
// Read tools retry cleanly. Write tools need dedup keys:
const orderPayload = {
  idempotency_key: crypto.randomUUID(),  // provider may retry internally
  ...orderData
};

// NETWORK REACHABILITY
// Tools on private networks cannot move to managed without a proxy.
// Test before cutover:
curl -I https://tools.example.com/mcp  # run from outside your network; must also respond to provider egress IPs

// ALERT GRANULARITY - split these, they have different fixes:
// - provider_latency_p95
// - mcp_cold_start_ms
// - model_quality_degradation (separate alert, separate runbook)
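
A dedup key only helps if the tool server honours it. Below is a minimal sketch of the receiving side, assuming an in-memory Map in place of a durable store; `handleCreateOrder` and the injected `createOrder` function are hypothetical names, not part of any provider or MCP SDK.

```javascript
// Hypothetical write-tool handler with idempotency-key dedup.
// Assumption: an in-memory Map stands in for a durable store with a TTL.
const writesByKey = new Map();

function handleCreateOrder(payload, createOrder) {
  const key = payload.idempotency_key;
  if (!key) {
    return Promise.reject(new Error("idempotency_key is required for write tools"));
  }
  // A retry (or a concurrent duplicate) with the same key reuses the first attempt.
  if (writesByKey.has(key)) return writesByKey.get(key);

  const attempt = Promise.resolve()
    .then(() => createOrder(payload))
    .catch((err) => {
      writesByKey.delete(key); // a failed write may be retried with the same key
      throw err;
    });
  writesByKey.set(key, attempt);
  return attempt;
}
```

Storing the promise rather than the result means two overlapping retries share one write. The key itself should be generated once per logical order and reused on every retry, not regenerated per attempt.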

When to Stay Local

  • During development - you need per-hop logs and breakpoints at the tool boundary.
  • Sub-second chat or voice UX where managed overhead is user-visible.
  • Tools on private networks or requiring local session state.
  • Custom goal gates: compliance judges, human-in-the-loop approval, or termination logic the provider runtime cannot implement.

When Managed Makes Sense

  • Standard search and retrieval where you do not want to run your own search infrastructure.
  • Many separate agents sharing one MCP tool surface.
  • Tool credentials that should not live in application repositories.
  • Teams that would rather pay for a tracing subscription than maintain a loop runner.

Migration Path

  1. Build local first. Log tool name, argument size, latency, result size per hop from day one.
  2. Classify tools: read vs write, local vs remote, latency-sensitive vs batch-tolerant.
  3. Move read-only, tolerant tools to managed first. Keep write tools local until idempotency keys and retry semantics are confirmed.
  4. Enable provider tracing before cutover and compare inner-hop counts against your local logs. Divergence means something changed.
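
Step 1 is easy to postpone, so here is one way it might look: a wrapper that records the four fields per hop. This is a sketch; `withHopLogging` is a hypothetical helper, sizes are measured as JSON bytes, and the log sink defaults to `console.log`.

```javascript
// Hypothetical wrapper: logs tool name, argument size, latency and result size per hop.
const jsonBytes = (value) => Buffer.byteLength(JSON.stringify(value ?? null) ?? "");

function withHopLogging(toolName, toolFn, log = console.log) {
  return async function loggedTool(args) {
    const argBytes = jsonBytes(args);
    const started = performance.now();
    let resultBytes = 0;
    let ok = false;
    try {
      const result = await toolFn(args);
      resultBytes = jsonBytes(result);
      ok = true;
      return result;
    } finally {
      // Runs on success and on failure, so failed hops are logged too.
      const latencyMs = Math.round(performance.now() - started);
      log({ tool: toolName, argBytes, resultBytes, latencyMs, ok });
    }
  };
}
```

Wrapping each tool once at registration time keeps the loop body unchanged and gives you the baseline that step 4 compares against provider traces.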

Swarms to Keep an Eye On

This covers the single-agent loop: one context window, serial hops. When tasks decompose into parallel research lanes, or you need orchestrator-worker topologies and peer handoffs, you are in multi-agent territory with different cost structures and failure modes. For the full evolution map - single-shot LLM through ReAct loop, agent stack, orchestrator swarms, and a 2026 watchlist - read the companion piece:

From Chat Completion to Agent Swarms: How Loop Architecture Evolved

What to Check Right Now

  • Draw your loop on paper with numbered hops and the goal gate labelled. If you cannot draw it, your production agent is opaque.
  • Split latency metrics: model time vs tool time per user-visible reply. An aggregate figure hides which hop is slow.
  • Measure MCP cold-start from a fresh process - that is what the first user of the morning experiences.
  • Audit write tools for idempotency keys before moving any to managed mode.
  • Tool count check: past ~25 tools, evaluate deferred schema loading to reduce input token cost.

Managed execution shrinks code; it does not shrink accountability. The chat looks the same to the user either way. What changes is whether you can see the loops between their question and the answer - and whether you find out about failures before or after they do.

nJoy 😉

Stop Sending Every Prompt to Your Flagship Model

Not every prompt that hits your API is created equal, but your billing statement treats them as if they are. A twenty-token label extraction and a three-thousand-token reasoning trace cost the same per token at the same endpoint. That means every routing decision you do not make is implicitly a decision to use your most expensive model for everything – including the work that does not need it.

Task-tier routing is the fix: a thin layer between your application and the model pool that dispatches each request to a model whose capability (and cost) matches the job. Build it yourself in middleware, use a cascade, or adopt a managed semantic router. The architecture is the same across all three. Only the operational burden differs.

Requests routed to different model size tiers
Size the model to the task. The router makes that decision once per request type, not once per provider negotiation.

The One-Model Bill

Most production backends have at least three distinct call shapes running in parallel, even when nobody on the team has labelled them as such. Consider a realistic example:

  • Labelling – short text in, a single category token out. The model needs vocabulary and context, not deep reasoning.
  • Conversational answers – multi-turn prose, user-visible, tone-sensitive. Speed and coherence matter.
  • Analysis – long working, structured output, citations. Quality matters more than cost per call.

When all three routes share one endpoint, the label call pays frontier model prices. Here is what that actually costs:

// $/1M tokens (illustrative - your provider rates will differ, ratios hold):
//   Small instruct:     $0.10 input  / $0.30 output
//   Frontier chat:      $3.00 input  / $15.00 output
//   Reasoning-class:    $1.25 input  / $10.00 output
//
// Label call (~120 input tokens, 20 output tokens):
//   Small model:    ~$0.000018
//   Frontier model: ~$0.000660  (~37x more expensive)
//
// At 500,000 label calls per month:
//   Small model:    ~$9
//   Frontier model: ~$330  -- on one call type alone

That waste never appears as a line item. It hides inside one aggregate model spend figure, which is why most teams only discover it when they go looking.
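
To rerun that arithmetic with your own rates and token counts, a small calculator is enough. This sketch reuses the illustrative prices from the figures above; `costPerCall` and the `RATES` table are hypothetical names.

```javascript
// Illustrative rates in $ per 1M tokens, copied from the figures above.
const RATES = {
  small:    { input: 0.10, output: 0.30 },
  frontier: { input: 3.00, output: 15.00 },
};

// Cost in dollars for one call with the given token counts.
function costPerCall(rate, inputTokens, outputTokens) {
  return (inputTokens * rate.input + outputTokens * rate.output) / 1_000_000;
}

const smallLabel = costPerCall(RATES.small, 120, 20);       // ~$0.000018
const frontierLabel = costPerCall(RATES.frontier, 120, 20); // ~$0.000660

console.log((frontierLabel / smallLabel).toFixed(1)); // "36.7"
console.log((smallLabel * 500_000).toFixed(2));       // "9.00"
console.log((frontierLabel * 500_000).toFixed(2));    // "330.00"
```

Swap in your provider's real rates and the median token counts from your logs; the ratio is the figure to watch.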

“Through extensive experiments, we demonstrate that when compared to standalone expert models, TO-Router improves query efficiency by up to 40%, and leads to significant cost reductions of up to 30%, while maintaining or enhancing model performance by up to 10%.” — Stripelis et al., TensorOpera Router, EMNLP 2024

Three Ways to Decide Where a Prompt Goes

Each routing mechanism makes a different trade-off between latency overhead, implementation complexity, and accuracy.

Explicit routing. The caller tags its own request: a separate endpoint per task type, or a task field in the request body. Zero additional classifier latency, fully auditable, easy to debug. This is the right default when the caller already knows what kind of job it is sending – which is often the case in structured backends where each function in your code issues a specific type of prompt.
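
In code, explicit routing is little more than a lookup table. A minimal sketch, assuming the caller sets a `task` field on the request; the model names are placeholders rather than real endpoints.

```javascript
// Hypothetical tier table; the model names are placeholders, not real endpoints.
const MODEL_BY_TASK = new Map([
  ["label", "small-instruct"],
  ["chat", "mid-tier-chat"],
  ["analysis", "reasoning-class"],
]);
const FALLBACK_MODEL = "small-instruct"; // cheap general model for unknown tasks

function routeExplicit(request) {
  const model = MODEL_BY_TASK.get(request.task);
  return model
    ? { model, matched: true }
    : { model: FALLBACK_MODEL, matched: false };
}

console.log(routeExplicit({ task: "label", prompt: "Refund please" }));
// { model: 'small-instruct', matched: true }
```

Logging the `matched` flag makes untagged callers visible instead of letting them blend into fallback traffic.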

Content-based routing. A small classifier reads the prompt and matches it to a task description you maintain. Adds milliseconds to low seconds of overhead depending on model size, but keeps routing logic out of application code – useful when prompts arrive from users who do not specify intent. Crucially, task descriptions must describe the expected output shape, not the subject matter. “Return one label from: billing | bug | howto | account” will route correctly; “handle billing questions” will not.

Cascades. Send the prompt to the cheapest capable model first and escalate only when a confidence check fails. Yue et al. matched strong-model quality on reasoning benchmarks at approximately 40% of strong-model cost by escalating only when the weak model’s output showed low consistency across repeated samples.

“Through experiments on six reasoning benchmark datasets … we demonstrate that our proposed LLM cascades can achieve performance comparable to using solely the stronger LLM but require only 40% of its cost.” — Yue et al., arXiv:2310.03094
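
A cascade with a consistency gate might look like the sketch below. It is a simplified illustration of the idea, not the method from Yue et al.: it samples the weak model several times, measures agreement by exact match after normalisation, and escalates when agreement is low. The `weak` and `strong` functions are injected and hypothetical.

```javascript
// Simplified consistency-gated cascade (illustration only).
// weak and strong are async functions: prompt -> answer string.
async function cascade(prompt, { weak, strong, samples = 5, minAgreement = 0.6 }) {
  const answers = await Promise.all(
    Array.from({ length: samples }, () => weak(prompt))
  );

  // Count how often each normalised answer appears.
  const counts = new Map();
  for (const answer of answers) {
    const key = answer.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const [topKey, topCount] = [...counts.entries()].sort((a, b) => b[1] - a[1])[0];
  const agreement = topCount / samples;

  if (agreement >= minAgreement) {
    const answer = answers.find((a) => a.trim().toLowerCase() === topKey);
    return { answer, tier: "weak", agreement };
  }
  // Low agreement: pay for the strong model on this prompt only.
  return { answer: await strong(prompt), tier: "strong", agreement };
}
```

Note that the gate itself costs `samples` weak-model calls per prompt, so the saving depends on how much cheaper the weak tier is and how often it escalates.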

A Three-Tier Layout and Where It Breaks

A practical router uses three tiers: a small instruct model for extraction and labelling, a mid-tier chat model for interactive prose, and a reasoning-class model for deep analysis. The failure cases below are the ones that cause silent quality regressions or surprise cost spikes after initial deployment.

Case 1: Overlapping Task Descriptions Misroute Traffic

If two task descriptions could plausibly describe the same prompt, a semantic classifier will flip between tiers unpredictably or fall back to the default (usually the most expensive model). The problem is always in the description wording, not the classifier itself.

// Weak descriptions - both match "I have a billing question":
//   label: "Handle user messages about the product"
//   chat:  "Help customers with billing"
//
// Strong descriptions - output contract, not subject:
//   label: "Classify intent. Return exactly one token: billing|bug|howto|account"
//   chat:  "Write a helpful multi-sentence reply for the end user"
//
// The classifier routes on output shape, not topic.
// If you cannot write distinct output contracts, merge the tiers.

Case 2: Reasoning Paths Run Out of Completion Budget

Models that use extended thinking or chain-of-thought consume output tokens for reasoning before the visible answer. A 1024-token completion cap can yield empty visible content and a finish_reason: length response, with no error raised. On analysis-tier routes, treat 4096 tokens as a floor for the completion budget, not a ceiling.
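
Because no error is raised, the check has to be explicit. A minimal sketch, assuming an OpenAI-style response shape (`choices[0].finish_reason` and `choices[0].message.content`); your provider's field names may differ, and both helper names are hypothetical.

```javascript
const ANALYSIS_MIN_COMPLETION_TOKENS = 4096; // floor suggested in the text above

// Assumes an OpenAI-style response shape; adapt the field names to your provider.
function inspectCompletion(response) {
  const choice = response?.choices?.[0];
  const content = choice?.message?.content ?? "";
  const hitLimit = choice?.finish_reason === "length";
  const emptyVisible = content.trim().length === 0;
  return { hitLimit, emptyVisible, silentTruncation: hitLimit && emptyVisible };
}

// Raise the completion cap on analysis routes, leave other tiers untouched.
function completionBudgetFor(tier, requestedMaxTokens) {
  return tier === "analysis"
    ? Math.max(requestedMaxTokens, ANALYSIS_MIN_COMPLETION_TOKENS)
    : requestedMaxTokens;
}
```

Alerting on `silentTruncation` separately from ordinary `length` finishes distinguishes "answer cut short" from "no answer at all".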

Case 3: Switching Models Mid-Session Breaks Prefix Cache and Tone

When a multi-turn session starts on the mid-tier model and escalates to reasoning-class on turn three, two things break: the provider cannot reuse the cached prefix from the earlier turns (repaying full input cost), and the model’s response style may shift noticeably within one conversation. The fix is to pin a session to the model chosen on its first turn.

// Session-pinned routing in Node.js:
async function chat(sessionId, messages) {
  // Look up the model assigned when this session started.
  let tier = await sessionCache.get(sessionId);

  if (!tier) {
    // First turn: classify and persist.
    tier = await classifyTaskTier(messages);
    await sessionCache.set(sessionId, tier, { ttl: 3600 });
  }

  return invokeModel(tier, messages);
}
// Result: consistent prefix cache, consistent voice.

Classifier matching prompts to model pools
The classifier reads intent and routes to the appropriate pool. The fallback path should route to a cheap general model, never silently upgrade.

Selection Policies Inside Each Tier

Choosing a tier is half the decision. Within each tier you still need a policy for picking among multiple available models:

  • Cheapest acceptable for labelling and extraction – quality differences between small models are negligible for simple classification.
  • Lowest TTFT for interactive UI – a 200 ms response gap is noticeable to users; a 5% quality difference is not.
  • Fixed priority list when you have a ranked preference (try provider A, fall back to B, then C on error).
  • Cascade with confidence gate when most queries are easy but a long tail requires the flagship model – escalate only on low-confidence outputs from the cheaper tier.

Always define a fallback. Unmatched prompts should land on a cheap general model, not throw an error or silently route to the most expensive endpoint you have.
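
The fixed priority list is the simplest of these policies to write down. A sketch, assuming each provider is an object with a `name` and an async `call(prompt)`; both are hypothetical, not a real SDK interface.

```javascript
// Try providers in ranked order; move to the next one only on error.
async function invokeWithPriority(providers, prompt) {
  const errors = [];
  for (const provider of providers) {
    try {
      const output = await provider.call(prompt);
      return { servedBy: provider.name, output, errors };
    } catch (err) {
      errors.push({ provider: provider.name, message: String(err?.message ?? err) });
    }
  }
  const tried = errors.map((e) => e.provider).join(", ");
  throw new Error(`all providers failed: ${tried}`);
}
```

Returning the collected `errors` alongside a successful reply lets you see how often provider A is failing even when users never notice.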

Telemetry You Actually Need

Without per-request logging, you cannot separate routing savings from a natural shift in traffic mix. Log these fields for every request: matched tier, serving model, tokens in, tokens out, estimated cost, TTFT, and total latency. After a week, plot cost-per-task-type before and after routing. That is the only number that proves the router is doing its job.
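
Once those fields are logged, the cost-per-task-type figure is a short aggregation. The record field names below (`matchedTier`, `tokensIn`, `tokensOut`, `estimatedCost`) are assumptions about how you might name the logged fields.

```javascript
// Aggregate logged requests into cost per matched tier.
// Field names are hypothetical; map them to whatever your logger emits.
function costPerTier(records) {
  const totals = new Map();
  for (const r of records) {
    const t = totals.get(r.matchedTier) ?? { requests: 0, tokensIn: 0, tokensOut: 0, cost: 0 };
    t.requests += 1;
    t.tokensIn += r.tokensIn;
    t.tokensOut += r.tokensOut;
    t.cost += r.estimatedCost;
    totals.set(r.matchedTier, t);
  }
  return [...totals].map(([tier, t]) => ({
    tier,
    ...t,
    costPerRequest: t.cost / t.requests,
  }));
}
```

Run it on one week before routing and one week after, and compare `costPerRequest` per tier rather than the aggregate bill, so a shift in traffic mix does not masquerade as a saving.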

When to Skip Routing Altogether

Routing adds a moving part that requires calibration and monitoring. Skip it when:

  • Every call genuinely needs the same capability – there is no cheaper model that passes your quality bar.
  • Monthly spend is too small to justify the operational overhead of maintaining task descriptions.
  • Your compliance requirements demand deterministic, explainable model selection for every request (use explicit endpoint tagging instead).
  • You have not yet measured your actual task mix – routing based on assumptions rather than data will create more problems than it solves.

Rollout Without Surprises

  1. Sample first. Pull a week of production prompts and label them by task type manually. You need ground truth before building a classifier.
  2. Price the counterfactual. For each task type, calculate what it would cost at the cheaper tier. This is your upper-bound savings estimate.
  3. Eval the cheaper models on each task type before any traffic moves. Do not assume quality is adequate; measure it.
  4. Shadow-route. Run the router in logging-only mode: record decisions and estimated costs, but still serve the original model. Catch misroutes before they affect users.
  5. Canary 5% of live traffic. Watch real cost and quality metrics for a week before full cutover.
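
Step 4 can be sketched as a thin wrapper: classify, log the decision, then serve from the original model regardless. The `classify`, `serveOriginal` and `log` functions are injected and hypothetical.

```javascript
// Shadow mode: record what the router would have done, serve the original model anyway.
async function shadowRoute(request, { classify, serveOriginal, log }) {
  let shadowTier = "unclassified";
  try {
    shadowTier = await classify(request);
  } catch (err) {
    // A router failure must never affect the user-facing reply in shadow mode.
    log({ requestId: request.id, routerError: String(err?.message ?? err) });
  }
  const reply = await serveOriginal(request);
  log({ requestId: request.id, shadowTier, servedBy: "original" });
  return reply;
}
```

In this sketch the classifier runs before the reply, which adds its latency to the request; running it after the response is sent is a reasonable variation if shadow mode must be invisible.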

What to Check Right Now

  • Task mix histogram from last week’s logs – if you do not have one, that is the first step.
  • Retrospective cost estimate – what would tiered routing have cost versus what you actually spent?
  • Task description overlap if you are already using or planning semantic routing.
  • Completion token floors on any analysis or reasoning path – check that max_tokens is not silently truncating outputs.

The waste is invisible until you look for it. Once you see the histogram of task types against the tier each one actually needed, it is very hard to unsee.

nJoy 😉

One GPU, Two Bottlenecks: Serving Vision-Language Models Without Wasting Silicon

You shipped a vision-language model on the same GPU stack that served your text LLM. Same vLLM config, similar parameter count, no red alarms in monitoring – yet throughput fell anyway. Inter-token latency crept up. Batch-size tuning helped a little; quantisation helped a little; neither explained the gap.

The issue is not a bad deploy. The issue is that multimodal inference is secretly two completely different workloads with opposite hardware appetites, and you are running both on one piece of silicon. Aggregate GPU utilisation looks fine because each phase keeps one resource busy while the other resource sits idle. Users feel the slowdown; your dashboard hides it.

Two GPU workload phases with opposing compute and memory demands
Encode wants tensor cores. Decode wants memory bandwidth. One card rarely satisfies both demands at the same time.

Encode vs Decode: A Resource Inversion

A VLM pipeline is two engines operating in sequence. Understanding the difference between them is the whole article, so let us be precise.

Image encoding is a matrix-intensive operation. Patches from the image flow through a ViT-style vision tower; the bottleneck is floating-point compute. Profiling on consumer cards typically shows 80%+ tensor-core utilisation with memory-bandwidth utilisation in the single-digit percent range. The GPU is doing arithmetic; it is barely touching main memory.

Token decoding is the exact opposite. Each new token requires re-reading the full set of model weights and the growing KV cache from HBM, performing a small amount of arithmetic, emitting one token, and repeating. On datacenter GPUs, bandwidth saturates while tensor cores sit nearly idle. The GPU is reading memory; it is barely doing arithmetic.

Co-locate both phases on one card and you permanently pay for capacity that each phase ignores. During encode, HBM is underused. During decode, tensor cores are underused. Neither phase gets hardware tuned for what it actually needs.

“Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound.” — Donglin Yu et al., arXiv:2603.12707

Kleppmann’s Designing Data-Intensive Applications makes a related point about aggregate metrics: when the bottleneck shifts between pipeline stages, a single utilisation number averages two opposite stories into one misleadingly healthy reading. You need phase-level instrumentation, not a system-wide gauge.

Visual Tokens and the KV Cache Problem

To understand why high-resolution images hurt decode performance specifically, you need to understand what the KV cache holds and why it costs bandwidth.

During prefill, the model builds a KV cache: per-layer key and value tensors for every token in the input context. Without it, autoregressive decode would recompute full attention over the entire history on each step, which scales quadratically with sequence length. The cache trades memory for compute. That is a good trade – unless the cache is enormous.

Visual tokens join the cache at prefill and stay there for every single decode step. The model has already compressed the image into embeddings, but hundreds of image tokens still occupy HBM and get re-read with every token generated. The bandwidth cost is proportional to output length, not to how “done” the image processing feels.

Case 1: High-Resolution Inputs Bloat the Cache Before Decode Begins

A modest 336×336 image can produce approximately 576 visual tokens. Add 128 text tokens and you have 704 tokens in the KV cache before the first answer token is generated. For a 7B MHA model at FP16, that is roughly 350 MB per request. Scale to eight concurrent requests and you are using ~2.8 GB of cache capacity before generation has even started.

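// FP16 KV cache size (one sequence):
// bytes ≈ 2 (K and V) × layers × seq_len × kv_heads × head_dim × 2 (bytes per FP16 value)
//
// Assuming a typical 7B MHA shape (32 layers, 32 KV heads, head_dim 128):
// Text-only request (128 tokens):  ~64 MB
// Same request + one image (704 tokens): ~350 MB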
//
// Quantisation reduces bytes per element.
// It does not remove the 576 persistent image token slots
// that get re-read on every decode step.
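
The same formula as a function, so you can plug in your own model shape. The 32-layer, 32-head, 128-dim configuration is an assumption that matches a typical 7B MHA model; `kvCacheBytes` is a hypothetical helper name.

```javascript
// KV cache bytes for one sequence: K and V, per layer, per token, per head.
function kvCacheBytes({ layers, seqLen, kvHeads, headDim, bytesPerValue = 2 }) {
  return 2 * layers * seqLen * kvHeads * headDim * bytesPerValue;
}

const MIB = 1024 * 1024;
const sevenBLike = { layers: 32, kvHeads: 32, headDim: 128 }; // assumed 7B MHA shape

const textOnly = kvCacheBytes({ ...sevenBLike, seqLen: 128 }) / MIB;
const withImage = kvCacheBytes({ ...sevenBLike, seqLen: 128 + 576 }) / MIB;

console.log(textOnly, withImage);    // 64 352
console.log((withImage * 8) / 1024); // 2.75 (GiB for eight concurrent requests)
```

Models with grouped-query attention have fewer KV heads than attention heads, so pass the KV head count rather than the attention head count.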

Case 2: Concurrency Makes the Problem Super-Linear

Raising batch size under multimodal load can actually hurt. More concurrent requests mean more fat KV caches competing for the same HBM bandwidth simultaneously. Inter-token latency often climbs faster than image-token count alone would predict – the slope steepens because bandwidth contention compounds.

// Metrics to correlate (log all three together):
// - kv_cache_bytes per request vs image resolution
// - hbm_bandwidth_util during decode phase only
// - itl_p95 vs concurrent_request_count
//
// If ITL grows super-linearly with concurrency, you have
// bandwidth contention, not a "slow model".

KV cache growth with vision tokens
Image tokens enter the cache at prefill and remain for every decode step. Output length, not image size alone, sets the total bandwidth cost.
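
One rough way to test for super-linear growth is the log-log slope between consecutive load points: an exponent above 1 means ITL grew faster than concurrency did. This is a heuristic sketch, not a statistical test, and the measurements in the example are made up.

```javascript
// points: [{ concurrency, itlMs }], sorted by ascending concurrency.
// Returns the local scaling exponent between each pair of neighbouring points.
function scalingExponents(points) {
  const exponents = [];
  for (let i = 1; i < points.length; i++) {
    const a = points[i - 1];
    const b = points[i];
    exponents.push(
      Math.log(b.itlMs / a.itlMs) / Math.log(b.concurrency / a.concurrency)
    );
  }
  return exponents;
}

function isSuperLinear(points) {
  return scalingExponents(points).some((e) => e > 1);
}

// Made-up measurements: ITL roughly flat at first, then bending upward.
const sample = [
  { concurrency: 1, itlMs: 20 },
  { concurrency: 2, itlMs: 24 },
  { concurrency: 4, itlMs: 36 },
  { concurrency: 8, itlMs: 90 },
];
console.log(isSuperLinear(sample)); // true
```

Use p95 ITL from the decode phase only, and several runs per point, since a single noisy measurement can push one exponent over the threshold.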

Where to Split the Pipeline

Once you accept that the two phases want different hardware, the next question is where to draw the boundary – and the answer matters more than it looks.

Prefill/decode split cuts the pipeline after prefill: a prefill node builds the KV cache and ships it to a decode node. The payload is the full KV tensor – hundreds of MB to GB depending on model depth and context length. That demands high-bandwidth interconnect (NVLink, InfiniBand). Ordinary PCIe clusters do not have enough throughput to make this worthwhile.

Encoder/decoder split cuts earlier: a vision encoder node processes the image and ships only the resulting embeddings to the LLM node. The payload is just token_count × hidden_size – the KV cache does not exist yet, so you never ship it across the wire.

// LLaVA-7B-style numbers (576 vision tokens, dim 4096, FP16):
//
// Vision embeddings across the wire: ~4.5 MB
// Full KV cache for same context: ~350 MB
// Ratio: ~78x less data at the encoder boundary
//
// PCIe transfer at 16 GB/s:
//   4.5 MB  -> sub-millisecond
//   350 MB  -> tens of milliseconds

“Partitioning here reduces transfer complexity from O(L·s_ctx) bytes (GB-scale KV caches under stage-level disaggregation) to O(N_v·d) bytes (MB-scale embeddings), an O(L) reduction where L is the transformer depth.” — Donglin Yu et al., arXiv:2603.12707

Yu et al. report 12×–196× transfer reductions across current architectures depending on model depth. The gap widens as models get deeper: embeddings stay compact at roughly the same size, while KV migration cost grows linearly with the number of transformer layers.

Matching Silicon to Phase

Once the split is in place, hardware assignment follows naturally. Encode nodes want FLOPs-per-dollar – consumer or commodity compute cards work well here. Decode nodes want HBM bandwidth and large VRAM – A100/H100 class. An RTX 4090 and an A100 have similar peak TFLOPS; the 4090 wins on FLOPs per dollar, but the A100 wins on memory bandwidth and total VRAM.

Heterogeneous deployments under this split have shown approximately 40% cost savings versus homogeneous baselines in recent benchmarks, with no measurable latency regression when scheduling is handled correctly. The standard inference engine tricks – CUDA graphs, packed prefill, paged KV – still apply and still matter, but they do not substitute for matching silicon to the workload phase.

Case 3: Text-Only Traffic Leaves Encoders Idle

A hard encode/decode split creates a utilisation problem during text-only bursts: the encoder pool sits idle while the decoder pool is saturated. Work-stealing schedulers solve this by letting encoder nodes absorb decode jobs when the vision queue is empty. The roles are not symmetric – decode workers cannot encode – but encode workers can handle text-only generation with their available compute, recovering utilisation without fragile dynamic role reassignment.
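
The asymmetric stealing rule fits in one function. This is a deliberately simplified sketch with two in-memory queues (`vision` and `textOnly`) and hypothetical names; a real scheduler would also handle the decode stage of multimodal requests, priorities and back-pressure.

```javascript
// Pick the next job for a worker. Encoders may steal text-only work when idle;
// decode workers never take vision jobs.
function nextJob(workerRole, queues) {
  if (workerRole === "encoder") {
    if (queues.vision.length > 0) {
      return { job: queues.vision.shift(), stolen: false };
    }
    if (queues.textOnly.length > 0) {
      return { job: queues.textOnly.shift(), stolen: true };
    }
    return null;
  }
  if (queues.textOnly.length > 0) {
    return { job: queues.textOnly.shift(), stolen: false };
  }
  return null;
}
```

Tracking the `stolen` count over time shows how much of the encoder pool's capacity is being recovered during text-only bursts.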

Diagnose Before You Re-Architect

Before splitting pools, confirm the bottleneck is actually the encode/decode phase mismatch and not something simpler:

  1. Isolate the variable. Hold model and sampling parameters fixed; vary only image resolution and count across requests.
  2. Plot ITL against vision-token count at a realistic concurrency level. A flat slope means bandwidth is not the issue.
  3. Profile each phase separately using Nsight Systems or equivalent. Look for compute-bound encode and bandwidth-bound decode as the diagnostic signature.
  4. Correlate with your serving metrics. vLLM exposes KV-cache utilisation and scheduler queue depth; use them.

When Monolithic Serving Is Enough

Dual pools make economic sense at sustained multimodal volume where a significant fraction of total inference spend is attributable to this mismatch – roughly a third or more of your bill. For prototypes, low-QPS applications, and teams that do not have the operational capacity to run and schedule heterogeneous fleets, staying on a single GPU is the right call until the numbers force a change. The same encode/decode asymmetry applies to audio and video input towers, not just vision – so if you later add speech input, the analysis carries over directly.

What to Check Right Now

  • KV bytes per request across the range of image resolutions your users actually send.
  • Phase-split profiling to confirm compute-bound encode and bandwidth-bound decode are both present.
  • ITL slope under concurrent load – super-linear growth signals bandwidth contention, not raw model speed.
  • Business case first. Heterogeneous pools add scheduling complexity. Run the cost arithmetic before committing to the architecture.

Half your GPU was on holiday during every request. The fix is giving each phase the silicon it actually uses – and knowing, precisely, which phase is slowing you down before you spend money on a bigger card.

nJoy 😉

Gemma 4 on Apple Silicon: All Four Models Compared, Benchmarked, and Running Locally

Google just dropped a 31-billion-parameter model that sits on the same leaderboard row as Claude Sonnet 4.5 and outranks models 20 times its size. That is not a typo. Gemma 4 31B, released under Apache 2.0 on 2 April 2026, is the densest punch-per-parameter open model the industry has ever seen, and you can run it on the MacBook you already own. If you have been paying $20-200 a month for API access to frontier models, this article is about to ruin your budget justification.

Gemma 4 31B dense model architecture visualisation, dark technical diagram
Gemma 4 31B – Google DeepMind’s flagship open model, now runnable on consumer hardware.

What Exactly Is Gemma 4 31B?

Gemma 4 is a family of open-weight models from Google DeepMind, built from the same research and technology that powers Gemini 3. The family ships in four sizes: E2B and E4B for phones and edge devices, a 26B Mixture-of-Experts (MoE) variant, and the 31B dense flagship. This article focuses on the 31B dense model, which is the largest, highest-quality member of the family.

The 31B is a dense transformer, meaning all 30.7 billion parameters fire on every single token. There is no routing, no gating, no “some experts sleep whilst others work.” Every weight participates in every inference step. That architectural simplicity buys you two things: predictable behaviour and maximum quality per parameter.

Here are the core specifications, straight from the official model card:

Property | Gemma 4 31B Dense
Total Parameters | 30.7B
Active Parameters | 30.7B (all of them, every token)
Layers | 60
Context Window | 256K tokens
Sliding Window | 1,024 tokens
Vocabulary Size | 262K
Vision Encoder | ~550M parameters (27-layer ViT with 2D RoPE)
Audio | Not supported (E2B/E4B only)
Licence | Apache 2.0
Input Modalities | Text + Images (variable resolution)

The architecture uses a hybrid attention mechanism that interleaves local sliding-window attention with full global attention, ensuring the final layer is always global. Global layers use unified Keys and Values with Proportional RoPE (p-RoPE) to keep memory manageable at long context lengths. In plain English: the model can see its full 256K-token window without the memory cost exploding the way it would with naive full attention on every layer.

“Built from the same world-class research and technology as Gemini 3, Gemma 4 is the most capable model family you can run on your hardware.” – Google, Gemma 4 Launch Blog

Dense vs Sparse vs MoE: The Architecture That Matters

Understanding why Gemma 4 ships two different 20-30B models requires understanding three architectural paradigms that define how modern LLMs spend compute. This is the single most important concept for choosing which model to run locally, so let us get it right.

Dense Models: Every Neuron, Every Token

A dense transformer activates 100% of its parameters on every forward pass. If a model has 31 billion parameters, it performs 31 billion parameters’ worth of computation for every single token it generates. This is the classical architecture from “Attention Is All You Need” (Vaswani et al., 2017), and it remains the gold standard for raw quality. Dense models are simpler to train, more predictable in behaviour, and generally produce the highest-quality outputs at a given parameter count.

The downside is obvious: compute cost scales linearly with parameter count. Double the parameters, double the FLOPs per token. Gemma 4 31B is a dense model, and that is precisely why it tops the quality charts.

Mixture-of-Experts (MoE): Conditional Computation

MoE models replace certain feed-forward layers with multiple parallel “expert” sub-networks. A learned routing network examines each token and decides which experts handle it. Only a small subset of experts activate per token, so the total parameter count far exceeds the active parameter count.
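
The routing step can be illustrated with generic top-k gating: softmax over per-expert scores, keep the k largest, renormalise their weights. This is a textbook sketch, not Gemma 4's actual router; `routeTopK` and the example logits are made up.

```javascript
// Generic top-k gating: returns the chosen experts and their mixing weights.
function routeTopK(logits, k) {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((acc, e) => acc + e, 0);

  const top = exps
    .map((e, expert) => ({ expert, p: e / sum }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k);

  // Renormalise so the selected experts' weights sum to 1.
  const topSum = top.reduce((acc, t) => acc + t.p, 0);
  return top.map((t) => ({ expert: t.expert, weight: t.p / topSum }));
}

// Four experts, two active: experts 1 and 3 win for this token.
console.log(routeTopK([0.1, 2.0, -1.0, 1.5], 2).map((t) => t.expert)); // [ 1, 3 ]
```

Only the selected experts' feed-forward blocks run for that token, which is where the compute saving comes from.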

Take Gemma 4’s 26B A4B variant as a concrete example:

Property | 26B A4B MoE | 31B Dense
Total Parameters | 25.2B | 30.7B
Active Parameters per Token | 3.8B | 30.7B
Expert Count | 128 total, 8 active + 1 shared | N/A (dense)
Layers | 30 | 60
Arena AI Score | 1,441 | 1,452
Inference Speed | ~4B model speed | ~31B model speed

The 26B MoE only activates 3.8 billion parameters per token. That means it computes at roughly the speed of a 4B dense model, despite having the “knowledge capacity” of a 25B model. The trade-off? Slightly lower peak quality and less predictable behaviour for fine-tuning, because the learned routing decisions are an extra moving part that the dense model does not have.

Gemma 4’s MoE is architecturally unusual: each layer runs both a dense GeGLU FFN and a 128-expert MoE system in parallel, then sums the outputs. Most MoE architectures replace the FFN entirely. Gemma 4 keeps both, which partly explains why its MoE variant scores so close to the dense model despite activating far fewer parameters.

Sparse Models: The General Category

MoE is a specific type of sparse architecture, but “sparse” is the broader umbrella. Any model that selectively activates a subset of its parameters per token is sparse. The key insight, as described in Christopher Bishop’s Pattern Recognition and Machine Learning, is that not every feature in a learned representation is relevant to every input. Sparsity exploits this by routing computation only where it is needed.

Here is the practical cheat-sheet:

Architecture | Compute per Token | Memory Footprint | Best For
Dense | All parameters | All parameters must fit | Maximum quality, fine-tuning, predictable outputs
MoE (Sparse) | Active subset only | All parameters must still fit | Fast inference, responsive chat, latency-critical agents
Quantised Dense | All parameters (reduced precision) | Reduced (e.g. 4-bit = ~4x smaller) | Running dense models on constrained hardware

A critical nuance: MoE does not reduce memory requirements. All 25.2B parameters of the 26B MoE must be loaded into memory even though only 3.8B are active per token. The inactive experts are idle but still resident. MoE saves compute, not memory. This is why quantisation and MoE are complementary techniques, and why running the Q4-quantised 31B dense on a Mac with 24GB is actually a better deal than running the full-precision 26B MoE.
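
The comparison is simple arithmetic: weight memory is total parameters times bits per weight, regardless of how many parameters are active. The sketch below assumes a nominal 4 bits per weight for Q4; real Q4 formats store extra scale metadata, so actual files come out larger, and the KV cache and runtime overhead are ignored.

```javascript
// Weight memory in GB (billions of bytes) for a given precision.
function weightGB(totalParamsBillions, bitsPerWeight) {
  return (totalParamsBillions * bitsPerWeight) / 8;
}

const moeBf16 = weightGB(25.2, 16); // ~50.4 GB: every expert is resident
const denseQ4 = weightGB(30.7, 4);  // ~15.35 GB at a nominal 4 bits per weight

const fitsIn = (gb, memoryGB) => gb < memoryGB;
console.log(fitsIn(moeBf16, 24), fitsIn(denseQ4, 24)); // false true
```

Active parameter count never appears in the function: it affects tokens per second, not whether the model loads.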

Dense vs MoE architecture comparison diagram, dark technical illustration
Dense models fire every neuron; MoE routes each token through a small subset of specialised experts.

The Benchmarks: Arena Rankings and Hard Numbers

Benchmarks are a minefield of cherry-picked numbers and suspiciously round percentages. So let us look at two sources: the Arena AI human-preference leaderboard and the automated benchmark suite from Google’s own model card.

Arena AI: Human Preference Rankings

As of 31 March 2026, the Arena AI text leaderboard has 337 models ranked from 5.7 million human votes. Here is where Gemma 4 lands in the overall table:

Model | Organisation | Licence | Arena Score
Claude Opus 4.6 Thinking | Anthropic | Proprietary | 1,504 +/- 6
Claude Opus 4.6 | Anthropic | Proprietary | 1,499 +/- 5
Gemini 3.1 Pro | Google | Proprietary | 1,494 +/- 5
Claude Sonnet 4.5 Thinking | Anthropic | Proprietary | 1,452 +/- 3
Gemma 4 31B | Google | Apache 2.0 | 1,452 +/- 9
Qwen 3.5 397B A17B | Alibaba | Apache 2.0 | 1,449 +/- 6
Gemini 2.5 Pro | Google | Proprietary | 1,448 +/- 3
Gemma 4 26B A4B | Google | Apache 2.0 | 1,441 +/- 9

Read that again. Gemma 4 31B scores 1,452, level with Claude Sonnet 4.5 Thinking and nominally ahead of Gemini 2.5 Pro and Qwen 3.5 397B, although its +/- 9 interval overlaps both, so those leads sit within the margin of error. Among open-source models, it is ranked #3 in the world. This 31-billion-parameter model is competing with, and beating, models that are far larger. Google claims it “outperforms models up to 20 times larger,” and the Arena data backs that up.

Automated Benchmarks: The Full Picture

Here is a compact benchmark comparison from Google’s official model card:

Benchmark | Gemma 4 31B | Gemma 4 26B MoE | Gemma 3 27B
MMLU Pro | 85.2% | 82.6% | 67.6%
AIME 2026 | 89.2% | 88.3% | 20.8%
LiveCodeBench v6 | 80.0% | 77.1% | 29.1%
GPQA Diamond | 84.3% | 82.3% | 42.4%
Codeforces ELO | 2,150 | 1,718 | 110
MMMU Pro | 76.9% | 73.8% | 49.7%
MMMLU | 88.4% | 86.3% | 70.7%

The AIME 2026 jump is staggering: from 20.8% to 89.2%. The Codeforces ELO went from 110 to 2,150. This is not a small step over Gemma 3; it is a generational leap.

Running Gemma 4 31B on a Mac: The Practical Guide

This is where it gets exciting for anyone with an Apple Silicon Mac. The unified memory architecture on M-series chips is a genuine superpower for local LLM inference, because the GPU and CPU share the same RAM pool. No separate VRAM cliff. If you have 24GB, 36GB, or more of unified memory, you are in business.

Memory Requirements

Precision | Approx. Size | Minimum Memory | Mac Recommendation
BF16 | ~58 GB | 64 GB+ | M2/M3/M4 Max 64GB+
FP8 | ~30 GB | 36 GB+ | M3/M4 Pro 36GB
Q4_K_M | ~20 GB | 24 GB+ | M2/M3/M4 Pro 24GB
Q3 | ~15 GB | 18 GB+ | Smaller Macs

The sweet spot for most Mac users is Q4_K_M quantisation at about 20GB. This is the default distribution on Ollama, and it fits comfortably on a 24GB Mac with some headroom left for the operating system.
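If you would rather not eyeball the table, the lookup can be sketched in a few lines. The thresholds are copied from the "Minimum Memory" column above; the function name and the choice to prefer the highest precision that fits are my assumptions.

```python
# (precision, approx. size in GB, minimum memory in GB), highest precision first.
PRECISIONS = [
    ("BF16", 58, 64),
    ("FP8", 30, 36),
    ("Q4_K_M", 20, 24),
    ("Q3", 15, 18),
]

def best_precision(unified_memory_gb):
    """Return the highest-precision row whose minimum memory fits, else None."""
    for name, _size_gb, min_memory_gb in PRECISIONS:
        if unified_memory_gb >= min_memory_gb:
            return name
    return None

for memory in (16, 24, 36, 64):
    print(f"{memory} GB -> {best_precision(memory)}")
```

A result of None means the 31B is not a sensible choice at any listed precision, and one of the smaller family members is the better fit.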

Step 1: Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Or download the macOS app directly from ollama.com.

Step 2: Pull and Run the Model

ollama run gemma4:31b

That is it. Two commands total. The download is around 20GB, and then you are chatting with a model that matches Claude Sonnet 4.5 on the Arena leaderboard.
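Once the model is running, you can also talk to it from code. Ollama exposes a local HTTP API, by default on port 11434, and the sketch below calls its generate endpoint using only the Python standard library. The helper names are my own, and the model tag is simply the one used in the command above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt, model="gemma4:31b"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, model="gemma4:31b", timeout=300):
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# With the Ollama server running and the model pulled:
# print(ask("Explain unified memory in one sentence."))
```

Setting stream to False returns one JSON object instead of a stream of partial tokens, which keeps the example short at the cost of waiting for the full answer.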

Expected Performance on Apple Silicon

Mac Configuration | Quantisation | Approx. Speed | Notes
M4 Max 128GB | Q4_K_M | 40-50 tok/s | Very fast local inference
M3/M4 Pro 36GB | Q4_K_M | 20-35 tok/s | Comfortable for extended use
M2/M3 Pro 24GB | Q4_K_M | 15-25 tok/s | Usable, context size matters
M1/M2 16GB | Q3 | 8-15 tok/s | Tight, consider 26B MoE or E4B

For reference, human reading speed is roughly 4-5 tokens per second. Even the slower configurations are still readable in real time.

MLX: The Apple Silicon Optimiser

If you want to squeeze more performance out of your Mac, look into MLX, Apple’s machine learning framework optimised specifically for Apple Silicon. Community support for Gemma 4 landed almost immediately, and MLX-optimised models can outperform GGUF-based inference on the same hardware.

pip install mlx-lm
mlx_lm.generate --model Phipper/gemma-4-31b-it-mlx-4bit --prompt "Hello, world"

The trade-off: MLX requires more manual setup than Ollama. For most users, Ollama is the right starting point. For performance enthusiasts, MLX is where things get fun.

Mac running local AI model, dark technical setup
Apple Silicon’s unified memory architecture makes Macs surprisingly capable local LLM machines.

The Complete Gemma 4 Family: Four Models, Four Use Cases

The 31B dense flagship is the headline act, but Google shipped three other models in the same family, and understanding the full lineup matters because the right model for you depends on what you have in your pocket, on your desk, or in your rack. Here is the entire family at a glance:

Model | Architecture | Effective Params | Context | Modalities | Q4 Memory
E2B | Dense (edge) | 2.3B | 128K | Text, Image, Audio | ~3.2 GB
E4B | Dense (edge) | 4.5B | 128K | Text, Image, Audio | ~5 GB
26B A4B | MoE (128 experts) | 3.8B active | 256K | Text, Image | ~15.6 GB
31B | Dense | 30.7B | 256K | Text, Image | ~17.4 GB

Two things jump out immediately. First, the smaller models are the ones with audio support, not the flagship. The E2B and E4B each carry a dedicated ~300M-parameter audio encoder that the larger models lack. Second, the edge models use a technique called Per-Layer Embeddings (PLE), which gives each decoder layer its own small embedding table for every token. These tables are large but only used for lookups, which is why the “effective” parameter count is much smaller than the total on disk.

Gemma 4 E2B: The Phone Model

E2B has 5.1 billion total parameters but only 2.3 billion effective, and it fits in roughly 3.2 GB at Q4 quantisation. This is small enough to run on a three-year-old smartphone. Through the Google AI Edge Gallery app (available on both iOS and Android), you can download E2B at about 2.5 GB on disk and start chatting with it entirely offline.

The performance claim that shocked the community: E2B beats Gemma 3 27B on most benchmarks despite being roughly 12x smaller in effective parameters. One early tester running it on a basic i7 laptop with 32 GB RAM reported it was “not only faster, it gives significantly better answers” than Qwen 3.5 4B for finance analysis. On a phone, users are seeing roughly 30 tokens per second, which is genuinely conversational speed.

For a Mac with only 8 GB of unified memory, E2B at Q4 is the safe bet. It leaves plenty of headroom for macOS and whatever else you are running. Install it with:

ollama run gemma4:e2b

Gemma 4 E4B: The Best Small Model You Can Run Anywhere

E4B is the sweet spot for anyone who wants something meaningfully smarter than E2B without jumping to the heavyweight models. At 8 billion total parameters (4.5B effective) and ~5 GB at Q4, it fits comfortably on any Mac with 16 GB of memory and leaves room for a browser, an IDE, and Slack running simultaneously.

E4B is the model David Ondrej demonstrated running on his iPhone 16 Pro Max in the video, and it was clearly usable at conversational speeds. The Edge Gallery app lists it at 3.6 GB on disk. On a phone with a modern chip, expect 20-30 tokens per second. On a Mac with 16 GB, expect 40-60+ tokens per second since the model is small enough to stay entirely in the GPU memory partition.

Crucially, E4B supports native audio input alongside text, image, and video. That means on-device speech recognition, spoken language understanding, and audio analysis, all without sending a byte off your machine. The 31B flagship cannot do any of this.

ollama run gemma4:e4b

Gemma 4 26B A4B: The Speed Demon

The 26B MoE is the model for people who want high-end quality at dramatically lower latency. Despite having 25.2 billion total parameters, only 3.8 billion are active per token, which means it runs at roughly the speed of a 4B dense model whilst retaining the knowledge capacity of a 25B model.

Real-world benchmarks from Kartikey Chauhan’s testing on a 12 GB VRAM Nvidia card show 44.2 tokens per second for text at 128K context and 42.1 tok/s for vision at 64K context. Those are server-grade numbers from consumer hardware.

On a Mac with 16 GB of unified memory, the 26B A4B at Q4 (~15.6 GB) is technically possible but tight. You will be at the limit of available memory, and macOS itself needs headroom. A 24 GB Mac runs it comfortably. For 16 GB Macs, be conservative with context length and expect some performance degradation from memory pressure.

ollama run gemma4:26b

Quality vs Size: What You Actually Lose at Each Step Down

The perennial question with model families is: how much quality do you sacrifice for each size reduction? With Gemma 4, Google published enough benchmark data to answer this precisely. Here is the full family compared side by side:

Benchmark | 31B Dense | 26B MoE | E4B | E2B | Gemma 3 27B
MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6%
AIME 2026 (Maths) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8%
LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1%
GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4%
MMMU Pro (Vision) | 76.9% | 73.8% | 52.6% | 44.2% | 49.7%
MMMLU (Multilingual) | 88.4% | 86.3% | 76.6% | 67.4% | 70.7%
Tau2 Agentic (avg over 3) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2%
Codeforces ELO | 2,150 | 1,718 | 940 | 633 | 110

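The pattern is clear: the 31B-to-26B step is almost free. You lose roughly 2-3 percentage points on most benchmarks but gain dramatically faster inference. The exceptions are agentic tool use (68.2% vs 76.9% on Tau2) and Codeforces (1,718 vs 2,150 ELO), where the dense model keeps a real lead. Outside those two, this is the best trade-off in the entire lineup. The 26B MoE at 88.3% on AIME is essentially indistinguishable from the 31B’s 89.2% for any practical purpose.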

The 26B-to-E4B step is where the cliff hits. You go from 88.3% to 42.5% on AIME, from 77.1% to 52.0% on LiveCodeBench, and from 68.2% to 42.2% on Tau2 agentic tasks. This is where “frontier local model” becomes “capable assistant.” E4B is excellent for its size, but it is not in the same league as the two larger models for maths, competitive coding, or complex tool use.

The E4B-to-E2B step is gentler than expected. E2B typically loses 5-15 percentage points versus E4B, which is surprisingly modest given the 2x parameter difference. For basic Q&A, translation, summarisation, and conversational use, E2B is genuinely useful. It comes close to Gemma 3 27B on multilingual tasks (67.4% vs 70.7%) and beats it outright on maths: E2B’s AIME score of 37.5% vs Gemma 3’s 20.8% is a clear win.

Perhaps the most striking result in the table: E2B scores 24.5% on Tau2 agentic tasks versus Gemma 3 27B’s 16.2%. A model you can run on a phone outperforms last year’s full-size model at tool use by a clear margin. Meanwhile, the 31B’s 76.9% average across all three Tau2 domains is nearly 5x what Gemma 3 managed. That is not an incremental improvement; it is strong evidence that architecture and training progress can matter more than raw scale.
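To put a number on each step down the family, the sketch below averages the percentage-point loss across the seven percentage-based rows of the comparison table. Codeforces is left out because it is a rating, not a percentage. The scores are copied from the table; the helper name and the unweighted averaging are my own choices.

```python
# Scores (%) copied from the comparison table, ordered 31B, 26B MoE, E4B, E2B.
SCORES = {
    "MMLU Pro": [85.2, 82.6, 69.4, 60.0],
    "AIME 2026": [89.2, 88.3, 42.5, 37.5],
    "LiveCodeBench v6": [80.0, 77.1, 52.0, 44.0],
    "GPQA Diamond": [84.3, 82.3, 58.6, 43.4],
    "MMMU Pro": [76.9, 73.8, 52.6, 44.2],
    "MMMLU": [88.4, 86.3, 76.6, 67.4],
    "Tau2 Agentic": [76.9, 68.2, 42.2, 24.5],
}
STEPS = ["31B -> 26B MoE", "26B MoE -> E4B", "E4B -> E2B"]

def mean_drop_per_step(scores):
    """Average percentage-point loss for each step down the family."""
    drops = []
    for i in range(len(STEPS)):
        losses = [row[i] - row[i + 1] for row in scores.values()]
        drops.append(round(sum(losses) / len(losses), 1))
    return dict(zip(STEPS, drops))

for step, drop in mean_drop_per_step(SCORES).items():
    print(f"{step}: -{drop} points on average")
```

An unweighted average hides the spread between benchmarks, so treat it as a summary of the table and not a substitute for reading the rows that matter to your workload.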

Running Every Gemma 4 Model: A Hardware Decision Tree

Here is the practical guide to matching your hardware to the right model. Start from whatever you own and work your way to the best model it can handle:

Your Hardware | Best Gemma 4 Model | Quantisation | Expected Speed | Quality Tier
iPhone / Android (3+ years old) | E2B | INT4 | ~30 tok/s | Good assistant, basic coding
iPhone / Android (recent) | E4B | INT4 | 20-30 tok/s | Strong assistant, decent coding
Mac M1/M2 8GB | E2B or E4B | Q4 | 50-80 tok/s | Good assistant with audio
Mac M1/M2/M3 16GB | E4B (safe) or 26B A4B (tight) | Q4 | 40-60 / 15-25 tok/s | Strong / Near-frontier
Mac M2/M3/M4 Pro 24GB | 26B A4B or 31B | Q4 | 25-40 / 15-25 tok/s | Near-frontier / Frontier
Mac M3/M4 Pro 36GB | 31B | Q4 or Q8 | 20-35 tok/s | Frontier
Mac M3/M4 Max 64GB+ | 31B | BF16 | 40-50 tok/s | Frontier, full precision
Nvidia GPU 12GB VRAM | 26B A4B | Q5 | ~44 tok/s | Near-frontier

The 12 GB Nvidia GPU result deserves special mention. Kartikey Chauhan’s detailed benchmarking of the 26B A4B on a 12 GB card using llama.cpp showed 44.2 tokens per second for text at 128K context and 42.1 tok/s for vision at 64K context. He reported that the model is “an excellent default” for daily interactive use, with stable generation and no constant OOM babysitting once the right memory profile is set. The key was using fit-based GPU placement rather than forcing everything into VRAM.

For the edge models on phones, the Google AI Edge Gallery app is genuinely the easiest path. Download it, pick E2B or E4B, wait for the 2.5-3.6 GB download, and start chatting. Everything runs offline, nothing leaves your device, and the models support function calling for agentic tasks directly on the phone.

The 16 GB Mac Dilemma

The most common question in the community: “I have a MacBook with 16 GB, can I run the good stuff?” The honest answer is nuanced:

  • E4B at Q4 (~5 GB): Runs beautifully. Fast, responsive, with plenty of headroom. This is the comfortable choice.
  • 26B A4B at Q4 (~15.6 GB): Technically fits but leaves almost no room for macOS and apps. Expect memory pressure, swap usage, and slower generation as context grows. Usable for short conversations; painful for long ones.
  • 31B at Q4 (~17.4 GB): Does not fit. You will hit swap immediately, and inference will crawl.

If you have a 16 GB Mac and want the best possible quality, the 26B A4B is your ceiling, but keep context short and close other apps. If you want a smooth, reliable experience, E4B is the pragmatic winner. It scores 52% on LiveCodeBench (enough for practical coding help), 58.6% on GPQA Diamond (solid science reasoning), and it can process audio natively, which neither of the larger models can.
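The same reasoning can be written as a small lookup. This sketch picks the largest model whose Q4 weights fit after reserving headroom for macOS and your other apps. The sizes come from the family table; the 4 GB default headroom is an assumption of mine, not a measured figure, so adjust it to match how you actually use your machine.

```python
# Q4 footprints (GB) from the family table, largest model first.
Q4_SIZE_GB = [("31B", 17.4), ("26B A4B", 15.6), ("E4B", 5.0), ("E2B", 3.2)]

def largest_comfortable_model(unified_memory_gb, os_headroom_gb=4.0):
    """Largest model whose Q4 weights fit after reserving headroom for the OS."""
    budget = unified_memory_gb - os_headroom_gb
    for name, size_gb in Q4_SIZE_GB:
        if size_gb <= budget:
            return name
    return None

for memory in (8, 16, 24):
    print(f"{memory} GB Mac -> {largest_comfortable_model(memory)}")
```

Setting the headroom to zero reproduces the "technically fits" case for the 26B A4B on a 16 GB Mac, which is a useful reminder that fitting and running comfortably are different things.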

How Good Are These Models for Coding?

If you are a developer considering local models as a coding assistant, the benchmark numbers matter less than a straight answer: can this thing actually help me write code? Here is the honest breakdown for each model, using LiveCodeBench v6 (real coding tasks, not just function completion) and Codeforces ELO (competitive problem solving) as the primary yardsticks:

Model | LiveCodeBench v6 | Codeforces ELO | Comparable To | Practical Coding Level
E2B | 44.0% | 633 | GPT-3.5-class | Handles boilerplate, simple functions, basic refactors. Struggles with multi-file logic or complex algorithms.
E4B | 52.0% | 940 | GPT-4o-mini / Claude 3.5 Haiku | Writes working functions, understands context, handles standard patterns. The level that powers most “free tier” coding assistants.
26B A4B | 77.1% | 1,718 | GPT-4o / Claude 3.5 Sonnet | Strong coder. Handles multi-step problems, debugging, architectural reasoning, and non-trivial algorithms reliably.
31B | 80.0% | 2,150 | Claude Sonnet 4.5 | Frontier-class. Solves most competitive programming problems and writes production-quality code with real architectural awareness.

The Codeforces 1,718 ELO for the 26B MoE puts it in the “Expert” band (1,600-1,899), one tier below “Candidate Master”, meaning it can solve the majority of interview-style programming problems and a solid chunk of competitive challenges. The 31B at 2,150 ELO is in “Master” territory. For context, Gemma 3 27B scored 110 ELO on the same benchmark. That is not a typo.

The practical takeaway: if you have the memory for the 26B A4B or 31B, you have a genuinely capable local coding assistant that rivals the paid API models most developers use today. If you are limited to E4B, you still get a useful companion for everyday development, roughly on par with the models that power free-tier tools like GitHub Copilot’s lighter backend. E2B is better suited for quick scripting help, code explanation, and boilerplate generation than for serious algorithmic work.

A Suggested Workflow for Constrained Hardware

If your Mac cannot comfortably run the 26B or 31B, a practical approach is to run E4B as your always-on local model for inline help, autocomplete, and quick questions, then fall back to a cloud API (Claude, GPT-4o, or Gemma 4 31B via Google AI Studio, which offers a free tier) for the 20% of problems where E4B is not enough. You get speed and privacy for the easy stuff, and quality for the hard stuff.
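In code, that workflow is a simple local-first router. The sketch below is deliberately backend-agnostic: the two model callables and the good_enough check are hypothetical placeholders you would wire up to your own local runtime, cloud client, and quality heuristic (a test suite passing, a linter, a length check, or a human click).

```python
from typing import Callable, Tuple

def answer_with_fallback(
    prompt: str,
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
    good_enough: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try the local model first; escalate only if the check rejects its answer."""
    draft = local_model(prompt)
    if good_enough(prompt, draft):
        return "local", draft
    return "cloud", cloud_model(prompt)

# Stub backends so the routing logic can be exercised without any model running.
def fake_local(prompt):
    return "" if "hard" in prompt else "local answer"

def fake_cloud(prompt):
    return "cloud answer"

def non_empty(prompt, answer):
    return bool(answer.strip())

print(answer_with_fallback("easy question", fake_local, fake_cloud, non_empty))
print(answer_with_fallback("hard question", fake_local, fake_cloud, non_empty))
```

The prompt only leaves your machine when the local draft fails the check, which is what preserves the privacy benefit for the easy majority of requests.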

CPU-Only Servers: Running Gemma 4 Without a GPU

Not everyone runs inference on a laptop or a gaming PC. If you have access to a rack server, a cloud VM, or any x86 machine with a lot of RAM but no GPU, Gemma 4 still works. The entire family runs on CPU-only hardware via llama.cpp, Ollama, or vLLM.

The key constraint on CPU-only inference is memory bandwidth, not compute. LLM token generation is fundamentally a memory-bound operation: the model reads weights from RAM for every token. A typical DDR4 server delivers 40-80 GB/s of memory bandwidth, versus 200-400 GB/s on Apple Silicon or 900+ GB/s on an Nvidia A100. Extra CPU cores help with prompt ingestion (prefill) but barely move the needle on generation speed.

Here is what to expect on a typical high-core-count x86 server with DDR4 (e.g., a dual-socket Xeon or EPYC with 256-384 GB RAM):

Model | Precision | RAM Used | Est. Generation Speed | Best Use Case
E2B | BF16 | ~10 GB | 15-30 tok/s | High-throughput batch processing, multi-worker serving
E4B | BF16 | ~16 GB | 10-20 tok/s | Quality-per-watt sweet spot for CPU serving
26B A4B | BF16 | ~50 GB | 8-15 tok/s | Near-frontier quality, MoE helps since less data moves per token
31B | BF16 | ~58 GB | 3-8 tok/s | Maximum quality when latency is not critical

With 384 GB of RAM, you can run the 31B at full BF16 precision with no quantisation loss at all. Most consumer setups cannot do this. The trade-off is generation speed: expect 3-8 tokens per second for the 31B on DDR4, which hovers around human reading speed (~4-5 tok/s) rather than sitting comfortably above it. That is still usable for batch jobs, API backends, or any workflow where you do not need instant responses.

The 26B MoE is the star on CPU-only servers. Because only 3.8B parameters are active per token, it moves far less data through the memory bus than the 31B dense model, which means the memory-bandwidth bottleneck hurts less. Expect 8-15 tok/s at full precision, which is genuinely conversational speed, with quality only 2-3 percentage points behind the flagship on most benchmarks.
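The size of that advantage is easy to estimate. The sketch below compares how much weight data each model streams from RAM per generated token, assuming BF16 (2 bytes per parameter) and assuming only active parameters are read. It is a simplified model: KV cache reads, CPU caches, and NUMA effects are ignored, and the function name is my own.

```python
def weight_gb_read_per_token(active_params_billion, bytes_per_param=2.0):
    """Approximate weight data (GB) streamed from RAM to generate one token."""
    return active_params_billion * bytes_per_param

dense_gb = weight_gb_read_per_token(30.7)  # 31B dense: every weight is read
moe_gb = weight_gb_read_per_token(3.8)     # 26B MoE: only active parameters are read
traffic_ratio = dense_gb / moe_gb

print(f"31B dense: ~{dense_gb:.1f} GB per token")
print(f"26B MoE:   ~{moe_gb:.1f} GB per token")
print(f"Dense moves ~{traffic_ratio:.1f}x more weight data per token")
```

Real-world speedups will be smaller than this ratio suggests, because routing overhead and non-weight memory traffic do not shrink with sparsity.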

For serving multiple concurrent users, consider running several E4B instances across the server’s cores rather than one large model. Each instance uses ~16 GB at BF16, so you could run 10+ parallel workers within 384 GB of RAM, giving you high aggregate throughput for an internal team.
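The worker count is plain division once you decide how much RAM to hold back. In this sketch the 32 GB reserve for the operating system, KV caches, and other services is an assumption of mine; the per-instance figure is the ~16 GB quoted above.

```python
def max_parallel_workers(total_ram_gb, per_instance_gb, reserved_gb=32):
    """How many model instances fit after holding back some RAM for everything else."""
    usable = max(0, total_ram_gb - reserved_gb)
    return int(usable // per_instance_gb)

workers = max_parallel_workers(total_ram_gb=384, per_instance_gb=16)
print(f"E4B BF16 workers that fit in 384 GB: {workers}")
```

Memory is only the ceiling. Because all workers share the same memory bus, per-worker speed will drop as you add instances, so benchmark before settling on a number.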

Multimodal Capabilities: What It Can and Cannot See

Gemma 4 31B is multimodal for vision, accepting both text and images as input with text output. It includes a ~550M-parameter vision encoder and supports variable aspect ratios and resolutions. Supported vision tasks include:

  • Object detection and description – identify and describe objects in images
  • Document and PDF parsing – extract structure and text
  • OCR – including multilingual OCR
  • Chart comprehension – read graphs and visual data
  • Screen and UI understanding – parse app screenshots and interfaces
  • Video understanding – analyse sequences of frames

On MMMU Pro, Gemma 4 31B scores 76.9%, up from Gemma 3’s 49.7%. That is a serious jump in multimodal quality.

What it cannot do: the 31B model does not support audio input. Audio is only available on E2B and E4B. So if you need speech recognition or spoken language understanding, the small models are actually more capable in that modality than the flagship.

140+ Language Support

Gemma 4 is trained on over 140 languages, with out-of-the-box support for 35+ languages. Community testing suggests it is especially strong on multilingual tasks, and the official MMMLU score of 88.4% backs that up.

“Natively trained on over 140 languages, Gemma 4 helps developers build inclusive, high-performance applications for a global audience.” – Google AI for Developers, Gemma 4 Model Overview

This multilingual strength is one of Gemma 4’s real differentiators. If you build products for non-English audiences, this is not a side feature, it is the feature.

Choosing the Right Model: A Practical Decision Guide

With four models in the family, the question is no longer “should I run Gemma 4?” but “which Gemma 4?” Here is the decision matrix:

  • You have 24 GB+ and want the absolute best quality: Run the 31B dense. It is the quality ceiling of the family.
  • You have 24 GB+ but care about speed: Run the 26B A4B MoE. You lose 2-3% on benchmarks but gain roughly 2-4x faster inference. For most real tasks, you will not notice the quality difference.
  • You have a 16 GB Mac: The E4B is your best realistic option. The 26B A4B technically fits at Q4 but will struggle with memory pressure. E4B leaves comfortable headroom and still scores above Gemma 3 27B on key benchmarks.
  • You have an 8 GB Mac or a phone: Run E2B. At ~3.2 GB it fits anywhere, and it still beats Gemma 3 27B on maths and coding benchmarks despite being 12x smaller.
  • You need audio processing: Only E2B and E4B support native audio input. The 31B and 26B cannot hear anything.
  • You want to run AI entirely offline on your phone: Install the Google AI Edge Gallery app and pick E2B (2.5 GB) or E4B (3.6 GB). Everything runs locally, no data leaves your device.
  • You need the longest possible context: Only the 31B and 26B support 256K tokens. The edge models cap at 128K.
  • You want the absolute fastest time-to-first-token: E2B is the speed king, though E4B is close behind.

What to Check Right Now

  • Check your Mac’s unified memory (Apple menu, About This Mac). Match it to the hardware decision tree above to find your optimal model.
  • Install Ollama and try the model that fits your hardware:
    • ollama run gemma4:e2b – any Mac (3.2 GB); on phones, use the Edge Gallery app instead
    • ollama run gemma4:e4b – 8 GB+ Macs (5 GB)
    • ollama run gemma4:26b – 16 GB+ Macs, tight fit (15.6 GB)
    • ollama run gemma4:31b – 24 GB+ Macs (17.4 GB)
  • Try the Edge Gallery on your phone. Download the Google AI Edge Gallery (iOS and Android), grab E2B or E4B, and chat completely offline.
  • Compare against your paid model. Try your real prompts, not toy benchmarks. The 31B matches Claude Sonnet 4.5 on Arena; the E2B beats Gemma 3 27B on maths. Test them yourself.
  • Test the 26B MoE if you have the RAM. It is the best speed-to-quality ratio in the family: 44 tok/s on a 12 GB Nvidia card, and only 2-3% behind the 31B on benchmarks.
  • Watch for better quantisations and QAT releases. Unsloth, MLX Community, and other groups are actively improving the quantised variants. Quality improvements are still landing.
  • Take the Apache 2.0 licence seriously. Commercial use, modification, redistribution, and fine-tuning are all on the table for every model in the family.
AI model benchmarks and leaderboard data visualisation, dark analytical display
Gemma 4 31B sits among much larger frontier models on Arena AI, at a fraction of the size and cost.

Video Attribution


This article was inspired by David Ondrej’s video covering the Gemma 4 release. The analysis, benchmarks, architecture deep-dive, and Mac deployment guide are original research drawing from Google DeepMind’s official documentation, the Arena AI leaderboard, community testing, and the Hugging Face model card.

nJoy 😉

Context Graphs: The Knowledge Layer Your RAG Pipeline Is Missing (Or Does Not Need)

Your RAG pipeline is lying to you. Not maliciously, of course, but with the quiet confidence of a student who memorised the textbook’s index but never read a chapter. You feed it documents, it chunks them, embeds them, and when you ask a question it retrieves whichever chunks look “sort of similar” and hopes the LLM can stitch together a coherent answer. Sometimes it works. Sometimes it tells you Tokyo has 36 million people because it averaged two contradictory chunks. And you have no way to know which answer is real, because Vector RAG has no concept of “real”. It only knows “similar”. Context graphs are what happens when you decide similarity is not enough, and you want your AI to actually understand the relationships between things. TrustGraph just shipped a demo that shows exactly what that looks like in practice, and it is worth paying attention to.

Dark abstract visualization of interconnected knowledge graph nodes with cyan and amber connections on black background
Context graphs: where every node knows its neighbours and can prove where it got its information.

What Context Graphs Actually Are (and Why They Are Not Just Knowledge Graphs With a Rebrand)

A context graph is a knowledge graph that has been specifically engineered for consumption by AI models. That sentence sounds like marketing, so let us unpack it. A traditional knowledge graph stores millions of entities and relationships, optimised for human querying and data warehousing. Brilliant for analysts running SPARQL queries. Terrible for an LLM with a context window that starts forgetting things after a few thousand tokens.

Context graphs solve this by dynamically extracting focused subgraphs based on query relevance. Instead of dumping the entire graph into the prompt, you extract only the entities and relationships that matter for this specific question, scored by relevance, annotated with provenance, and formatted to minimise token waste. TrustGraph’s own documentation claims a 70% token reduction in their structured-versus-prose comparison. That number is plausible for the specific example they show (a simple entity lookup), but it is a vendor benchmark, not an independent evaluation, and the savings will vary dramatically depending on query complexity, graph density, and how much context the LLM actually needs.

“Context graphs are knowledge graphs specifically engineered and optimized for consumption by AI models. They extend traditional knowledge graphs by incorporating AI-specific optimizations like token efficiency, relevance ranking, provenance tracking, and hallucination reduction.” — TrustGraph, Context Graphs Guide

Think of the distinction this way. A knowledge graph is your entire library. A context graph is the specific stack of books your librarian pulls when you ask a particular question, each one bookmarked at the relevant page, with a note explaining why it was selected. The librarian remembers which shelf each book came from, when it was last updated, and how confident she is that the information is still correct. That is what provenance tracking and relevance scoring give you.

Here is the structural difference in compact form:

// Traditional knowledge graph: everything, all at once
const knowledgeGraph = {
  entities: [/* millions */],
  relationships: [/* tens of millions */]
};

// Context graph: query-specific, AI-optimised
const contextGraph = {
  query: "Who leads TechCorp?",
  entities: [
    { name: "Alice Johnson", role: "CEO", relevance: 0.95 },
    { name: "TechCorp", industry: "Enterprise Software", relevance: 0.92 }
  ],
  relationships: [
    { from: "Alice Johnson", to: "TechCorp", type: "leads", relevance: 0.90 }
  ],
  metadata: { tokensUsed: 350, confidenceScore: 0.94, sources: ["hr_database"] }
};

The verbose natural-language equivalent of that context graph would cost 150 tokens. The structured version costs 45. Same information, a third of the price. As Martin Kleppmann writes in Designing Data-Intensive Applications, the way you structure your data determines what questions you can efficiently answer. Context graphs are structured specifically to answer LLM questions efficiently.
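The extraction step itself can be sketched in a few lines of Python. This is a toy illustration of relevance-filtered subgraph selection, not TrustGraph's implementation: the function name, the threshold, and the "Office Canteen" node are all hypothetical, and relevance scores are assumed to have been computed already.

```python
def extract_context(entities, relationships, min_relevance=0.5, max_entities=10):
    """Keep the most relevant entities, then only the edges that connect them."""
    ranked = sorted(entities, key=lambda e: e["relevance"], reverse=True)
    kept = [e for e in ranked if e["relevance"] >= min_relevance][:max_entities]
    names = {e["name"] for e in kept}
    edges = [r for r in relationships if r["from"] in names and r["to"] in names]
    return {"entities": kept, "relationships": edges}

entities = [
    {"name": "Alice Johnson", "relevance": 0.95},
    {"name": "TechCorp", "relevance": 0.92},
    {"name": "Office Canteen", "relevance": 0.10},  # hypothetical low-relevance node
]
relationships = [
    {"from": "Alice Johnson", "to": "TechCorp", "type": "leads"},
    {"from": "TechCorp", "to": "Office Canteen", "type": "operates"},
]

context = extract_context(entities, relationships)
print([e["name"] for e in context["entities"]])
print([r["type"] for r in context["relationships"]])
```

Dropping an entity also drops every edge that touches it, which is where most of the token savings come from in a dense graph.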

The TrustGraph Demo: London Pubs, Craft Beer, and Why Semantics Matter

The video “Context Graphs in Action” by TrustGraph co-founders Daniel Davis and Mark Adams is a 27-minute live demo. No slides. No marketing deck. They built a context graph from data about London pubs, restaurants, and event spaces, then demonstrated something deceptively simple that reveals the entire value proposition of this technology.

They asked two questions that any human would consider identical:

  1. “Where can I drink craft beer?”
  2. “Can you recommend a pub which serves craft beer?”

Both questions returned the same answer. But when they expanded the explainability trace, the paths through the graph were completely different. The first question, being open-ended, pulled in concepts from beer gardens, festivals, events, bars, cafes, and dozens of other venue types. The second question, with the word “pub” constraining the search, produced a far narrower traversal. The grounding concepts were different. The subgraph was different. The reasoning path was different. Only the final answer happened to converge.

This is the central insight the demo drives home: two questions that feel identical to a human are semantically distinct to a machine, and context graphs let you see exactly how and why. As Daniel puts it with characteristic bluntness: “If you ask a stupid question, you might get a stupid response.” The explainability trace lets you work backwards from a bad answer and determine whether the fault lay with the query, the data, or the retrieval path.

Dark diagram showing two query paths diverging through a knowledge graph, one broad and one narrow, cyan lines on black
Same answer, wildly different reasoning paths. The explainability trace reveals what Vector RAG hides.

What the Workbench Actually Shows

The demo walks through TrustGraph’s Workbench interface (accessible at localhost:8888 after deployment). Here is what they demonstrated:

  • Document ingestion: Plain text and PDF documents about London venues are uploaded through the Library page and processed through a GraphRAG flow. TrustGraph chunks the documents, extracts entities and relationships, generates vector embeddings, and builds the knowledge graph automatically.
  • Vector search entry points: Searching for “Bermondsey” returns semantically similar terms. Clicking a result reveals the fabric of the graph: Bermondsey tube station connects to the Jubilee line, which has a type “transport line”. You can navigate relationships in 3D space.
  • 3D graph visualisation: Interactive three-dimensional exploration of graph nodes and edges. Not intended for end users (Daniel jokes it would “send everybody over the edge insane”), but invaluable for understanding graph structure during development.
  • Explainability traces: Every query records a full reasoning trace. You can see: the original query, which concepts were extracted, which graph nodes matched, which edges were traversed, why each piece of evidence was selected (with the LLM’s reasoning), and the final synthesis. All traceable back to source documents.
  • Source provenance: Every fact in the graph links back to the specific document chunk it was extracted from. You can verify: where did this information come from? When was it ingested? Is it out of date? Do we trust this source?
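To show what "every fact links back to its source" means as a data structure, here is a minimal sketch. The field names, the file name, and the staleness rule are all hypothetical; TrustGraph's actual schema is RDF-based and will look different.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SourcedFact:
    subject: str
    predicate: str
    obj: str
    source_document: str  # which document the fact was extracted from
    chunk_id: int         # which chunk of that document
    ingested_on: date     # when it entered the graph

    def is_stale(self, today: date, max_age_days: int = 365) -> bool:
        """Flag facts older than the allowed age so they can be re-verified."""
        return (today - self.ingested_on).days > max_age_days

fact = SourcedFact(
    "Bermondsey tube station", "on_line", "Jubilee line",
    source_document="london_venues.txt",  # hypothetical file name
    chunk_id=12,
    ingested_on=date(2025, 1, 15),
)
print(fact.is_stale(today=date(2026, 3, 1)))
```

With the source and ingestion date stored on the fact itself, the questions in the last bullet ("where did this come from?", "is it out of date?") become simple field lookups.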

The Ontology Question

Mark Adams demonstrates both approaches: schema-free extraction (GraphRAG) where the LLM discovers relationships freeform, and ontology-driven extraction (OntologyRAG) where a predefined schema forces precision. For the London venues demo, the ontology defines classes like “atmosphere” (cozy, creative, community spirit), “city”, “neighbourhood”, “event”, and constrains the relationships the graph will accept.

The result with ontologies is significantly more precise. Without an ontology, the LLM sometimes creates duplicate relationships with different names for the same concept. With an ontology, you control the vocabulary, and precision goes up. As Mark explains: “We force it into a much more precise structure.”
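A toy version of that constraint looks like this. The ontology below is a hypothetical three-predicate vocabulary loosely modelled on the demo's classes, and the validator simply rejects any extracted triple whose predicate or endpoint classes fall outside it. A real OWL ontology is far richer; this only illustrates why a fixed vocabulary removes duplicate relationship names.

```python
# Hypothetical ontology: predicate -> (required subject class, required object class).
ALLOWED = {
    "located_in": ("venue", "neighbourhood"),
    "has_atmosphere": ("venue", "atmosphere"),
    "hosts": ("venue", "event"),
}

def accept(triple, classes):
    """Accept a triple only if its predicate and both endpoint classes match."""
    subject, predicate, obj = triple
    if predicate not in ALLOWED:
        return False
    subject_class, object_class = ALLOWED[predicate]
    return classes.get(subject) == subject_class and classes.get(obj) == object_class

# Hypothetical entity classes for the example.
classes = {"The Kernel Taproom": "venue", "Bermondsey": "neighbourhood", "cozy": "atmosphere"}

print(accept(("The Kernel Taproom", "located_in", "Bermondsey"), classes))
print(accept(("The Kernel Taproom", "is_situated_in", "Bermondsey"), classes))
print(accept(("The Kernel Taproom", "located_in", "cozy"), classes))
```

The second triple is the kind of freeform synonym a schema-free extraction produces; with the ontology in place it is rejected instead of becoming a second edge for the same fact.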

TrustGraph sits firmly in the RDF ecosystem rather than the property graph world (Neo4j and similar). The rationale: RDF supports reification (attaching metadata to edges themselves), multi-language representations, and the OWL/SKOS ontology standards natively. These features are essential for explainability and provenance tracking.

But let us be honest about the trade-offs. RDF comes with real costs. SPARQL is notoriously harder to learn than Cypher (Neo4j’s query language). OWL ontologies require domain experts to design and maintain, and they become a governance burden as your data evolves. Property graphs with Neo4j or Memgraph are simpler to reason about, faster for most traversal patterns, and have much larger developer ecosystems. TrustGraph’s choice of RDF is defensible for provenance-heavy enterprise use cases, but it is not the only valid architecture, and for many teams a property graph with LangGraph or LlamaIndex’s knowledge graph module will be simpler to operate and good enough.

The Broader Landscape: TrustGraph Did Not Invent This

Before we go further, some necessary context. The idea of using knowledge graphs to ground LLM responses is not new, and “context graph” is not a category that TrustGraph created from scratch. It is a refined evolution of work that has been shipping in production since late 2024.

Microsoft GraphRAG published the foundational “From Local to Global” paper in April 2024, introducing community-based summarisation of knowledge graphs for query-focused retrieval. Their approach extracts entities and relationships, clusters them into hierarchical communities using the Leiden algorithm, then pre-generates summaries at each level. It is open source, integrates with Neo4j, and has an Azure solution accelerator. Microsoft also shipped LazyGraphRAG (November 2024) to address the cost problem, and BenchmarkQED (June 2025) for automated RAG evaluation.

Neo4j + LangChain/LangGraph is arguably the most widely deployed graph RAG stack in production today. Neo4j’s property graph model with Cypher queries is simpler to learn than SPARQL, has a massive developer community, and integrates directly with LangChain’s retrieval chains. For teams already running Neo4j, adding graph-enhanced RAG requires no new infrastructure.

LlamaIndex Knowledge Graphs provides a Python-native graph RAG pipeline that works with Neo4j, Nebula Graph, and others. It handles entity extraction, graph construction, and hybrid vector+graph retrieval with significantly less operational complexity than a full RDF stack.

What TrustGraph adds to this landscape is specifically the combination of RDF-native ontology support, built-in explainability traces, portable context cores, and multi-model storage (Cassandra, Qdrant, etc.) in a single open-source platform. These are genuine differentiators for provenance-heavy enterprise use cases. But if you do not need ontology enforcement or full reasoning traces, the simpler alternatives above will get you 80% of the benefit at 20% of the operational complexity.

Where Vector RAG Falls Apart (and Context Graphs Save You)

Vector RAG seemed like the answer to everything when embeddings first became cheap. Embed your documents, find similar chunks, feed them to the LLM. Fast, simple, works for demos. Then you deploy it in production and discover the failure modes.

Case 1: The Averaging Problem

You embed two documents. One says “Tokyo’s population is 37.4 million.” The other says “Tokyo has about 35 million people.” Both are semantically similar to the query “What is Tokyo’s population?” The LLM sees both chunks and generates something in between. Maybe 36 million. Confidently wrong.

// Vector RAG retrieval for "What is Tokyo's population?"
chunk_1: "Tokyo's population is 37.4 million" (similarity: 0.94)
chunk_2: "Tokyo has about 35 million people" (similarity: 0.92)
// LLM output: "Tokyo has approximately 36 million people" -- wrong

// Context graph retrieval
node: Tokyo { population: 37400000, source: "UN World Population Prospects 2024",
              confidence: 1.0, lastVerified: "2024-07-01" }
// LLM output: "Tokyo's population is 37.4 million" -- correct, sourced, verifiable

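A graph stores one canonical value per fact, with a source and a timestamp attached. The LLM has nothing to average, and any reader can trace the number back to where it came from. The value is only as correct as what was ingested, but it is unambiguous and verifiable.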

Case 2: The Multi-Hop Blindness

Ask Vector RAG: “How does climate change affect AI research funding?” It needs to traverse: climate change affects government priorities, which influence research funding allocation, which supports AI research. Each of those facts lives in a different document. Vector RAG retrieves chunks that are individually similar to the question but cannot connect them into a reasoning chain.

// Vector RAG: retrieves 3 chunks that mention some of these concepts
// but cannot chain: climate -> govt priorities -> funding -> AI research
// Result: vague, hedge-filled answer

// GraphRAG: traverses the reasoning path
climate_change --[affects]--> government_priorities
government_priorities --[influences]--> research_funding
research_funding --[supports]--> ai_research
// Result: specific, grounded answer with full provenance chain
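
To make the traversal concrete, here is a minimal sketch in Python. It assumes the three hops above have already been extracted into an edge list; the `EDGES` data, the distractor edge, and the `find_path` helper are all hypothetical illustrations, not part of any GraphRAG library.

```python
from collections import deque

# Hypothetical edge list mirroring the three hops in the example above.
EDGES = [
    ("climate_change", "affects", "government_priorities"),
    ("government_priorities", "influences", "research_funding"),
    ("research_funding", "supports", "ai_research"),
    ("climate_change", "drives", "renewable_energy"),  # distractor edge
]


def find_path(edges, start, goal):
    """Breadth-first search; returns the hops as (subject, relation, object) triples."""
    adjacency = {}
    for subj, rel, obj in edges:
        adjacency.setdefault(subj, []).append((rel, obj))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adjacency.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None  # no chain connects the two entities


path = find_path(EDGES, "climate_change", "ai_research")
for subj, rel, obj in path:
    print(f"{subj} --[{rel}]--> {obj}")
```

The returned path doubles as the provenance chain: each hop is an explicit edge you can show to the user, which is exactly what a bag of similar chunks cannot give you.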

Benchmarks reported in Iterathon’s 2026 enterprise guide show GraphRAG achieving 83-87% accuracy on complex multi-hop queries versus Vector RAG’s 68-72%. Microsoft’s own evaluation found GraphRAG improved comprehensiveness by 26% and diversity by 57% over standard vector retrieval. These numbers are promising, but a caveat: most published benchmarks come from vendors or researchers with a stake in the outcome. Independent, apples-to-apples comparisons across Microsoft GraphRAG, Neo4j + LangChain, LlamaIndex, and TrustGraph on the same dataset remain conspicuously absent from the literature.

Case 3: The Lost-in-the-Middle Catastrophe

Here is the one that should worry every engineer relying on long context windows as a substitute for proper retrieval. Research by Liu et al. at Stanford demonstrated that LLMs consistently fail to use information placed in the middle of long contexts, even when the context window is enormous.

“Language models exhibit significantly degraded performance when relevant information is positioned in the middle of long contexts, even for models explicitly designed for long-context processing.” — Liu et al., “Lost in the Middle: How Language Models Use Long Contexts”, TACL 2024

TrustGraph’s own testing points the same way for knowledge extraction. Chunks of 1,000 tokens yielded 2,153 graph edges. Chunks of 8,000 tokens yielded only 1,352. That is a 59% increase in extracted knowledge just from chunking smaller, while using only 4% of the available context window. At 500 tokens, the system extracted 2,975 edges, a 120% improvement over 8,000-token chunks. The pattern held across eight models from six providers, including the Claude, Gemini, Mistral, Cohere, and Llama families.
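
The percentages are easy to check. This short Python sketch recomputes them from the edge counts quoted above; the variable and function names are mine, chosen only for illustration.

```python
# Edge counts per chunk size, as quoted in the paragraph above.
edges_by_chunk_size = {500: 2975, 1000: 2153, 8000: 1352}
baseline = edges_by_chunk_size[8000]


def percent_increase(value, base):
    """Relative gain of `value` over `base`, in percent."""
    return (value - base) / base * 100


gains = {
    size: percent_increase(count, baseline)
    for size, count in edges_by_chunk_size.items()
}
for size, gain in sorted(gains.items()):
    count = edges_by_chunk_size[size]
    print(f"{size:>5}-token chunks: {count} edges ({gain:+.0f}% vs 8,000-token chunks)")
```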

Long context windows do not work as a substitute for retrieval. Not because the models are bad, but because the transformer attention mechanism appears to dilute focus as token count rises, which may be inherent to the architecture itself. Context graphs sidestep the problem: instead of cramming everything into a massive context, you extract a small, focused, structured subgraph. The LLM gets what it needs and little else.

Dark chart showing declining knowledge extraction as chunk size increases, with cyan data points on black background
Bigger context windows, worse extraction. The lost-in-the-middle effect is real and it is not getting better.

How to Actually Deploy This: From Zero to Context Graph

TrustGraph is open source (Apache 2.0) and deploys via Docker Compose in minutes. Here is the real pipeline, not the marketing version:

Step 1: Configure and Deploy

# Install and configure TrustGraph
npx @trustgraph/config

# Interactive prompts:
# ? Select your LLM provider: Anthropic / OpenAI / Google / Mistral / Ollama
# ? Select deployment target: Docker / Kubernetes / Minikube
# Generates docker-compose.yaml and INSTALLATION.md

# Deploy
docker compose up -d

# Workbench available at http://localhost:8888
# Grafana monitoring at http://localhost:3000

Step 2: Ingest Documents and Build the Graph

# Create a collection
tg-set-collection \
  -n "Company Docs" \
  -d "Internal documentation" \
  company-docs

# Add a document
tg-add-library-document \
  --name "Security Policy 2025" \
  --id doc-security-2025 \
  --kind application/pdf \
  documents/security-policy.pdf

# Create a GraphRAG flow (no ontology needed)
tg-start-flow \
  -n graph-rag \
  -i security-graphrag \
  -d "Security document knowledge extraction"

# Process the document
tg-start-library-processing \
  --flow-id security-graphrag \
  --document-id doc-security-2025 \
  --collection company-docs

Step 3: Query With Explainability

# GraphRAG query with full provenance
tg-invoke-graph-rag \
  -f security-graphrag \
  -C company-docs \
  -q "What are our top cybersecurity vulnerabilities?"

# Or via the REST API
curl -X POST http://localhost:8001/api/invoke/graph-rag \
  -H "Content-Type: application/json" \
  -d '{
    "flow-id": "security-graphrag",
    "collection": "company-docs",
    "query": "What are our top cybersecurity vulnerabilities?",
    "max-entities": 50,
    "relevance-threshold": 0.7,
    "include-provenance": true
  }'

The TypeScript client library (@trustgraph/client) provides WebSocket-based real-time communication for building production UIs. Python and CLI interfaces are also available.

Step 4: Add Ontologies for Precision (Optional but Recommended)

# Upload an OWL ontology
cat domain-ontology.owl | tg-put-config-item \
  --type ontology \
  --key security-ontology \
  --stdin

# Create an OntologyRAG flow
tg-start-flow \
  -n onto-rag \
  -i security-onto-rag \
  -d "Ontology-driven security knowledge extraction"

# Process with ontology enforcement
tg-start-library-processing \
  --flow-id security-onto-rag \
  --document-id doc-security-2025 \
  --collection company-docs

The Unglamorous Reality: What Graph RAG Actually Costs You

Every GraphRAG vendor demo shows the happy path. Here is what they leave out.

Ingestion Is Expensive and Slow

Building a knowledge graph requires running every document chunk through an LLM for entity and relationship extraction. This is not free. Microsoft’s original GraphRAG architecture dedicates roughly 75% of total indexing cost to graph extraction alone. One production deployment reported $33,000 in indexing costs for a large dataset before a single query was run. A 10,000-document corpus that costs under $5 to embed in a vector database costs $50-200 to process through a GraphRAG pipeline. For context: that is a 10-40x cost multiplier at ingestion time.

Entity Resolution Is the Silent Killer

When your LLM extracts entities from thousands of documents, it will create duplicates. “IBM”, “International Business Machines”, “IBM Corp”, and “Big Blue” are all the same entity. If your entity resolution accuracy drops below roughly 85%, the errors compound exponentially through multi-hop queries. At 85% accuracy with 5 hops, fewer than half your answers remain trustworthy (0.85^5 = 44%). This is not a theoretical problem; it is the most common failure mode in production GraphRAG systems, and neither TrustGraph nor anyone else has fully solved it.
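
The compounding is worth seeing in numbers. The sketch below assumes every hop succeeds or fails independently with the same probability, which is a simplification; the `chain_reliability` helper is hypothetical.

```python
def chain_reliability(per_hop_accuracy, hops):
    """Probability that every hop in a chain is correct, assuming independent errors."""
    return per_hop_accuracy ** hops


for accuracy in (0.85, 0.95, 0.99):
    row = ", ".join(
        f"{hops} hops: {chain_reliability(accuracy, hops):.0%}" for hops in (1, 3, 5)
    )
    print(f"per-hop accuracy {accuracy:.0%} -> {row}")
```

Under that assumption, lifting per-hop accuracy from 85% to 95% moves a 5-hop chain from under half to roughly three quarters trustworthy, which is why entity resolution tends to repay the effort spent on it.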

Ontology Maintenance Is a Governance Burden

TrustGraph’s OntologyRAG produces more precise graphs, no question. But someone has to design that ontology, maintain it as your domain evolves, and ensure new documents conform to the schema. In practice, this means a dedicated knowledge engineer or a committee that reviews and updates the ontology quarterly. For organisations that already struggle to maintain a data dictionary, adding OWL ontology governance is a non-trivial ask.

Three Indexes, Three Consistency Problems

Production graph RAG requires keeping three synchronized stores: a graph index for structural traversal, a vector index for semantic similarity, and often a text index for full-text search. Every document addition, update, or deletion must propagate across all three and trigger entity resolution re-evaluation. This is, bluntly, a data engineering nightmare that most demos conveniently skip.

Extraction Hallucinations Are Real

The LLM that extracts entities and relationships from your documents will hallucinate some of them. It will invent relationships that do not exist in the source text, misattribute properties, and occasionally create phantom entities. These extraction hallucinations then become “facts” in your knowledge graph, where they are retrieved with the same confidence score as legitimate data. Garbage in, graph out. Every production deployment needs a quality assurance pipeline to catch extraction errors, and most teams underestimate this effort.
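
As one example of what a cheap first-pass check might look like, the Python sketch below rejects any extracted triple whose subject or object never appears in the source chunk. Everything here (the `is_grounded` helper, the sample chunk, the triples) is made up for illustration, and the check only catches phantom entities; an invented relationship between two real entities would pass straight through.

```python
def is_grounded(triple, source_text):
    """True when both the subject and the object literally appear in the source chunk."""
    subject, _relation, obj = triple
    text = source_text.lower()
    return subject.lower() in text and obj.lower() in text


chunk = "Acme Corp acquired Widget Ltd in 2021 to expand its robotics division."
extracted = [
    ("Acme Corp", "acquired", "Widget Ltd"),
    ("Acme Corp", "partnered_with", "Globex"),  # Globex never appears in the chunk
]

accepted = [t for t in extracted if is_grounded(t, chunk)]
rejected = [t for t in extracted if not is_grounded(t, chunk)]
print("accepted:", accepted)
print("rejected:", rejected)
```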

Query Latency Is Not Milliseconds

Vector search returns results in single-digit milliseconds. Graph RAG queries involve: vector lookup to find entry points, graph traversal across multiple hops, LLM-based relevance scoring of candidate edges, subgraph assembly, and finally LLM generation. End-to-end latency is typically 2-15 seconds depending on graph size and traversal depth. For interactive applications where users expect sub-second responses, this is a hard constraint that no amount of clever engineering fully eliminates.

When Context Graphs Are Essential (Real Use Cases)

Context graphs are not a universal hammer. They are a precision instrument for specific categories of problem. Here is where they earn their keep:

  • Financial compliance and audit: A financial analyst querying regulatory exposure across multiple counterparties needs multi-hop reasoning across hundreds of documents. Every answer must be traceable to source documents for regulatory compliance. SowFin, a corporate finance company, uses TrustGraph to bring accurate, explainable insights to corporate finance.
  • Security operations: Huntbase uses TrustGraph to build Context Cores for SecOps, where AI hallucinations in threat detection are not just inconvenient but dangerous. Cybersecurity requires connecting events, metadata, and threat indicators across thousands of log entries with full provenance.
  • Medical and clinical research: Clinical informaticists analysing treatment interactions across patient comorbidities need graph traversal to connect drugs, conditions, contraindications, and outcomes across multiple clinical databases. Approximate similarity search is not acceptable when lives are involved.
  • Supply chain management: Tracing component dependencies multiple tiers deep requires genuine relationship traversal. “Which suppliers are affected if factory X in Shenzhen shuts down?” demands multi-hop graph queries that Vector RAG simply cannot do.
  • Legal document analysis: Connecting clauses across contracts, precedents across cases, and regulations across jurisdictions. Every connection must be verifiable and traceable.
  • Enterprise knowledge management: The “monograph” approach (a single unified graph across all your organisation’s knowledge) enables discovery of relationships across departments and domains that siloed systems miss. This is not unique to TrustGraph; any sufficiently connected knowledge graph achieves this, whether built with Neo4j, Microsoft GraphRAG, or TrustGraph.

When Context Graphs Are Overkill (Be Honest With Yourself)

Now for the part that most GraphRAG vendors would rather you did not read. Context graphs are genuinely overkill for a significant number of common AI use cases. Using one when you do not need one is like hiring a structural engineer to hang a picture frame.

  • Small datasets that fit in context: If your entire corpus is under 50 pages (roughly 40,000 tokens), skip RAG entirely. Stuff it all into the prompt. It costs $0.01 per query versus $0.05 for a RAG pipeline, deploys in a day versus four weeks, and the LLM can attend to all of it directly. No chunking, no embeddings, no graph. Simple prompt engineering wins.
  • General knowledge queries: Questions the LLM already knows the answer to (world history, common programming patterns, basic science) gain nothing from RAG. You are adding latency without improving accuracy.
  • Simple semantic lookup: “Find me documents similar to this one.” A vector store alone is faster, cheaper, and simpler. You do not need graph traversal for similarity search.
  • Ephemeral data with unstable entities: If your corpus changes hourly and the entities and relationships are not stable enough to maintain, the cost of continuous knowledge extraction will exceed the value. A vector store with frequent re-indexing may be more practical.
  • Speed-critical applications: Vector retrieval returns in milliseconds. Graph retrieval takes seconds, sometimes minutes for complex traversals. If sub-100ms latency is a hard requirement, graphs add unacceptable overhead.
  • Prototyping and MVPs: Vector RAG takes hours to set up. A full knowledge graph pipeline takes weeks. For a proof of concept, start with Vector RAG and upgrade to GraphRAG only when you have evidence that relationship-aware retrieval would improve your results.
  • Single-fact lookup: “What is the capital of France?” Both approaches achieve 94-95% accuracy on simple factual queries. The graph adds no value here.

The honest decision matrix: if your questions require understanding relationships between entities, connecting information across multiple documents, or producing explainable, auditable answers, you need a graph. But “need a graph” does not mean “need TrustGraph specifically”. A Neo4j instance with LangChain retrieval chains, Microsoft GraphRAG with community summaries, or LlamaIndex’s knowledge graph module may be simpler to deploy, cheaper to run, and sufficient for your use case. Evaluate the alternatives before committing to the heaviest solution. And if your data fits in a context window, you might not need RAG at all.
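
If it helps to see the decision matrix as logic rather than prose, here is a rough Python sketch. The `UseCase` fields and the `recommend` function are hypothetical, and the thresholds (40,000 tokens, a 2-second latency floor for graph queries) are simply lifted from the figures quoted in this article, not from any benchmark of your own system.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    corpus_tokens: int
    needs_multi_hop: bool = False
    needs_provenance: bool = False
    max_latency_ms: int = 5000


def recommend(use_case, context_budget_tokens=40_000, graph_min_latency_ms=2000):
    """Very rough triage between prompt stuffing, vector RAG and graph RAG."""
    if use_case.corpus_tokens <= context_budget_tokens:
        return "prompt stuffing"  # the whole corpus fits in the prompt
    needs_graph = use_case.needs_multi_hop or use_case.needs_provenance
    if needs_graph and use_case.max_latency_ms >= graph_min_latency_ms:
        return "graph RAG"
    return "vector RAG"


print(recommend(UseCase(corpus_tokens=30_000)))
print(recommend(UseCase(corpus_tokens=5_000_000, needs_multi_hop=True)))
print(recommend(UseCase(corpus_tokens=5_000_000)))
```

Note that the sketch falls back to vector RAG when a graph is wanted but the latency budget is too tight. In a real project that is a trade-off to argue about, not something a function should silently decide.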

The Neuro-Symbolic Promise (and Why This Actually Matters)

Daniel Davis makes a point in the demo that deserves its own section. The deep learning camp believed that enough data and compute would magically produce ground truth. Throw enough parameters at the problem and the model would learn to reason. The neuro-symbolic camp argued you would always need richer semantic structures because language is fundamentally ambiguous, and statistical pattern matching cannot resolve that ambiguity alone.

Context graphs are the practical vindication of the neuro-symbolic position. The LLM handles what it is good at: understanding natural language queries, interpreting intent, generating fluent responses. The graph handles what it is good at: storing precise facts, maintaining relationships, providing provenance, enabling deterministic traversal. Neither can solve the full problem alone. Together they produce something that neither approach could achieve independently.

This division of labour, as described in the TrustGraph demo, is not just a technical architecture decision. It is a philosophical one about what AI systems should and should not be trusted to do. LLMs should generate language. They should not be trusted as databases. Graphs should store and retrieve facts. They should not be expected to understand natural language. Each doing what it does best: that is the future of reliable AI systems.

Other Resources Worth Watching

The TrustGraph video is one perspective in a rapidly maturing field. These resources provide alternative viewpoints and competing approaches:

What to Check Right Now

  • Audit your current RAG pipeline’s failure modes. Ask it multi-hop questions that require connecting information across documents. If it fails or hallucinates, you have a graph-shaped problem.
  • Test the “same question, different words” scenario. Ask semantically equivalent questions and compare outputs. If the answers diverge wildly, your retrieval layer lacks semantic understanding.
  • Measure your chunk sizes. If you are chunking above 1,000 tokens, you are likely losing information to the lost-in-the-middle effect. Consider chunking at 500-1,000 tokens regardless of your context window size.
  • Evaluate whether you actually need a graph. Run the honest assessment: does your use case require multi-hop reasoning, explainability, or relationship traversal? If not, a well-tuned Vector RAG pipeline might be all you need.
  • Try TrustGraph locally. Run npx @trustgraph/config, choose Docker, and docker compose up -d. Load a few documents and explore the Workbench. You can have a working context graph in under an hour. It is free and open source (Apache 2.0).
  • Check your explainability requirements. If you are building for regulated industries (finance, healthcare, legal), ask whether you can trace every AI-generated answer back to its source documents. If the answer is no, context graphs are not optional; they are mandatory.
Dark minimalist checklist visualization with glowing cyan checkmarks on black, tech aesthetic
The real question is not whether context graphs are useful. It is whether your use case demands them.
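
For the chunk-size check in the list above, a fixed-size chunker is only a few lines of Python. This is a sketch: the whitespace split stands in for a real tokenizer, so the counts will not match what your LLM provider reports, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, max_tokens=1000):
    """Split a token list into consecutive chunks of at most `max_tokens` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


text = " ".join(["word"] * 2500)
tokens = text.split()  # whitespace split as a stand-in for a real tokenizer
chunks = chunk_tokens(tokens, max_tokens=1000)
print([len(chunk) for chunk in chunks])
```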

Video Attribution

This article is based on the TrustGraph demo “Context Graphs in Action” by Daniel Davis and Mark Adams. The video demonstrates TrustGraph 2’s context graph capabilities, explainability features, and source provenance using a London venues dataset. No marketing, no hype, just a real demo of real context graphs.


TrustGraph is open source and available at github.com/trustgraph-ai/trustgraph. Documentation at docs.trustgraph.ai. Community on Discord.

nJoy 😉

Google’s TurboQuant Just Halved the Cost of Running Every AI Model on Earth

Google just published a compression algorithm so efficient that it sent memory chip stocks tumbling across three continents in a single trading session. SK Hynix down 6%. Samsung down 5%. Micron bleeding for six days straight. Billions of dollars in market capitalisation evaporated because a team of researchers figured out a cleverer way to point at things. That is not a metaphor. That is literally what they did. Welcome to TurboQuant, the algorithm that halves the cost of running every large language model on the planet, and the wildest part is that Google just gave it away for free.

Dark abstract visualization of AI memory compression with polar coordinates, cyan and deep blue vectors converging on black background
TurboQuant: pointing instead of giving directions

What the KV Cache Actually Is (And Why Everyone Should Care)

Before we get into what Google built, you need to understand the bottleneck they solved. Every large language model, whether it is ChatGPT, Claude, Gemini, or Llama, runs on the transformer architecture. And transformers have this mechanism called attention, which is how the model figures out what words mean in context.

Here is a quick thought experiment. If I say “it was tired,” you have no idea what “it” refers to. A dog? A server? A metaphor for the state of modern JavaScript? But if I say “the animal didn’t cross the street because it was too tired,” suddenly “it” is loaded with meaning. It is an animal. It didn’t cross. It was tired. Your brain just did what transformers do: it looked at the surrounding words to figure out what one word actually means.

The problem is that transformers need to remember these relationships. Every time the model processes a token, it calculates how that token relates to every other token it has seen so far. These relationships get stored in what is called the key-value cache (KV cache). Think of it as a filing cabinet. Each “folder” has a label on the front (the key, which is a rough tag so the model can find it quickly) and detailed notes inside (the value, which is the actual rich meaning and relationships).

The catch? This filing cabinet grows linearly with context length. A 128K context window means 128,000 tokens worth of folders, each containing high-dimensional vectors stored at 16-bit precision. For a model like Llama 3.1 with 8 billion parameters, the KV cache alone can eat several gigabytes of GPU memory. For larger models with longer contexts, it becomes the single biggest memory bottleneck in the entire inference pipeline. Not the model weights. Not the activations. The KV cache.
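
A back-of-the-envelope calculation shows where the gigabytes come from. The Python sketch below assumes the commonly cited Llama 3.1 8B configuration (32 layers, 8 key-value heads, head dimension 128) and a single sequence at 16-bit precision; treat those numbers as assumptions to verify against the model card, and `kv_cache_bytes` as a hypothetical helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of the KV cache for one sequence: keys plus values across every layer."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_tokens=128_000)
print(f"{size / 2**30:.1f} GiB for one 128,000-token sequence")
```

The cost scales linearly in every argument, so doubling the context doubles the cache, and serving several long sequences at once multiplies it again.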

“Vector quantization is a powerful, classical data compression technique that reduces the size of high-dimensional vectors. This optimization addresses two critical facets of AI: it enhances vector search […] and it helps unclog key-value cache bottlenecks by reducing the size of key-value pairs.” — Google Research, TurboQuant Blog Post (March 2026)

Traditional approaches to compressing the KV cache use something called quantisation, which reduces the precision of the stored numbers. Instead of 16 bits per value, you use 8 bits, or 4 bits. The problem is that most quantisation methods need to store calibration constants (a zero point and a scale factor) for every small block of data. These constants have to be stored at full precision, which adds 1-2 extra bits per number. You are trying to compress, but your compression metadata is eating into your savings. It is like buying a wallet so expensive it defeats the purpose of saving money.
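
The overhead is simple arithmetic. This Python sketch assumes each block stores one 16-bit scale and one 16-bit zero point; the block sizes and the `effective_bits` helper are illustrative choices, not taken from any particular quantisation library.

```python
def effective_bits(value_bits, block_size, scale_bits=16, zero_point_bits=16):
    """Bits per stored number once per-block calibration constants are included."""
    return value_bits + (scale_bits + zero_point_bits) / block_size


for block_size in (16, 32, 64):
    bits = effective_bits(4, block_size)
    print(f"4-bit values, block of {block_size}: {bits:.2f} bits per number")
```

With blocks of 16 to 32 values, the constants add between 1 and 2 bits per number, which is the overhead described above.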

PolarQuant: The Art of Pointing Instead of Giving Directions

This is where Google’s insight gets genuinely elegant. Imagine you are standing in a city and someone asks you how to get to an office on the third floor of a building two blocks east and three blocks north. The standard approach is step-by-step Cartesian directions: go two blocks east, then three blocks north, then up three floors. Each dimension gets its own coordinate.

But there is another way. You could just point at the building and say “it is 500 feet away in that direction.” One angle, one distance. Same destination, less information to store.

That is PolarQuant. Instead of storing each dimension of a vector independently (the Cartesian way), it converts the vector into polar coordinates: a radius (how strong or important the data is) and an angle (what direction it points in, which encodes its meaning).

“Instead of looking at a memory vector using standard coordinates that indicate the distance along each axis, PolarQuant converts the vector into polar coordinates […] This is comparable to replacing ‘Go 3 blocks East, 4 blocks North’ with ‘Go 5 blocks total at a 37-degree angle’.” — Google Research, TurboQuant Blog Post
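
The two-dimensional version of that conversion fits in a few lines of Python. This is plain trigonometry, not PolarQuant itself, and the helper names are mine. One detail worth noticing: `atan2` measures the angle from the east axis, so "3 east, 4 north" comes out at about 53 degrees; the 37 degrees in the quote is the same direction measured from north.

```python
import math


def to_polar(east, north):
    """Convert block offsets to (distance, angle in degrees measured from east)."""
    distance = math.hypot(east, north)
    angle_from_east = math.degrees(math.atan2(north, east))
    return distance, angle_from_east


def to_cartesian(distance, angle_from_east_deg):
    """Inverse of to_polar: recover the east and north offsets."""
    theta = math.radians(angle_from_east_deg)
    return distance * math.cos(theta), distance * math.sin(theta)


distance, angle = to_polar(3, 4)
east, north = to_cartesian(distance, angle)
print(f"distance={distance:.1f}, from east={angle:.2f} deg, from north={90 - angle:.2f} deg")
print(f"round trip: east={east:.3f}, north={north:.3f}")
```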

Why is this so much more compressible? Here is the key mathematical insight. When you randomly rotate high-dimensional vectors (which is PolarQuant’s first step), something beautiful happens: the coordinates follow a concentrated Beta distribution. In plain English, the angles cluster tightly into a predictable, narrow range. They are not scattered randomly across all possible values. They bunch up.
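
You can watch the concentration happen with nothing but the standard library. Randomly rotating a fixed vector is equivalent to picking a uniformly random point on a sphere of the same radius, so the Python sketch below samples random unit vectors and measures how widely a single coordinate spreads. It illustrates the statistical effect only; it is not PolarQuant, and the sample counts are arbitrary.

```python
import math
import random
import statistics


def random_unit_vector(dim, rng):
    """Uniformly random direction: normalise a vector of independent Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)
spread = {}
for dim in (4, 64, 1024):
    first_coords = [random_unit_vector(dim, rng)[0] for _ in range(1000)]
    spread[dim] = statistics.pstdev(first_coords)
    print(f"dim={dim:>4}: coordinate std={spread[dim]:.4f}, 1/sqrt(dim)={1 / math.sqrt(dim):.4f}")
```

The spread shrinks like one over the square root of the dimension. The higher the dimension, the narrower the range the values occupy, and the fewer bits a fixed grid needs to cover it.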

This means the model no longer needs to perform expensive data normalisation. Traditional methods map data onto a “square” grid where the boundaries change constantly and need to be recalculated and stored for every block. PolarQuant maps data onto a fixed, predictable “circular” grid where the boundaries are already known. No calibration constants needed. No overhead.

Here is a concrete way to think about it. Imagine you are mapping people on a 2D chart where the X-axis is age and the Y-axis represents some semantic concept. In Cartesian coordinates, you store (x, y) for each person. In polar coordinates, you store (distance from origin, angle). The angle between “grandmother” and “grandfather” is predictable. The angle between “boy” and “girl” is predictable. These patterns are exploitable for compression precisely because they are so regular in high dimensions.

// Cartesian: store each dimension independently
// For a d-dimensional vector, you need d values at full precision
const cartesian = { x: 3.14159, y: 2.71828, z: 1.41421 };
// Plus quantisation overhead: zero_point + scale per block
// Adds 1-2 extra bits per value

// Polar (PolarQuant): store radius + angles
// After random rotation, angles are tightly concentrated
// No calibration constants needed
const polar = { radius: 4.3885, angle_1: 0.7133, angle_2: 0.3281 };
// (radians; angle_1 = azimuth in the x-y plane, angle_2 = elevation above it)
// The angles live in a predictable, narrow range
// Quantise directly onto a fixed grid -- zero overhead
Dark technical diagram showing Cartesian to polar coordinate transformation, amber vectors on deep blue grid, black background
From step-by-step directions to a single compass bearing

QJL: The 1-Bit Error Checker That Makes It Lossless

PolarQuant does the heavy lifting. It is responsible for the bulk of the compression. But no compression is perfect, and PolarQuant leaves behind a tiny residual error. This is where the second component comes in, and it is arguably just as clever.

The Quantised Johnson-Lindenstrauss (QJL) algorithm takes the small error left over from PolarQuant and squashes it down to a single sign bit per value: +1 or -1. That is it. One bit. The technique is based on the Johnson-Lindenstrauss lemma, a foundational result in dimensionality reduction that says you can project high-dimensional data into a much lower-dimensional space whilst preserving the distances between points.

What QJL does specifically is eliminate bias in the inner product estimation. This is critical because attention scores in transformers are computed as inner products (dot products) between query and key vectors. If your compression introduces a systematic bias in these dot products, the model’s attention mechanism starts paying attention to the wrong things. It is like having a compass that is consistently off by 3 degrees; every direction you follow drifts further from where you actually want to go.

QJL uses a special estimator that balances a high-precision query vector against the low-precision compressed data. The result is an unbiased inner product estimate with zero memory overhead. The 1-bit correction is so small it is essentially free to store, yet it removes the systematic bias that PolarQuant’s residual error would otherwise introduce into attention scores.
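
Here is a toy version of a sign-bit estimator in Python, to show why one bit per projection can still give an unbiased dot product. It is a sketch under my own assumptions, not Google's implementation: it is applied to a whole key vector rather than to PolarQuant's residual, it stores the key's norm alongside the sign bits, and it draws a fresh Gaussian projection per sample instead of reusing a fixed matrix. All names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sign(x):
    return 1.0 if x >= 0 else -1.0


def estimate_inner_product(query, key, num_projections, rng):
    """Estimate <query, key> using only the sign of each random projection of the key."""
    key_norm = math.sqrt(dot(key, key))
    total = 0.0
    for _ in range(num_projections):
        s = [rng.gauss(0.0, 1.0) for _ in key]
        # The query is projected at full precision; the key contributes one sign bit.
        total += dot(s, query) * sign(dot(s, key))
    # sqrt(pi/2) undoes the shrinkage from keeping only the sign (E|g| = sqrt(2/pi)).
    return math.sqrt(math.pi / 2) * key_norm * total / num_projections


rng = random.Random(0)
query = [0.5, -0.2, 0.1, 0.7]
key = [0.4, 0.1, -0.3, 0.6]
exact = dot(query, key)
estimate = estimate_inner_product(query, key, 20_000, rng)
print(f"exact={exact:.3f}, sign-bit estimate={estimate:.3f}")
```

The estimate is noisy for a handful of projections but centred on the true value, which is the property that matters for attention scores: noise averages out, bias does not.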

// Stage 1: PolarQuant (main compression)
// 16-bit KV cache -> ~3 bits per channel
// Does most of the heavy lifting
// Tiny residual error remains

// Stage 2: QJL (error correction)
// Takes the residual from PolarQuant
// Reduces it to 1 sign bit (+1 or -1) per value
// Eliminates bias in attention score computation
// Memory overhead: essentially zero

// Combined: TurboQuant
// ~3-bit KV cache with no measurable accuracy loss on the reported benchmarks
// No retraining, no fine-tuning, no calibration
// Just swap it in and the model stays identical

Together, PolarQuant + QJL = TurboQuant. The compression engine and its error checker. The paper proves that TurboQuant achieves distortion rates within a factor of approximately 2.7 of the information-theoretic lower bound, the absolute mathematical limit of how well any quantiser could ever perform. In the language of information theory, this is approaching the Shannon limit. There is not much room left to improve.

“We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant (approx 2.7) factor.” — Zandieh et al., TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate, arXiv:2504.19874

The Numbers: What TurboQuant Actually Delivers

Theory is nice, but what actually happened when they tested this on real hardware with real models? Google ran TurboQuant through a gauntlet of benchmarks on open-source models (Gemma, Mistral, Llama) running on NVIDIA H100 GPUs. The results are not incremental. They are a step change.

The Headline Numbers

  • 6x KV cache memory reduction. A cache that previously required 16 bits per value now needs under 3 bits. On a model that was using 6 GB of KV cache memory, you now need roughly 1 GB.
  • Up to 8x attention speedup. The attention computation (the most expensive part of inference) runs up to 8 times faster on H100 GPUs. This does not mean the entire model is 8x faster, but the bottleneck operation is.
  • Zero accuracy loss. At 3.5 bits per channel, TurboQuant achieves what the authors call “absolute quality neutrality.” On the reported benchmarks, the compressed model scores the same as the uncompressed model. Even at 2.5 bits per channel, degradation is marginal.
  • No retraining required. This is not a new model architecture. There is no fine-tuning step, no calibration dataset, no model-specific tuning. You slot TurboQuant into the inference pipeline and the existing model just works better.
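
The memory figures in the list above follow directly from the bit widths. The Python sketch below does the arithmetic for a few of the bit widths mentioned; the 6 GB starting point is the article's example and the helper names are mine.

```python
def compression_ratio(original_bits, compressed_bits):
    """How many times smaller the cache gets when each value shrinks."""
    return original_bits / compressed_bits


def compressed_size_gb(original_gb, original_bits, compressed_bits):
    return original_gb / compression_ratio(original_bits, compressed_bits)


for bits in (2.5, 3.0, 3.5):
    ratio = compression_ratio(16, bits)
    size = compressed_size_gb(6, 16, bits)
    print(f"{bits} bits per value: {ratio:.2f}x smaller, 6 GB -> {size:.2f} GB")
```

A flat 6x reduction from 16 bits works out to roughly 2.7 bits per value, which is why the headline says "under 3 bits". The exact ratio you get depends on which bit width you choose to run at.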

Benchmark Breakdown

The team tested across five major long-context benchmarks:

  • LongBench — question answering, summarisation, code generation across diverse tasks
  • Needle in a Haystack — finding one specific piece of information buried in massive documents
  • ZeroSCROLLS — long-document understanding tasks
  • RULER — synthetic benchmarks that stress-test context window utilisation
  • L-Eval — comprehensive evaluation of long-context capabilities

Across all of them, TurboQuant achieved perfect downstream results whilst reducing KV cache memory by at least 6x. PolarQuant alone was nearly lossless. With QJL added on top, it became mathematically unbiased.

Dark performance chart showing compression ratios and speedup metrics, cyan bars on dark grid, minimal tech aesthetic on black
6x compression, 8x speedup, zero loss. The rare triple.

The Stock Market Bloodbath (And Why Analysts Say Calm Down)

Google published TurboQuant on 24 March 2026. Within 48 hours, billions of dollars had been wiped off memory chip stocks across three continents.

The logic seemed straightforward: if AI models need 6x less memory, companies that make memory chips are going to sell fewer chips. Right?

The Damage Report

  • SK Hynix (South Korea) — down 6.23%
  • Samsung (South Korea) — down nearly 5%
  • Kioxia (Japan) — down nearly 6%
  • Micron (USA) — down over 20% across six trading sessions
  • SanDisk (USA) — down 11%
  • Western Digital (USA) — down 6.7%
  • Seagate (USA) — down 8.5%

The broader Korean KOSPI index fell as much as 3%. Matthew Prince, CEO of Cloudflare, called it “Google’s DeepSeek moment,” referencing the January 2025 DeepSeek sell-off that wiped nearly a trillion dollars off the Nasdaq.

But here is the thing. Analysts are not panicking. In fact, most of them are telling investors to buy the dip.

Ray Wang, a memory analyst at SemiAnalysis, told CNBC:

“When you address a bottleneck, you are going to help AI hardware to be more capable. And the training model will be more powerful in the future. When the model becomes more powerful, you require better hardware to support it.” — Ray Wang, SemiAnalysis, via CNBC (March 2026)

Ben Barringer, head of technology research at Quilter Cheviot, was even more direct: “Memory stocks have had a very strong run and this is a highly cyclical sector, so investors were already looking for reasons to take profit. The Google TurboQuant innovation has added to the pressure, but this is evolutionary, not revolutionary. It does not alter the industry’s long-term demand picture.”

For context, memory stocks had been on an absolute tear before this. Samsung was up nearly 200% over the prior year. SK Hynix and Micron were up over 300%. A correction was arguably overdue, and TurboQuant gave skittish investors the excuse they needed.

Jevons Paradox: Why Efficiency Makes You Use More, Not Less

The most important framework for understanding TurboQuant’s long-term impact is not computer science. It is economics. Specifically, a concept from 1865.

In The Coal Question, economist William Stanley Jevons documented something counterintuitive: when James Watt’s innovations made steam engines dramatically more fuel-efficient, Britain’s coal consumption did not fall. It increased tenfold. The efficiency gains lowered coal’s effective cost, which made it economical for new applications and industries. The per-unit savings were overwhelmed by the explosion in total usage.

This is the Jevons paradox, and it has been playing out in AI with striking precision. Between late 2022 and 2025, the cost of running large language models collapsed, by some estimates roughly a thousandfold. On the figures quoted here, GPT-4-equivalent performance dropped from $20 to $0.40 per million tokens, a 50x fall. Did people use fewer tokens? No. Enterprise generative AI spending jumped from $11.5 billion in 2024 to $37 billion in 2025, roughly 3.2x the previous year’s total (an increase of about 220%). When OpenAI dropped API prices by 10x, API calls grew 100x.

The same pattern will almost certainly play out with TurboQuant. If it suddenly costs half as much to run a frontier model, companies will not pocket the savings and go home. They will run bigger models, longer contexts, more agents, more concurrent sessions. Workloads that were previously too expensive become viable. The 200K-context analysis that cost too much to justify? Now it makes business sense. The always-on AI assistant that was too expensive to run 24/7? Now it is affordable.
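The arithmetic behind that claim is simple enough to sketch. The snippet below is an illustration only: the function names are made up for this article, and the 3x usage growth in the second example is a hypothetical number, not a forecast.

```python
def spend_multiplier(price_drop: float, usage_growth: float) -> float:
    """Total spend relative to before, given a price cut and a usage increase.

    price_drop=10 means each unit now costs 1/10 of what it did.
    usage_growth=100 means 100x as many units are consumed.
    """
    if price_drop <= 0 or usage_growth < 0:
        raise ValueError("price_drop must be positive, usage_growth non-negative")
    return usage_growth / price_drop


def breakeven_usage_growth(price_drop: float) -> float:
    """Usage growth at which total spend is exactly unchanged after a price cut."""
    return price_drop


if __name__ == "__main__":
    # The API example from the text: prices fall 10x, calls grow 100x.
    print(spend_multiplier(price_drop=10, usage_growth=100))  # 10.0
    # Costs halve (the article's rough TurboQuant figure), usage grows a hypothetical 3x.
    print(spend_multiplier(price_drop=2, usage_growth=3))     # 1.5
```

If costs halve, usage only has to double for total spend to stay flat. Anything beyond that is the Jevons effect.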

Morgan Stanley’s analysts made exactly this argument, citing the Jevons paradox to characterise the long-term impact on storage demand as “neutral to positive.” The market overpriced the short-term headline and underpriced the second-order effects.

What This Means for Anyone Using AI Right Now

Let us get concrete about who benefits and how.

Enterprises Running Models at Scale

If you are an enterprise running large language models in production, TurboQuant translates roughly to a 50% reduction in inference costs. This is not a marginal optimisation. It applies to every prompt, every API call, every chatbot response, every agentic workflow. In practice that means cheaper API calls, faster responses, more requests per second on the same hardware, and longer context windows without hitting memory limits.

Context Windows Get Bigger on the Same Hardware

If a GPU was maxing out at a certain context length because the KV cache filled the available memory, TurboQuant effectively multiplies the available context by up to 6x. A model that topped out at 32K tokens on a given GPU could now handle roughly 192K tokens, assuming the KV cache was the only thing holding it back. This is significant for code analysis, legal document review, medical record processing, and any workload where more context means better output.
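To see where a number like 32K comes from, here is a rough sizing sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) and the 4 GiB KV budget are hypothetical values chosen for illustration. They describe no particular model or GPU, and the helper names are my own.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: float = 2.0) -> float:
    """Bytes of KV cache per token: one key and one value vector per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_context_tokens(kv_budget_bytes: float, bytes_per_token: float,
                       compression: float = 1.0) -> int:
    """Largest single-sequence context that fits in the memory left for the KV cache."""
    return int(kv_budget_bytes * compression // bytes_per_token)


if __name__ == "__main__":
    # Hypothetical model shape, 16-bit (2-byte) cache entries.
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    budget = 4 * 1024**3  # assume 4 GiB is left over for the KV cache

    print(max_context_tokens(budget, per_token))                   # 32768
    print(max_context_tokens(budget, per_token, compression=6.0))  # 196608
```

With these made-up numbers the uncompressed cache tops out at 32K tokens and the 6x-compressed one at 192K. Real deployments also share that budget across concurrent requests, which this sketch ignores.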

The Anthropic Mythos Situation

Anthropic’s upcoming Mythos model has been described as “very expensive for us to serve, and will be very expensive for our customers to use.” Early pricing estimates suggest 2-5x the cost of Claude Opus. TurboQuant could meaningfully change that calculus. If inference costs drop by half, a model that was borderline unviable for production use cases suddenly becomes economically defensible. Whether Anthropic adopts TurboQuant specifically or implements similar techniques, the pressure to do so just became enormous.

Individual Power Users

Andrej Karpathy, former Tesla AI lead and OpenAI researcher, recently said in an interview that he gets “nervous when I have subscription left over” because “that just means I haven’t maximised my token throughput.” He now runs multiple AI agents in parallel across separate repository branches, treating token consumption as his primary productivity constraint. NVIDIA CEO Jensen Huang has said he expects employees earning $500,000 to use $250,000 worth of tokens. If TurboQuant halves the cost of those tokens, the effective value of every subscription doubles overnight.

Same hardware, twice the output. The new math of AI inference.

Google’s Quiet Giant Move: Why They Published Instead of Hoarding

There is a pattern here that deserves attention. In 2017, a team at Google (Vaswani et al.) published “Attention Is All You Need”, the paper that introduced the transformer architecture. That single paper became the foundation for GPT, Claude, Gemini, Llama, Mistral, and essentially every large language model in existence. Most of Google’s competitors are built on Google’s published research.

They did it again with TurboQuant. Google could have kept this internal. They could have quietly deployed it across their infrastructure, pocketed the 50% cost savings on Gemini inference, and used the competitive advantage to undercut everyone else on pricing. That is the standard playbook. But they published it. The paper is on arXiv. The blog post explains the technique in detail. Community implementations appeared on PyPI and GitHub within days.

This is not altruism. Google benefits enormously from being the company that publishes foundational research, and they have the infrastructure to move fastest on their own inventions. But the effect is real. Every company running AI models, every open-source project, every independent developer benefits from this work being public.

As Martin Kleppmann writes in Designing Data-Intensive Applications, the most impactful systems are often the ones that reduce the cost of doing something by an order of magnitude, because they do not just make existing use cases cheaper; they create entirely new categories of application that were previously uneconomical. TurboQuant is precisely that kind of step change.

When TurboQuant Does Not Apply (The Honest Bit)

No article from this site would be credible without the caveats section, so here they are:

Case 1: Training Is Untouched

TurboQuant is an inference optimisation. It compresses the KV cache, which is used during inference (when the model generates responses). It does not reduce the cost of training a model. The multi-billion-dollar GPU clusters that companies like Google, OpenAI, and Meta use to train frontier models are not affected. Training has its own bottlenecks (gradient accumulation, all-reduce communication, activation memory), and TurboQuant addresses none of them.

Case 2: It Only Compresses the KV Cache

The 6x memory reduction applies specifically to the KV cache, not to the model weights, not to the activations, and not to the total GPU memory usage. For many inference workloads, the KV cache is the dominant memory consumer, especially at long context lengths. But for short prompts on large models, the model weights themselves might be the bottleneck. TurboQuant helps a lot in the first scenario and less in the second.
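A quick way to tell which scenario you are in is to work out the saving on combined memory. This sketch ignores activations and framework overhead. The function name and the gigabyte figures in the examples are hypothetical, picked only to contrast the two cases.

```python
def total_memory_saving(weights_gb: float, kv_cache_gb: float,
                        kv_compression: float = 6.0) -> float:
    """Fraction of (weights + KV cache) memory saved when only the KV cache shrinks."""
    before = weights_gb + kv_cache_gb
    after = weights_gb + kv_cache_gb / kv_compression
    return 1.0 - after / before


if __name__ == "__main__":
    # Short prompts on a large model: weights dominate, so the saving is small.
    print(f"{total_memory_saving(weights_gb=16, kv_cache_gb=2):.1%}")   # 9.3%
    # Long contexts: the KV cache dominates, so the saving is large.
    print(f"{total_memory_saving(weights_gb=16, kv_cache_gb=48):.1%}")  # 62.5%
```

Same model, same 6x cache compression, very different outcome. The ratio of cache to weights decides how much of the headline number you actually see.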

Case 3: You Still Need GPUs

TurboQuant makes existing hardware more efficient. It does not eliminate the need for GPUs (or TPUs). You still need compute to run models. What changes is how much work each GPU can do. Think of it as improving fuel efficiency in a car: you still need the car, and you still need fuel, but you go further on each tank.

Case 4: The 8x Speedup Is for Attention, Not End-to-End

The headline “8x speedup” refers to the attention computation specifically, not the total inference time. A full model forward pass includes many other operations (feedforward layers, layer norms, embedding lookups). The end-to-end speedup depends on what fraction of total inference time is spent on attention. For long-context workloads, it is a large fraction. For short prompts, less so.
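This is Amdahl’s law, and a few lines make the point concrete. The attention fractions below are illustrative guesses, not measurements from any real workload. Profile your own stack to get the real figure.

```python
def end_to_end_speedup(attention_fraction: float,
                       attention_speedup: float = 8.0) -> float:
    """Amdahl's law: overall speedup when only the attention share of runtime gets faster."""
    if not 0.0 <= attention_fraction <= 1.0:
        raise ValueError("attention_fraction must be between 0 and 1")
    remaining = (1.0 - attention_fraction) + attention_fraction / attention_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical shares of inference time spent in attention.
    for fraction in (0.1, 0.5, 0.8):
        print(f"{fraction:.0%} attention -> {end_to_end_speedup(fraction):.2f}x overall")
    # 10% attention -> 1.10x overall
    # 50% attention -> 1.78x overall
    # 80% attention -> 3.33x overall
```

Even when attention is 80% of the runtime, an 8x attention speedup gives a little over 3x end to end. The untouched 20% sets the ceiling.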

How This Actually Gets Deployed

One of TurboQuant’s strongest properties is how easy it is to adopt. Unlike techniques that require retraining or fine-tuning, TurboQuant is data-oblivious: it works without any dataset-specific preprocessing. The deployment path looks like:

  1. No model changes. The model weights, architecture, and training are all untouched. TurboQuant operates entirely at the inference layer.
  2. Swap the KV cache quantiser. Replace the existing KV cache storage with TurboQuant’s polar coordinate quantisation. This is a software change in the inference engine.
  3. Choose your bit-width. At 3.5 bits per channel, you get zero accuracy loss. At 2.5 bits per channel, you get even more compression with marginal degradation. Pick based on your quality requirements.
  4. Deploy. Run the same prompts, expect the same results, use roughly 6x less KV cache memory, and compute attention up to 8x faster. The exact memory ratio depends on the bit-width you chose in step 3.
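For the bit-width choice in step 3, a back-of-the-envelope ratio helps. This is a naive sketch under stated assumptions: it compares against a 16-bit baseline cache and ignores any per-block metadata. The published headline figures may rest on a different baseline or accounting, so treat the output as a rough guide rather than a reproduction of the paper’s numbers. The function name is my own.

```python
def kv_compression_ratio(bits_per_channel: float, baseline_bits: float = 16.0) -> float:
    """Naive compression ratio: baseline bits per value divided by quantised bits per value."""
    if bits_per_channel <= 0:
        raise ValueError("bits_per_channel must be positive")
    return baseline_bits / bits_per_channel


if __name__ == "__main__":
    # The two bit-widths mentioned in the deployment steps, against an assumed 16-bit cache.
    for bits in (3.5, 2.5):
        print(f"{bits} bits per channel -> {kv_compression_ratio(bits):.1f}x smaller")
    # 3.5 bits per channel -> 4.6x smaller
    # 2.5 bits per channel -> 6.4x smaller
```

Measure the actual cache footprint in your own inference engine before relying on any single multiplier.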

Community implementations have already appeared. A pip-installable turboquant package is on PyPI. Third-party implementations in MLX (for Apple Silicon) and Triton (for custom GPU kernels) were published within days of the announcement. The official Google code is expected in Q2 2026.

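# Community implementation (illustrative: the turboquant API shown here has
# not been checked against the published package, so read its README first)
# pip install turboquant transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantConfig, apply_turboquant

model_name = "your-org/your-model"  # placeholder: any HuggingFace causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

config = TurboQuantConfig(
    bits_per_channel=3.5,   # Reported as zero accuracy loss
    enable_qjl=True,        # QJL error correction stage
)

# Apply to the HuggingFace model's KV cache
model = apply_turboquant(model, config)

# Inference runs as normal -- same API, same outputs
# But KV cache is now roughly 6x smaller and attention is up to 8x faster
input_ids = tokenizer("Hello", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=512)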

What to Check Right Now

  • Audit your KV cache memory usage. If you are running models in production, profile how much GPU memory your KV cache consumes. If it is a significant fraction of total memory (common for long-context workloads), TurboQuant could give you an immediate and substantial improvement.
  • Watch for framework integration. Keep an eye on vLLM, TensorRT-LLM, and HuggingFace TGI for native TurboQuant support. Once it lands in these frameworks, adoption becomes a config flag.
  • Re-evaluate your context length limits. If you capped context length because of memory constraints, TurboQuant may let you lift those caps on existing hardware. Longer context often means better output quality.
  • Read the actual paper. The TurboQuant paper (arXiv:2504.19874) and the PolarQuant paper (arXiv:2502.02617) are both well-written and surprisingly accessible. The Google Research blog post is an excellent entry point if you want the intuition without the proofs.
  • Don’t panic-sell memory stocks based on headlines. The Jevons paradox has held for most major compute efficiency improvements so far. Efficiency tends not to reduce demand; it creates it. The analysts calling this “evolutionary, not revolutionary” for the memory industry are probably right.
  • Try it yourself. The community turboquant PyPI package and the turboquant-pytorch GitHub repo let you test it on your own models today.

Video Attribution

This article was inspired by Wes Roth’s excellent breakdown of TurboQuant.


nJoy 😉