Every few years, something happens in computing that quietly reshapes everything around it. The UNIX pipe. HTTP. REST. The transformer architecture. And now, in 2026, the Model Context Protocol. If you build software and you haven’t internalised MCP yet, this is your moment. This course will fix that – thoroughly.
The MCP ecosystem: hosts, clients, and servers unified under a single open protocol.
What This Course Is
This is a full university-grade course on the Model Context Protocol – the open standard, published by Anthropic and now maintained by a broad coalition, that lets AI models talk to tools, data sources, and services in a structured, secure, and interoperable way. Think of it as HTTP for AI context: before HTTP, every web server spoke its own dialect; after HTTP, the whole web could talk to each other. MCP does the same thing for the agentic AI layer.
The course runs 53 lessons across 12 Parts, from zero to enterprise. Part I gives you the mental model and the first working server in under an hour. Part XII has you building a full production MCP platform with a registry, an API gateway, and multi-agent orchestration. Everything in between is ordered by dependency – no lesson assumes knowledge that hasn’t been covered yet.
“MCP provides a standardized way for applications to: build composable integrations and workflows, expose tools and capabilities to AI systems, share contextual information with language models.” – Model Context Protocol Specification, Anthropic
All code is in plain Node.js 22 ESM – no TypeScript, no compilation step, no tsconfig to wrestle with. You run node server.js and it works. The point is to teach MCP, not the type system. Where types genuinely help (complex tool schema shapes), JSDoc hints appear inline. Everywhere else, the code is clean signal.
Who This Is For
The course was designed for two audiences who need the same rigour but come at it differently:
University students – third or fourth year CS, AI, or software engineering. You know how to write async JavaScript. You’ve used an LLM API. You want to understand the architecture that makes production agentic systems work, not just the vibes.
Professional engineers and architects – you’re building AI-powered products or evaluating MCP for your organisation. You need the protocol internals, the security model, the enterprise deployment patterns, and a clear comparison of how OpenAI, Anthropic Claude, and Google Gemini each implement the standard differently.
If you’re new to programming, start with Node.js fundamentals first. If you’re already shipping LLM features to production, you can start from Part IV (provider integrations) and backfill the protocol theory as needed.
Twelve parts. Fifty-three lessons. Ordered strictly by dependency.
The Technology Stack
Every lesson uses the same stack throughout, so you never lose time context-switching:
Runtime: Node.js 22+ with native ESM ("type": "module")
MCP SDK: @modelcontextprotocol/sdk v1 stable (v2 features noted as they ship)
Schema validation: zod v4 for tool input schemas
HTTP transport: @modelcontextprotocol/express or Hono adapter
OpenAI: openai latest – tool calling with GPT-4o and o3
Anthropic: @anthropic-ai/sdk latest – Claude 3.5/3.7 Sonnet
Gemini: @google/generative-ai latest – Gemini 2.0 Flash and 2.5 Pro
Native Node.js extras: --env-file for secrets, node:test for tests
No framework lock-in beyond the MCP SDK itself. All HTTP adapter code works with plain Node.js http if you prefer – the adapter packages are convenience wrappers, not requirements.
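For concreteness, here is what that stack looks like as a package.json – a sketch, with version ranges illustrative rather than pinned:
{
  "name": "mcp-course-examples",
  "type": "module",
  "engines": { "node": ">=22" },
  "scripts": {
    "start": "node --env-file=.env server.js",
    "test": "node --test"
  },
  "dependencies": {
    "@modelcontextprotocol/sdk": "^1.0.0",
    "zod": "^4.0.0",
    "openai": "latest",
    "@anthropic-ai/sdk": "latest",
    "@google/generative-ai": "latest"
  }
}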
Course Curriculum
Fifty-three lessons across twelve parts. Links will go live as each lesson publishes.
The complete stack: Node.js 22 ESM, the MCP SDK, Zod schemas, and all three major LLM providers.
How the Lessons Are Written
Each lesson is designed to be self-contained and longer than comfortable. The goal is that a reader who sits down with the article and a terminal open will finish knowing how to do the thing, not just knowing that the thing exists. That means:
Named failure cases – every lesson covers what goes wrong, specifically, with the exact code that triggers it and the exact fix. Learning from bad examples sticks better than learning from good ones.
Official source quotes – every lesson cites the MCP specification, SDK documentation, or relevant RFC directly. The wording is exact, not paraphrased. The link goes to the actual source document.
Working code – every code block runs. It is tested against the actual SDK version noted at the top of the lesson. Nothing is pseudo-code unless explicitly labelled.
Balance – where a technique has valid alternatives, the lesson says so. A reader should leave knowing when to use the thing taught, and when not to.
“The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals.” – MCP Specification, Protocol Conventions
The course is sourced from over 77 videos across six major MCP playlists from channels including theailanguage, Microsoft Developer, and CampusX – then substantially expanded with code, official spec references, and architectural analysis that the videos don’t cover. The videos are the floor, not the ceiling.
What to Check Right Now
Verify Node.js 22+ – run node --version. If you’re below 22, install via nodejs.org or nvm install 22.
Install yt-dlp (optional, for running the research tooling) – brew install yt-dlp or pip install yt-dlp.
Get API keys before Part IV – OpenAI, Anthropic, and Google AI Studio keys. Store them in .env files, never in code – see the sketch below.
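That last point deserves a sketch. Node reads env files natively these days, so there is no dotenv dependency (file and variable names here are illustrative):
// .env (git-ignored, never committed):
//   OPENAI_API_KEY=sk-...
//   ANTHROPIC_API_KEY=sk-ant-...
//   GOOGLE_API_KEY=...
// Start with: node --env-file=.env server.js

// server.js -- read the key at runtime and fail fast if it is missing
const apiKey = process.env.OPENAI_API_KEY;
if (!apiKey) throw new Error('OPENAI_API_KEY not set -- did you forget --env-file=.env?');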
Your RAG pipeline is lying to you. Not maliciously, of course, but with the quiet confidence of a student who memorised the textbook’s index but never read a chapter. You feed it documents, it chunks them, embeds them, and when you ask a question it retrieves whichever chunks look “sort of similar” and hopes the LLM can stitch together a coherent answer. Sometimes it works. Sometimes it tells you Tokyo has 36 million people because it averaged two contradictory chunks. And you have no way to know which answer is real, because Vector RAG has no concept of “real”. It only knows “similar”. Context graphs are what happens when you decide similarity is not enough, and you want your AI to actually understand the relationships between things. TrustGraph just shipped a demo that shows exactly what that looks like in practice, and it is worth paying attention to.
Context graphs: where every node knows its neighbours and can prove where it got its information.
What Context Graphs Actually Are (and Why They Are Not Just Knowledge Graphs With a Rebrand)
A context graph is a knowledge graph that has been specifically engineered for consumption by AI models. That sentence sounds like marketing, so let us unpack it. A traditional knowledge graph stores millions of entities and relationships, optimised for human querying and data warehousing. Brilliant for analysts running SPARQL queries. Terrible for an LLM with a context window that starts forgetting things after a few thousand tokens.
Context graphs solve this by dynamically extracting focused subgraphs based on query relevance. Instead of dumping the entire graph into the prompt, you extract only the entities and relationships that matter for this specific question, scored by relevance, annotated with provenance, and formatted to minimise token waste. TrustGraph’s own documentation claims a 70% token reduction in their structured-versus-prose comparison. That number is plausible for the specific example they show (a simple entity lookup), but it is a vendor benchmark, not an independent evaluation, and the savings will vary dramatically depending on query complexity, graph density, and how much context the LLM actually needs.
“Context graphs are knowledge graphs specifically engineered and optimized for consumption by AI models. They extend traditional knowledge graphs by incorporating AI-specific optimizations like token efficiency, relevance ranking, provenance tracking, and hallucination reduction.” — TrustGraph, Context Graphs Guide
Think of the distinction this way. A knowledge graph is your entire library. A context graph is the specific stack of books your librarian pulls when you ask a particular question, each one bookmarked at the relevant page, with a note explaining why it was selected. The librarian remembers which shelf each book came from, when it was last updated, and how confident she is that the information is still correct. That is what provenance tracking and relevance scoring give you.
Here is the structural difference in compact form:
// Traditional knowledge graph: everything, all at once
{
entities: [/* millions */],
relationships: [/* tens of millions */]
}
// Context graph: query-specific, AI-optimised
{
query: "Who leads TechCorp?",
entities: [
{ name: "Alice Johnson", role: "CEO", relevance: 0.95 },
{ name: "TechCorp", industry: "Enterprise Software", relevance: 0.92 }
],
relationships: [
{ from: "Alice Johnson", to: "TechCorp", type: "leads", relevance: 0.90 }
],
metadata: { tokensUsed: 350, confidenceScore: 0.94, sources: ["hr_database"] }
}
The verbose natural-language equivalent of that context graph would cost 150 tokens. The structured version costs 45. Same information, a third of the price. As Martin Kleppmann writes in Designing Data-Intensive Applications, the way you structure your data determines what questions you can efficiently answer. Context graphs are structured specifically to answer LLM questions efficiently.
The TrustGraph Demo: London Pubs, Craft Beer, and Why Semantics Matter
The video “Context Graphs in Action” by TrustGraph co-founders Daniel Davis and Mark Adams is a 27-minute live demo. No slides. No marketing deck. They built a context graph from data about London pubs, restaurants, and event spaces, then demonstrated something deceptively simple that reveals the entire value proposition of this technology.
They asked two questions that any human would consider identical:
“Where can I drink craft beer?”
“Can you recommend a pub which serves craft beer?”
Both questions returned the same answer. But when they expanded the explainability trace, the paths through the graph were completely different. The first question, being open-ended, pulled in concepts from beer gardens, festivals, events, bars, cafes, and dozens of other venue types. The second question, with the word “pub” constraining the search, produced a far narrower traversal. The grounding concepts were different. The subgraph was different. The reasoning path was different. Only the final answer happened to converge.
This is the central insight the demo drives home: two questions that feel identical to a human are semantically distinct to a machine, and context graphs let you see exactly how and why. As Daniel puts it with characteristic bluntness: “If you ask a stupid question, you might get a stupid response.” The explainability trace lets you work backwards from a bad answer and determine whether the fault lay with the query, the data, or the retrieval path.
Same answer, wildly different reasoning paths. The explainability trace reveals what Vector RAG hides.
What the Workbench Actually Shows
The demo walks through TrustGraph’s Workbench interface (accessible at localhost:8888 after deployment). Here is what they demonstrated:
Document ingestion: Plain text and PDF documents about London venues are uploaded through the Library page and processed through a GraphRAG flow. TrustGraph chunks the documents, extracts entities and relationships, generates vector embeddings, and builds the knowledge graph automatically.
Vector search entry points: Searching for “Bermondsey” returns semantically similar terms. Clicking a result reveals the fabric of the graph: Bermondsey tube station connects to the Jubilee line, which has a type “transport line”. You can navigate relationships in 3D space.
3D graph visualisation: Interactive three-dimensional exploration of graph nodes and edges. Not intended for end users (Daniel jokes it would “send everybody over the edge insane”), but invaluable for understanding graph structure during development.
Explainability traces: Every query records a full reasoning trace. You can see: the original query, which concepts were extracted, which graph nodes matched, which edges were traversed, why each piece of evidence was selected (with the LLM’s reasoning), and the final synthesis. All traceable back to source documents.
Source provenance: Every fact in the graph links back to the specific document chunk it was extracted from. You can verify: where did this information come from? When was it ingested? Is it out of date? Do we trust this source?
The Ontology Question
Mark Adams demonstrates both approaches: schema-free extraction (GraphRAG) where the LLM discovers relationships freeform, and ontology-driven extraction (OntologyRAG) where a predefined schema forces precision. For the London venues demo, the ontology defines classes like “atmosphere” (cozy, creative, community spirit), “city”, “neighbourhood”, “event”, and constrains the relationships the graph will accept.
The result with ontologies is significantly more precise. Without an ontology, the LLM sometimes creates duplicate relationships with different names for the same concept. With an ontology, you control the vocabulary, and precision goes up. As Mark explains: “We force it into a much more precise structure.”
TrustGraph sits firmly in the RDF ecosystem rather than the property graph world (Neo4j and similar). The rationale: RDF supports reification (attaching metadata to edges themselves), multi-language representations, and the OWL/SKOS ontology standards natively. These features are essential for explainability and provenance tracking.
But let us be honest about the trade-offs. RDF comes with real costs. SPARQL is notoriously harder to learn than Cypher (Neo4j’s query language). OWL ontologies require domain experts to design and maintain, and they become a governance burden as your data evolves. Property graphs with Neo4j or Memgraph are simpler to reason about, faster for most traversal patterns, and have much larger developer ecosystems. TrustGraph’s choice of RDF is defensible for provenance-heavy enterprise use cases, but it is not the only valid architecture, and for many teams a property graph with LangGraph or LlamaIndex’s knowledge graph module will be simpler to operate and good enough.
The Broader Landscape: TrustGraph Did Not Invent This
Before we go further, some necessary context. The idea of using knowledge graphs to ground LLM responses is not new, and “context graph” is not a category that TrustGraph created from scratch. It is a refined evolution of work that has been shipping in production since late 2024.
Microsoft GraphRAG published the foundational “From Local to Global” paper in April 2024, introducing community-based summarisation of knowledge graphs for query-focused retrieval. Their approach extracts entities and relationships, clusters them into hierarchical communities using the Leiden algorithm, then pre-generates summaries at each level. It is open source, integrates with Neo4j, and has an Azure solution accelerator. Microsoft also shipped LazyGraphRAG (November 2024) to address the cost problem, and BenchmarkQED (June 2025) for automated RAG evaluation.
Neo4j + LangChain/LangGraph is arguably the most widely deployed graph RAG stack in production today. Neo4j’s property graph model with Cypher queries is simpler to learn than SPARQL, has a massive developer community, and integrates directly with LangChain’s retrieval chains. For teams already running Neo4j, adding graph-enhanced RAG requires no new infrastructure.
LlamaIndex Knowledge Graphs provides a Python-native graph RAG pipeline that works with Neo4j, Nebula Graph, and others. It handles entity extraction, graph construction, and hybrid vector+graph retrieval with significantly less operational complexity than a full RDF stack.
What TrustGraph adds to this landscape is specifically the combination of RDF-native ontology support, built-in explainability traces, portable context cores, and multi-model storage (Cassandra, Qdrant, etc.) in a single open-source platform. These are genuine differentiators for provenance-heavy enterprise use cases. But if you do not need ontology enforcement or full reasoning traces, the simpler alternatives above will get you 80% of the benefit at 20% of the operational complexity.
Where Vector RAG Falls Apart (and Context Graphs Save You)
Vector RAG seemed like the answer to everything when embeddings first became cheap. Embed your documents, find similar chunks, feed them to the LLM. Fast, simple, works for demos. Then you deploy it in production and discover the failure modes.
Case 1: The Averaging Problem
You embed two documents. One says “Tokyo’s population is 37.4 million.” The other says “Tokyo has about 35 million people.” Both are semantically similar to the query “What is Tokyo’s population?” The LLM sees both chunks and generates something in between. Maybe 36 million. Confidently wrong.
// Vector RAG retrieval for "What is Tokyo's population?"
chunk_1: "Tokyo's population is 37.4 million" (similarity: 0.94)
chunk_2: "Tokyo has about 35 million people" (similarity: 0.92)
// LLM output: "Tokyo has approximately 36 million people" -- wrong
// Context graph retrieval
node: Tokyo { population: 37400000, source: "UN World Population Prospects 2024",
confidence: 1.0, lastVerified: "2024-07-01" }
// LLM output: "Tokyo's population is 37.4 million" -- correct, sourced, verifiable
A graph stores one value. The correct value. With a source and a timestamp. No ambiguity, no averaging, no hallucination.
Case 2: The Multi-Hop Blindness
Ask Vector RAG: “How does climate change affect AI research funding?” It needs to traverse: climate change affects government priorities, which influence research funding allocation, which supports AI research. Each of those facts lives in a different document. Vector RAG retrieves chunks that are individually similar to the question but cannot connect them into a reasoning chain.
// Vector RAG: retrieves 3 chunks that mention some of these concepts
// but cannot chain: climate -> govt priorities -> funding -> AI research
// Result: vague, hedge-filled answer
// GraphRAG: traverses the reasoning path
climate_change --[affects]--> government_priorities
government_priorities --[influences]--> research_funding
research_funding --[supports]--> ai_research
// Result: specific, grounded answer with full provenance chain
Independent benchmarks from Iterathon’s 2026 enterprise guide report GraphRAG achieving 83-87% accuracy on complex multi-hop queries versus Vector RAG’s 68-72%. Microsoft’s own evaluation found GraphRAG improved comprehensiveness by 26% and diversity by 57% over standard vector retrieval. These numbers are promising, but a caveat: most published benchmarks come from vendors or researchers with a stake in the outcome. Independent, apples-to-apples comparisons across Microsoft GraphRAG, Neo4j + LangChain, LlamaIndex, and TrustGraph on the same dataset remain conspicuously absent from the literature.
Case 3: The Lost-in-the-Middle Catastrophe
Here is the one that should worry every engineer relying on long context windows as a substitute for proper retrieval. Research by Liu et al. at Stanford demonstrated that LLMs consistently fail to use information placed in the middle of long contexts, even when the context window is enormous.
TrustGraph’s own testing confirms this pattern holds across models. Chunks of 1,000 tokens extracted 2,153 graph edges. Chunks of 8,000 tokens extracted only 1,352. That is a 59% increase in extracted knowledge just from chunking smaller, using only 4% of the available context window. At 500 tokens, the system extracted 2,975 edges, a 120% improvement over 8,000-token chunks. This pattern held across eight models from six providers: Claude, Gemini, Mistral, Cohere, Llama, and others.
Long context windows do not work. Not because the models are bad, but because the transformer attention mechanism dilutes focus as token count rises. This appears to be inherent to the architecture itself. Context graphs sidestep the problem entirely: instead of cramming everything into a massive context, you extract a small, focused, structured subgraph. The LLM gets exactly what it needs and nothing else.
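If you want to act on the chunking numbers, the fix is mechanical. A minimal token-bounded chunker – assuming the rough 4-characters-per-token heuristic and paragraph-boundary splits, not TrustGraph’s actual chunking code:
// Split text into ~500-token chunks on paragraph boundaries
// Token estimate: ~4 characters per token -- a heuristic, not a real tokenizer
function chunkByParagraph(text, maxTokens = 500) {
  const estimateTokens = (s) => Math.ceil(s.length / 4);
  const chunks = [];
  let current = '';
  for (const para of text.split(/\n\s*\n/)) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (estimateTokens(candidate) > maxTokens && current) {
      chunks.push(current); // close the chunk before it overflows
      current = para; // note: one oversized paragraph still becomes its own chunk
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}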
Bigger context windows, worse extraction. The lost-in-the-middle effect is real and it is not getting better.
How to Actually Deploy This: From Zero to Context Graph
TrustGraph is open source (Apache 2.0) and deploys via Docker Compose in minutes. Here is the real pipeline, not the marketing version:
Step 1: Configure and Deploy
# Install and configure TrustGraph
npx @trustgraph/config
# Interactive prompts:
# ? Select your LLM provider: Anthropic / OpenAI / Google / Mistral / Ollama
# ? Select deployment target: Docker / Kubernetes / Minikube
# Generates docker-compose.yaml and INSTALLATION.md
# Deploy
docker compose up -d
# Workbench available at http://localhost:8888
# Grafana monitoring at http://localhost:3000
Step 2: Query with Full Provenance
# GraphRAG query with full provenance
tg-invoke-graph-rag \
-f security-graphrag \
-C company-docs \
-q "What are our top cybersecurity vulnerabilities?"
# Or via the REST API
curl -X POST http://localhost:8001/api/invoke/graph-rag \
-H "Content-Type: application/json" \
-d '{
"flow-id": "security-graphrag",
"collection": "company-docs",
"query": "What are our top cybersecurity vulnerabilities?",
"max-entities": 50,
"relevance-threshold": 0.7,
"include-provenance": true
}'
The TypeScript client library (@trustgraph/client) provides WebSocket-based real-time communication for building production UIs. Python and CLI interfaces are also available.
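For Node users, the same call via the built-in fetch – a sketch that simply reuses the endpoint and request fields from the curl example above:
// query-graph-rag.js -- same request as the curl example, from Node (18+ has fetch built in)
const response = await fetch('http://localhost:8001/api/invoke/graph-rag', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    'flow-id': 'security-graphrag',
    'collection': 'company-docs',
    'query': 'What are our top cybersecurity vulnerabilities?',
    'max-entities': 50,
    'relevance-threshold': 0.7,
    'include-provenance': true,
  }),
});
if (!response.ok) throw new Error(`GraphRAG query failed: ${response.status}`);
console.log(await response.json()); // answer plus the provenance trail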
Step 3: Add Ontologies for Precision (Optional but Recommended)
The mechanics are the ones covered in “The Ontology Question” above: define the classes and relationships your domain allows, and the OntologyRAG flow constrains extraction to that schema instead of letting the LLM invent its own vocabulary.
The Unglamorous Reality: What Graph RAG Actually Costs You
Every GraphRAG vendor demo shows the happy path. Here is what they leave out.
Ingestion Is Expensive and Slow
Building a knowledge graph requires running every document chunk through an LLM for entity and relationship extraction. This is not free. Microsoft’s original GraphRAG architecture dedicates roughly 75% of total indexing cost to graph extraction alone. One production deployment reported $33,000 in indexing costs for a large dataset before a single query was run. A 10,000-document corpus that costs under $5 to embed in a vector database costs $50-200 to process through a GraphRAG pipeline. For context: that is a 10-40x cost multiplier at ingestion time.
Entity Resolution Is the Silent Killer
When your LLM extracts entities from thousands of documents, it will create duplicates. “IBM”, “International Business Machines”, “IBM Corp”, and “Big Blue” are all the same entity. If your entity resolution accuracy drops below roughly 85%, the errors compound exponentially through multi-hop queries. At 85% accuracy with 5 hops, fewer than half your answers remain trustworthy (0.85^5 = 44%). This is not a theoretical problem; it is the most common failure mode in production GraphRAG systems, and neither TrustGraph nor anyone else has fully solved it.
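The compounding arithmetic is worth running yourself – resolution accuracy raised to the power of the hop count:
// How entity-resolution errors compound across multi-hop traversals
for (const accuracy of [0.99, 0.95, 0.85]) {
  const decay = [1, 2, 3, 4, 5]
    .map((hops) => `${hops} hops: ${(accuracy ** hops * 100).toFixed(0)}%`)
    .join(', ');
  console.log(`resolution ${accuracy}: ${decay}`);
}
// resolution 0.85: 1 hops: 85%, ..., 5 hops: 44% -- fewer than half your answers survive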
Ontology Maintenance Is a Governance Burden
TrustGraph’s OntologyRAG produces more precise graphs, no question. But someone has to design that ontology, maintain it as your domain evolves, and ensure new documents conform to the schema. In practice, this means a dedicated knowledge engineer or a committee that reviews and updates the ontology quarterly. For organisations that already struggle to maintain a data dictionary, adding OWL ontology governance is a non-trivial ask.
Three Indexes, Three Consistency Problems
Production graph RAG requires keeping three synchronized stores: a graph index for structural traversal, a vector index for semantic similarity, and often a text index for full-text search. Every document addition, update, or deletion must propagate across all three and trigger entity resolution re-evaluation. This is, bluntly, a data engineering nightmare that most demos conveniently skip.
Extraction Hallucinations Are Real
The LLM that extracts entities and relationships from your documents will hallucinate some of them. It will invent relationships that do not exist in the source text, misattribute properties, and occasionally create phantom entities. These extraction hallucinations then become “facts” in your knowledge graph, where they are retrieved with the same confidence score as legitimate data. Garbage in, graph out. Every production deployment needs a quality assurance pipeline to catch extraction errors, and most teams underestimate this effort.
Query Latency Is Not Milliseconds
Vector search returns results in single-digit milliseconds. Graph RAG queries involve: vector lookup to find entry points, graph traversal across multiple hops, LLM-based relevance scoring of candidate edges, subgraph assembly, and finally LLM generation. End-to-end latency is typically 2-15 seconds depending on graph size and traversal depth. For interactive applications where users expect sub-second responses, this is a hard constraint that no amount of clever engineering fully eliminates.
When Context Graphs Are Essential (Real Use Cases)
Context graphs are not a universal hammer. They are a precision instrument for specific categories of problem. Here is where they earn their keep:
Financial compliance and audit: A financial analyst querying regulatory exposure across multiple counterparties needs multi-hop reasoning across hundreds of documents. Every answer must be traceable to source documents for regulatory compliance. SowFin, a corporate finance company, uses TrustGraph to deliver accurate, explainable insights in exactly this domain.
Security operations: Huntbase uses TrustGraph to build Context Cores for SecOps, where AI hallucinations in threat detection are not just inconvenient but dangerous. Cybersecurity requires connecting events, metadata, and threat indicators across thousands of log entries with full provenance.
Medical and clinical research: Clinical informaticists analysing treatment interactions across patient comorbidities need graph traversal to connect drugs, conditions, contraindications, and outcomes across multiple clinical databases. Approximate similarity search is not acceptable when lives are involved.
Supply chain management: Tracing component dependencies multiple tiers deep requires genuine relationship traversal. “Which suppliers are affected if factory X in Shenzhen shuts down?” demands multi-hop graph queries that Vector RAG simply cannot do.
Legal document analysis: Connecting clauses across contracts, precedents across cases, and regulations across jurisdictions. Every connection must be verifiable and traceable.
Enterprise knowledge management: The “monograph” approach (a single unified graph across all your organisation’s knowledge) enables discovery of relationships across departments and domains that siloed systems miss. This is not unique to TrustGraph; any sufficiently connected knowledge graph achieves this, whether built with Neo4j, Microsoft GraphRAG, or TrustGraph.
When Context Graphs Are Overkill (Be Honest With Yourself)
Now for the part that most GraphRAG vendors would rather you did not read. Context graphs are genuinely overkill for a significant number of common AI use cases. Using one when you do not need one is like hiring a structural engineer to hang a picture frame.
Small datasets that fit in context: If your entire corpus is under 50 pages (roughly 40,000 tokens), skip RAG entirely. Stuff it all into the prompt. It costs $0.01 per query versus $0.05 for a RAG pipeline, deploys in a day versus four weeks, and the LLM can attend to all of it directly. No chunking, no embeddings, no graph. Simple prompt engineering wins.
General knowledge queries: Questions the LLM already knows the answer to (world history, common programming patterns, basic science) gain nothing from RAG. You are adding latency without improving accuracy.
Simple semantic lookup: “Find me documents similar to this one.” A vector store alone is faster, cheaper, and simpler. You do not need graph traversal for similarity search.
Ephemeral data with unstable entities: If your corpus changes hourly and the entities and relationships are not stable enough to maintain, the cost of continuous knowledge extraction will exceed the value. A vector store with frequent re-indexing may be more practical.
Speed-critical applications: Vector RAG delivers millisecond responses. GraphRAG takes seconds, sometimes minutes for complex traversals. If sub-100ms latency is a hard requirement, graphs add unacceptable overhead.
Prototyping and MVPs: Vector RAG takes hours to set up. A full knowledge graph pipeline takes weeks. For a proof of concept, start with Vector RAG and upgrade to GraphRAG only when you have evidence that relationship-aware retrieval would improve your results.
Single-fact lookup: “What is the capital of France?” Both approaches achieve 94-95% accuracy on simple factual queries. The graph adds no value here.
The honest decision matrix: if your questions require understanding relationships between entities, connecting information across multiple documents, or producing explainable, auditable answers, you need a graph. But “need a graph” does not mean “need TrustGraph specifically”. A Neo4j instance with LangChain retrieval chains, Microsoft GraphRAG with community summaries, or LlamaIndex’s knowledge graph module may be simpler to deploy, cheaper to run, and sufficient for your use case. Evaluate the alternatives before committing to the heaviest solution. And if your data fits in a context window, you might not need RAG at all.
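If it helps, that matrix collapses into a few lines of triage logic – a sketch with thresholds pulled from the sections above, not a substitute for measuring your own workload:
// Rough retrieval-architecture triage; thresholds are from the discussion above
function chooseRetrieval({ corpusTokens, needsMultiHop, needsProvenance, maxLatencyMs }) {
  if (corpusTokens < 40_000) return 'no RAG -- put the whole corpus in the prompt';
  if (maxLatencyMs !== undefined && maxLatencyMs < 1_000) {
    return 'vector RAG -- graph traversal cannot hit sub-second latency';
  }
  if (needsMultiHop || needsProvenance) {
    return 'graph RAG -- evaluate Neo4j + LangChain, Microsoft GraphRAG, LlamaIndex, TrustGraph';
  }
  return 'vector RAG -- the simplest thing that works';
}
console.log(chooseRetrieval({ corpusTokens: 2_000_000, needsMultiHop: true, needsProvenance: true }));
// -> graph RAG -- evaluate Neo4j + LangChain, Microsoft GraphRAG, LlamaIndex, TrustGraph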
The Neuro-Symbolic Promise (and Why This Actually Matters)
Daniel Davis makes a point in the demo that deserves its own section. The deep learning camp believed that enough data and compute would magically produce ground truth. Throw enough parameters at the problem and the model would learn to reason. The neuro-symbolic camp argued you would always need richer semantic structures because language is fundamentally ambiguous, and statistical pattern matching cannot resolve that ambiguity alone.
Context graphs are the practical vindication of the neuro-symbolic position. The LLM handles what it is good at: understanding natural language queries, interpreting intent, generating fluent responses. The graph handles what it is good at: storing precise facts, maintaining relationships, providing provenance, enabling deterministic traversal. Neither can solve the full problem alone. Together they produce something that neither approach could achieve independently.
This division of labour, as described in the TrustGraph demo, is not just a technical architecture decision. It is a philosophical one about what AI systems should and should not be trusted to do. LLMs should generate language. They should not be trusted as databases. Graphs should store and retrieve facts. They should not be expected to understand natural language. Each doing what it does best: that is the future of reliable AI systems.
Other Resources Worth Watching
The TrustGraph video is one perspective in a rapidly maturing field. These resources provide alternative viewpoints and competing approaches:
“What is a Context Graph?” — TrustGraph’s conceptual explainer on where knowledge graphs end and context graphs begin.
“The 2025 State of RAG” — Daniel Davis and Kirk Marple revisit their 2024 predictions and make forecasts for 2026.
“Practical GraphRAG” by Michael Hunger — turning text data into graph structures with LLMs, entity extraction, and community clustering.
“Knowledge Graphs & GraphRAG” by Zach Blumenthal — practical graph design patterns and retrieval strategies for GenAI.
Audit your current RAG pipeline’s failure modes. Ask it multi-hop questions that require connecting information across documents. If it fails or hallucinates, you have a graph-shaped problem.
Test the “same question, different words” scenario. Ask semantically equivalent questions and compare outputs. If the answers diverge wildly, your retrieval layer lacks semantic understanding.
Measure your chunk sizes. If you are chunking above 1,000 tokens, you are likely losing information to the lost-in-the-middle effect. Consider chunking at 500-1,000 tokens regardless of your context window size.
Evaluate whether you actually need a graph. Run the honest assessment: does your use case require multi-hop reasoning, explainability, or relationship traversal? If not, a well-tuned Vector RAG pipeline might be all you need.
Try TrustGraph locally. Run npx @trustgraph/config, choose Docker, and docker compose up -d. Load a few documents and explore the Workbench. You can have a working context graph in under an hour. It is free and open source (Apache 2.0).
Check your explainability requirements. If you are building for regulated industries (finance, healthcare, legal), ask whether you can trace every AI-generated answer back to its source documents. If the answer is no, context graphs are not optional, they are mandatory.
The real question is not whether context graphs are useful. It is whether your use case demands them.
Video Attribution
This article is based on the TrustGraph demo “Context Graphs in Action” by Daniel Davis and Mark Adams. The video demonstrates TrustGraph 2’s context graph capabilities, explainability features, and source provenance using a London venues dataset. No marketing, no hype, just a real demo of real context graphs.
Google just published a compression algorithm so efficient that it sent memory chip stocks tumbling across three continents in a single trading session. SK Hynix down 6%. Samsung down 5%. Micron bleeding for six days straight. Billions of dollars in market capitalisation evaporated because a team of researchers figured out a cleverer way to point at things. That is not a metaphor. That is literally what they did. Welcome to TurboQuant, the algorithm that halves the cost of running every large language model on the planet, and the wildest part is that Google just gave it away for free.
TurboQuant: pointing instead of giving directions
What the KV Cache Actually Is (And Why Everyone Should Care)
Before we get into what Google built, you need to understand the bottleneck they solved. Every large language model, whether it is ChatGPT, Claude, Gemini, or Llama, runs on the transformer architecture. And transformers have this mechanism called attention, which is how the model figures out what words mean in context.
Here is a quick thought experiment. If I say “it was tired,” you have no idea what “it” refers to. A dog? A server? A metaphor for the state of modern JavaScript? But if I say “the animal didn’t cross the street because it was too tired,” suddenly “it” is loaded with meaning. It is an animal. It didn’t cross. It was tired. Your brain just did what transformers do: it looked at the surrounding words to figure out what one word actually means.
The problem is that transformers need to remember these relationships. Every time the model processes a token, it calculates how that token relates to every other token it has seen so far. These relationships get stored in what is called the key-value cache (KV cache). Think of it as a filing cabinet. Each “folder” has a label on the front (the key, which is a rough tag so the model can find it quickly) and detailed notes inside (the value, which is the actual rich meaning and relationships).
The catch? This filing cabinet grows linearly with context length. A 128K context window means 128,000 tokens worth of folders, each containing high-dimensional vectors stored at 16-bit precision. For a model like Llama 3.1 with 8 billion parameters, the KV cache alone can eat several gigabytes of GPU memory. For larger models with longer contexts, it becomes the single biggest memory bottleneck in the entire inference pipeline. Not the model weights. Not the activations. The KV cache.
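To make those numbers concrete, here is the arithmetic, using Llama 3.1 8B’s published configuration (32 layers, 8 key-value heads under grouped-query attention, head dimension 128, fp16) – treat the output as an estimate:
// KV cache size = 2 (K and V) x layers x kv_heads x head_dim x bytes, per token
const layers = 32, kvHeads = 8, headDim = 128, bytesFp16 = 2;
const bytesPerToken = 2 * layers * kvHeads * headDim * bytesFp16; // 131,072 B = 128 KB/token
for (const tokens of [8_000, 32_000, 128_000]) {
  const gb = (bytesPerToken * tokens) / 1024 ** 3;
  console.log(`${tokens.toLocaleString()} tokens -> ${gb.toFixed(1)} GB of KV cache`);
}
// 8,000 -> 1.0 GB, 32,000 -> 3.9 GB, 128,000 -> 15.6 GB -- before any model weights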
“Vector quantization is a powerful, classical data compression technique that reduces the size of high-dimensional vectors. This optimization addresses two critical facets of AI: it enhances vector search […] and it helps unclog key-value cache bottlenecks by reducing the size of key-value pairs.” — Google Research, TurboQuant Blog Post (March 2026)
Traditional approaches to compressing the KV cache use something called quantisation, which reduces the precision of the stored numbers. Instead of 16 bits per value, you use 8 bits, or 4 bits. The problem is that most quantisation methods need to store calibration constants (a zero point and a scale factor) for every small block of data. These constants have to be stored at full precision, which adds 1-2 extra bits per number. You are trying to compress, but your compression metadata is eating into your savings. It is like buying a wallet so expensive it defeats the purpose of saving money.
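The overhead is easy to put numbers on. A typical block-quantisation scheme stores an fp16 scale and an fp16 zero point for every block of values – a sketch with common block sizes:
// Per-value cost of block-quantisation metadata
const metadataBits = 16 + 16; // fp16 scale + fp16 zero point per block
for (const blockSize of [128, 32, 16]) {
  const overhead = metadataBits / blockSize;
  console.log(`block of ${blockSize}: 4-bit codes + ${overhead} extra bits/value = ${4 + overhead} effective bits`);
}
// block of 16: 4 + 2 = 6 effective bits -- a third of what you store is metadata
// Smaller blocks quantise more accurately but pay more overhead: the trap PolarQuant sidesteps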
PolarQuant: The Art of Pointing Instead of Giving Directions
This is where Google’s insight gets genuinely elegant. Imagine you are standing in a city and someone asks you how to get to an office on the third floor of a building two blocks east and three blocks north. The standard approach is step-by-step Cartesian directions: go two blocks east, then three blocks north, then up three floors. Each dimension gets its own coordinate.
But there is another way. You could just point at the building and say “it is 500 feet away in that direction.” One angle, one distance. Same destination, less information to store.
That is PolarQuant. Instead of storing each dimension of a vector independently (the Cartesian way), it converts the vector into polar coordinates: a radius (how strong or important the data is) and an angle (what direction it points in, which encodes its meaning).
“Instead of looking at a memory vector using standard coordinates that indicate the distance along each axis, PolarQuant converts the vector into polar coordinates […] This is comparable to replacing ‘Go 3 blocks East, 4 blocks North’ with ‘Go 5 blocks total at a 37-degree angle’.” — Google Research, TurboQuant Blog Post
Why is this so much more compressible? Here is the key mathematical insight. When you randomly rotate high-dimensional vectors (which is PolarQuant’s first step), something beautiful happens: the coordinates follow a concentrated Beta distribution. In plain English, the angles cluster tightly into a predictable, narrow range. They are not scattered randomly across all possible values. They bunch up.
This means the model no longer needs to perform expensive data normalisation. Traditional methods map data onto a “square” grid where the boundaries change constantly and need to be recalculated and stored for every block. PolarQuant maps data onto a fixed, predictable “circular” grid where the boundaries are already known. No calibration constants needed. No overhead.
Here is a concrete way to think about it. Imagine you are mapping people on a 2D chart where the X-axis is age and the Y-axis represents some semantic concept. In Cartesian coordinates, you store (x, y) for each person. In polar coordinates, you store (distance from origin, angle). The angle between “grandmother” and “grandfather” is predictable. The angle between “boy” and “girl” is predictable. These patterns are exploitable for compression precisely because they are so regular in high dimensions.
// Cartesian: store each dimension independently
// For a d-dimensional vector, you need d values at full precision
const cartesian = { x: 3.14159, y: 2.71828, z: 1.41421 };
// Plus quantisation overhead: zero_point + scale per block
// Adds 1-2 extra bits per value
// Polar (PolarQuant): store radius + angles
// After random rotation, angles are tightly concentrated
// No calibration constants needed
const polar = { radius: 4.388, angle_1: 0.7133, angle_2: 0.3280 }; // same point as the cartesian example
// The angles live in a predictable, narrow range
// Quantise directly onto a fixed grid -- zero overhead
From step-by-step directions to a single compass bearing
QJL: The 1-Bit Error Checker That Makes It Lossless
PolarQuant does the heavy lifting. It is responsible for the bulk of the compression. But no compression is perfect, and PolarQuant leaves behind a tiny residual error. This is where the second component comes in, and it is arguably just as clever.
The Quantised Johnson-Lindenstrauss (QJL) algorithm takes the small error left over from PolarQuant and squashes it down to a single sign bit per value: +1 or -1. That is it. One bit. The technique is based on the Johnson-Lindenstrauss lemma, a foundational result in dimensionality reduction that says you can project high-dimensional data into a much lower-dimensional space whilst preserving the distances between points.
What QJL does specifically is eliminate bias in the inner product estimation. This is critical because attention scores in transformers are computed as inner products (dot products) between query and key vectors. If your compression introduces a systematic bias in these dot products, the model’s attention mechanism starts paying attention to the wrong things. It is like having a compass that is consistently off by 3 degrees; every direction you follow drifts further from where you actually want to go.
QJL uses a special estimator that balances a high-precision query vector against the low-precision compressed data. The result is an unbiased inner product estimate with zero memory overhead. The 1-bit correction is so small it is essentially free to store, but it perfectly cancels out the residual error from PolarQuant.
// Stage 1: PolarQuant (main compression)
// 16-bit KV cache -> ~3 bits per channel
// Does most of the heavy lifting
// Tiny residual error remains
// Stage 2: QJL (error correction)
// Takes the residual from PolarQuant
// Reduces it to 1 sign bit (+1 or -1) per value
// Eliminates bias in attention score computation
// Memory overhead: essentially zero
// Combined: TurboQuant
// 3-bit KV cache with ZERO accuracy loss
// No retraining, no fine-tuning, no calibration
// Just swap it in and the model stays identical
Together, PolarQuant + QJL = TurboQuant. The compression engine and its error checker. The paper proves that TurboQuant achieves distortion rates within a factor of approximately 2.7 of the information-theoretic lower bound, the absolute mathematical limit of how well any quantiser could ever perform. In the language of information theory, this is approaching the Shannon limit. There is not much room left to improve.
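To see why one sign bit per value can still give an unbiased answer, here is a toy simulation of the principle (not the paper’s exact kernel). For a Gaussian projection vector s, the expectation of <s,q> * sign(<s,k>) works out to sqrt(2/pi) * <q,k> / ||k||, so scaling by sqrt(pi/2) * ||k|| recovers the true dot product on average:
// Toy demo: estimating an inner product from sign bits alone (the QJL idea)
const d = 64, m = 50_000;
const randn = () => Math.sqrt(-2 * Math.log(1 - Math.random())) * Math.cos(2 * Math.PI * Math.random());
const vec = () => Array.from({ length: d }, randn);
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

const q = vec(), k = vec();
const kNorm = Math.sqrt(dot(k, k));

// Per projection we keep only sign(<s_i, k>) -- one bit -- plus ||k|| stored once
let acc = 0;
for (let i = 0; i < m; i++) {
  const s = vec();
  acc += dot(s, q) * Math.sign(dot(s, k));
}
const estimate = Math.sqrt(Math.PI / 2) * kNorm * (acc / m);

console.log('true <q,k>:', dot(q, k).toFixed(2));
console.log('1-bit estimate:', estimate.toFixed(2)); // converges to the true value as m grows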
Theory is nice, but what actually happened when they tested this on real hardware with real models? Google ran TurboQuant through a gauntlet of benchmarks on open-source models (Gemma, Mistral, Llama) running on NVIDIA H100 GPUs. The results are not incremental. They are a step change.
The Headline Numbers
6x KV cache memory reduction. A cache that previously required 16 bits per value now needs under 3 bits. On a model that was using 6 GB of KV cache memory, you now need roughly 1 GB.
Up to 8x attention speedup. The attention computation (the most expensive part of inference) runs up to 8 times faster on H100 GPUs. This does not mean the entire model is 8x faster, but the bottleneck operation is.
Zero accuracy loss. At 3.5 bits per channel, TurboQuant achieves what the authors call “absolute quality neutrality.” The compressed model produces identical results to the uncompressed model. Even at 2.5 bits per channel, degradation is marginal.
No retraining required. This is not a new model architecture. There is no fine-tuning step, no calibration dataset, no model-specific tuning. You slot TurboQuant into the inference pipeline and the existing model just works better.
Benchmark Breakdown
The team tested across five major long-context benchmarks:
LongBench — question answering, summarisation, code generation across diverse tasks
Needle in a Haystack — finding one specific piece of information buried in massive documents
ZeroSCROLLS — long-document understanding tasks
RULER — synthetic benchmarks that stress-test context window utilisation
L-Eval — comprehensive evaluation of long-context capabilities
Across all of them, TurboQuant achieved perfect downstream results whilst reducing KV cache memory by at least 6x. PolarQuant alone was nearly lossless. With QJL added on top, it became mathematically unbiased.
6x compression, 8x speedup, zero loss. The rare triple.
The Stock Market Bloodbath (And Why Analysts Say Calm Down)
Google published TurboQuant on 24 March 2026. Within 48 hours, billions of dollars had been wiped off memory chip stocks across three continents.
The logic seemed straightforward: if AI models need 6x less memory, companies that make memory chips are going to sell fewer chips. Right?
The Damage Report
SK Hynix (South Korea) — down 6.23%
Samsung (South Korea) — down nearly 5%
Kioxia (Japan) — down nearly 6%
Micron (USA) — down over 20% across six trading sessions
SanDisk (USA) — down 11%
Western Digital (USA) — down 6.7%
Seagate (USA) — down 8.5%
The broader Korean KOSPI index fell as much as 3%. Matthew Prince, CEO of Cloudflare, called it “Google’s DeepSeek moment,” referencing the January 2025 DeepSeek sell-off that wiped nearly a trillion dollars off the Nasdaq.
But here is the thing. Analysts are not panicking. In fact, most of them are telling investors to buy the dip.
Ray Wang, a memory analyst at SemiAnalysis, told CNBC:
“When you address a bottleneck, you are going to help AI hardware to be more capable. And the training model will be more powerful in the future. When the model becomes more powerful, you require better hardware to support it.” — Ray Wang, SemiAnalysis, via CNBC (March 2026)
Ben Barringer, head of technology research at Quilter Cheviot, was even more direct: “Memory stocks have had a very strong run and this is a highly cyclical sector, so investors were already looking for reasons to take profit. The Google Turboquant innovation has added to the pressure, but this is evolutionary, not revolutionary. It does not alter the industry’s long-term demand picture.”
For context, memory stocks had been on an absolute tear before this. Samsung was up nearly 200% over the prior year. SK Hynix and Micron were up over 300%. A correction was arguably overdue, and TurboQuant gave skittish investors the excuse they needed.
Jevons Paradox: Why Efficiency Makes You Use More, Not Less
The most important framework for understanding TurboQuant’s long-term impact is not computer science. It is economics. Specifically, a concept from 1865.
In The Coal Question, economist William Stanley Jevons documented something counterintuitive: when James Watt’s innovations made steam engines dramatically more fuel-efficient, Britain’s coal consumption did not fall. It increased tenfold. The efficiency gains lowered coal’s effective cost, which made it economical for new applications and industries. The per-unit savings were overwhelmed by the explosion in total usage.
This is the Jevons paradox, and it has been playing out in AI with striking precision. Between late 2022 and 2025, the cost of running large language models collapsed roughly a thousandfold; GPT-4-equivalent performance alone dropped from $20 to $0.40 per million tokens. Did people use fewer tokens? Enterprise generative AI spending skyrocketed from $11.5 billion in 2024 to $37 billion in 2025 – more than a threefold jump. When OpenAI dropped API prices by 10x, API calls grew 100x.
The same pattern will almost certainly play out with TurboQuant. If it suddenly costs half as much to run a Frontier model, companies will not pocket the savings and go home. They will run bigger models, longer contexts, more agents, more concurrent sessions. Workloads that were previously too expensive become viable. The 200K-context analysis that cost too much to justify? Now it makes business sense. The always-on AI assistant that was too expensive to run 24/7? Now it is affordable.
Morgan Stanley’s analysts made exactly this argument, citing Jevons paradox to characterise the long-term impact on storage demand as “neutral to positive.” The market overpriced the short-term headline and underpriced the second-order effects.
What This Means for Anyone Using AI Right Now
Let us get concrete about who benefits and how.
Enterprises Running Models at Scale
If you are an enterprise running large language models in production, TurboQuant translates roughly to a 50% reduction in inference costs. This is not a marginal optimisation. This applies to every prompt, every API call, every chatbot response, every agentic workflow. API calls get cheaper. Faster responses. More requests per second on the same hardware. The ability to run longer context windows without hitting memory limits.
Context Windows Get Bigger on the Same Hardware
If a GPU was maxing out at a certain context length because the KV cache filled the available memory, TurboQuant effectively multiplies the available context by 6x. A model that topped out at 32K tokens on a given GPU could now handle 192K tokens. This is significant for code analysis, legal document review, medical record processing, and any workload where more context means better output.
The Anthropic Mythos Situation
Anthropic’s upcoming Mythos model has been described as “very expensive for us to serve, and will be very expensive for our customers to use.” Early pricing estimates suggest 2-5x the cost of Claude Opus. TurboQuant could meaningfully change that calculus. If inference costs drop by half, a model that was borderline unviable for production use cases suddenly becomes economically defensible. Whether Anthropic adopts TurboQuant specifically or implements similar techniques, the pressure to do so just became enormous.
Individual Power Users
Andrej Karpathy, former Tesla AI lead and OpenAI researcher, recently said in an interview that he gets “nervous when I have subscription left over” because “that just means I haven’t maximised my token throughput.” He now runs multiple AI agents in parallel across separate repository branches, treating token consumption as his primary productivity constraint. NVIDIA CEO Jensen Huang has said he expects employees earning $500,000 to use $250,000 worth of tokens. If TurboQuant halves the cost of those tokens, the effective value of every subscription doubles overnight.
Same hardware, twice the output. The new math of AI inference.
Google’s Quiet Giant Move: Why They Published Instead of Hoarding
There is a pattern here that deserves attention. In 2017, a team at Google published “Attention Is All You Need” by Vaswani et al., the paper that introduced the transformer architecture. That single paper became the foundation for GPT, Claude, Gemini, Llama, Mistral, and essentially every large language model in existence. Most of Google’s competitors are built on Google’s published research.
They did it again with TurboQuant. Google could have kept this internal. They could have quietly deployed it across their infrastructure, pocketed the 50% cost savings on Gemini inference, and used the competitive advantage to undercut everyone else on pricing. That is the standard playbook. But they published it. The paper is on arXiv. The blog post explains the technique in detail. Community implementations appeared on PyPI and GitHub within days.
This is not altruism (Google benefits enormously from being the company that publishes foundational research, and they have the infrastructure to move fastest on their own inventions). But the effect is real. Every company running AI models, every open-source project, every independent developer benefits from this work being public.
As Martin Kleppmann writes in Designing Data-Intensive Applications, the most impactful systems are often the ones that reduce the cost of doing something by an order of magnitude, because they do not just make existing use cases cheaper; they create entirely new categories of application that were previously uneconomical. TurboQuant is precisely that kind of step change.
When TurboQuant Does Not Apply (The Honest Bit)
No article from this site would be credible without the caveats section, so here they are:
Case 1: Training Is Untouched
TurboQuant is an inference optimisation. It compresses the KV cache, which is used during inference (when the model generates responses). It does not reduce the cost of training a model. The multi-billion-dollar GPU clusters that companies like Google, OpenAI, and Meta use to train Frontier models are not affected. Training has its own bottlenecks (gradient accumulation, all-reduce communication, activation memory), and TurboQuant addresses none of them.
Case 2: It Only Compresses the KV Cache
The 6x memory reduction applies specifically to the KV cache, not to the model weights, not to the activations, and not to the total GPU memory usage. For many inference workloads, the KV cache is the dominant memory consumer, especially at long context lengths. But for short prompts on large models, the model weights themselves might be the bottleneck. TurboQuant helps a lot in the first scenario and less in the second.
Case 3: You Still Need GPUs
TurboQuant makes existing hardware more efficient. It does not eliminate the need for GPUs (or TPUs). You still need compute to run models. What changes is how much work each GPU can do. Think of it as improving fuel efficiency in a car: you still need the car, and you still need fuel, but you go further on each tank.
Case 4: The 8x Speedup Is for Attention, Not End-to-End
The headline “8x speedup” refers to the attention computation specifically, not the total inference time. A full model forward pass includes many other operations (feedforward layers, layer norms, embedding lookups). The end-to-end speedup depends on what fraction of total inference time is spent on attention. For long-context workloads, it is a large fraction. For short prompts, less so.
How This Actually Gets Deployed
One of TurboQuant’s strongest properties is how easy it is to adopt. Unlike techniques that require retraining or fine-tuning, TurboQuant is data-oblivious: it works without any dataset-specific preprocessing. The deployment path looks like:
No model changes. The model weights, architecture, and training are all untouched. TurboQuant operates entirely at the inference layer.
Swap the KV cache quantiser. Replace the existing KV cache storage with TurboQuant’s polar coordinate quantisation. This is a software change in the inference engine.
Choose your bit-width. At 3.5 bits per channel, you get zero accuracy loss. At 2.5 bits per channel, you get even more compression with marginal degradation. Pick based on your quality requirements.
Deploy. Run the same prompts, get the same results, use 6x less KV cache memory, and compute attention up to 8x faster.
Community implementations have already appeared. A pip-installable turboquant package is on PyPI. Third-party implementations in MLX (for Apple Silicon) and Triton (for custom GPU kernels) were published within days of the announcement. The official Google code is expected in Q2 2026.
# Community implementation (illustrative)
# pip install turboquant
from turboquant import TurboQuantConfig, apply_turboquant
config = TurboQuantConfig(
bits_per_channel=3.5, # Zero accuracy loss
enable_qjl=True, # Error correction stage
)
# Apply to any HuggingFace model's KV cache
model = apply_turboquant(model, config)
# Inference runs as normal -- same API, same outputs
# But KV cache is now 6x smaller and attention is up to 8x faster
output = model.generate(input_ids, max_new_tokens=512)
What to Check Right Now
Audit your KV cache memory usage. If you are running models in production, profile how much GPU memory your KV cache consumes. If it is a significant fraction of total memory (common for long-context workloads), TurboQuant could give you an immediate and substantial improvement.
Watch for framework integration. Keep an eye on vLLM, TensorRT-LLM, and HuggingFace TGI for native TurboQuant support. Once it lands in these frameworks, adoption becomes a config flag.
Re-evaluate your context length limits. If you capped context length because of memory constraints, TurboQuant may let you lift those caps on existing hardware. Longer context often means better output quality.
Read the actual paper. The TurboQuant paper (arXiv:2504.19874) and the PolarQuant paper (arXiv:2502.02617) are both well-written and surprisingly accessible. The Google Research blog post is an excellent entry point if you want the intuition without the proofs.
Don’t panic-sell memory stocks based on headlines. The Jevons paradox has held true for every major compute efficiency improvement in history. Efficiency does not reduce demand; it creates it. The analysts calling this “evolutionary, not revolutionary” for the memory industry are probably right.
Every MCP session starts the same way: the client calls tools/list, gets back every tool schema your server exposes, and sends the entire payload to the LLM as part of the system context. For a server with 10 tools and concise descriptions, that is a few thousand tokens – barely noticeable. For a real enterprise setup with 5 MCP servers exposing 50-100 tools total, you are burning 50,000-80,000 tokens before the user has typed a single word. That is 40% of a 200K context window, gone to tool definitions alone. This lesson covers how to measure, reduce, and eventually eliminate that tax.
The hidden token tax: tool schemas can consume 40%+ of context before any work begins.
Measuring the Problem
Before optimizing, measure. The MCP Inspector shows you the raw tools/list response. To estimate the token cost, count the JSON payload size: roughly 1 token per 4 characters of JSON.
// measure-tool-tokens.js
// Connect to an MCP server and measure the token cost of its tool schemas
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
const transport = new StdioClientTransport({
command: 'node',
args: ['./your-server.js'],
});
const client = new Client({ name: 'token-measurer', version: '1.0.0' });
await client.connect(transport);
const { tools } = await client.listTools();
const payload = JSON.stringify(tools);
const estimatedTokens = Math.ceil(payload.length / 4);
console.log(`Tools: ${tools.length}`);
console.log(`Payload size: ${(payload.length / 1024).toFixed(1)} KB`);
console.log(`Estimated tokens: ${estimatedTokens.toLocaleString()}`);
console.log(`% of 200K context: ${((estimatedTokens / 200_000) * 100).toFixed(1)}%`);
// Per-tool breakdown
for (const tool of tools) {
const toolJson = JSON.stringify(tool);
console.log(` ${tool.name}: ${Math.ceil(toolJson.length / 4)} tokens`);
}
await client.close();
Real-world numbers from production MCP servers:
MCP Server                                   Tools   Tokens    % of 200K
MySQL Server                                 106     54,600    27.3%
GitHub + Slack + Sentry + Grafana + Splunk   ~120    ~77,200   38.6%
Typical 10-tool custom server                10      ~3,000    1.5%
The problem is not a single server – it is the aggregate. Five servers with 20 tools each, at 300 tokens per tool, is 30,000 tokens. Add the system prompt, conversation history, and the reserved output buffer, and you have very little room left for actual work.
Server-Side: Description Economy
The single highest-impact optimization is writing shorter tool descriptions and leaner schemas. Here is the token cost per pattern:
Schema Pattern                        Tokens/Tool   Recommendation
Verbose description (200+ words)      ~300          Trim to 1-2 sentences
Nested object params (3+ levels)      ~180          Flatten to scalar params
Enum with 20+ values                  ~120          Use string + validate server-side
Concise description (1-2 sentences)   ~40           Target this range
// BEFORE: 280+ tokens per tool
server.tool(
'search_orders',
'Search for customer orders in the database. This tool allows you to find orders ' +
'by various criteria including customer email, order status, date range, and product ' +
'category. It returns a paginated list of matching orders with full details including ' +
'line items, shipping status, payment method, and customer information. Use this tool ' +
'when the user asks about their orders, wants to check order status, or needs to find ' +
'a specific purchase. Results are sorted by date descending by default.',
{
filters: z.object({
customer: z.object({
email: z.string().email().optional(),
id: z.string().optional(),
}).optional(),
status: z.enum(['pending', 'processing', 'shipped', 'delivered', 'returned', 'cancelled', 'refunded']).optional(),
date_range: z.object({
start: z.string().optional(),
end: z.string().optional(),
}).optional(),
}),
pagination: z.object({
page: z.number().int().min(1).default(1),
per_page: z.number().int().min(1).max(100).default(20),
}).optional(),
},
handler
);
// AFTER: ~45 tokens per tool (same functionality)
server.tool(
'search_orders',
'Find orders by email, status, or date. Returns order ID, status, total, date.',
{
email: z.string().optional().describe('Customer email'),
status: z.string().optional().describe('Order status filter'),
after: z.string().optional().describe('ISO date lower bound'),
before: z.string().optional().describe('ISO date upper bound'),
limit: z.number().int().max(100).default(20).describe('Max results'),
},
handler
);
Key changes: flattened nested objects to scalar params, replaced the verbose description with one sentence, removed the long enum (validate server-side instead), dropped the pagination wrapper object. Same functionality, 85% fewer tokens.
Server-Side: Tool Consolidation
If you have multiple similar tools, consolidate them into one with a mode or provider parameter:
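For example, three provider-specific search tools can collapse into a single tool with a provider parameter. A sketch – the tool name, parameters, and the searchHandlers map are illustrative, not from a real server:
// BEFORE: search_github, search_slack, search_jira - three full schemas in context
// AFTER: one schema, one description, a provider enum
server.tool(
  'search',
  'Search GitHub, Slack, or Jira. Returns title, URL, and date for each match.',
  {
    provider: z.enum(['github', 'slack', 'jira']).describe('System to search'),
    query: z.string().describe('Search query'),
    limit: z.number().int().max(50).default(10).describe('Max results'),
  },
  // searchHandlers is a hypothetical map of provider -> search function
  async ({ provider, query, limit }) => searchHandlers[provider]({ query, limit })
);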
A real-world case study: consolidating 20 tools into 8 reduced token cost from 14,214 to 5,663 – a 60% reduction with identical functionality.
Anthropic Tool Search: 85% Token Reduction
For servers with many tools, Anthropic offers a client-side solution: tool search with defer_loading. Instead of sending all tool schemas to the LLM upfront, you mark tools as deferred. The LLM sees only a search interface and your server’s instructions. When it needs a tool, it searches the catalog, and only the matching schemas are loaded into context.
This is an Anthropic API feature, not part of the MCP specification itself. It works with the Anthropic Messages API and Claude Code:
// Using the Anthropic Messages API with tool search
// The MCP client gathers tools from multiple servers, then marks them as deferred
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic();
// Collect tools from MCP servers
const allTools = await mcpClient.listAllTools(); // your multi-server aggregation
// Mark large/infrequent tools as deferred
const toolDefinitions = allTools.map(tool => ({
name: tool.name,
description: tool.description,
input_schema: tool.inputSchema,
// Tools marked defer_loading are NOT sent to the LLM initially
// They are loaded on-demand via tool search
defer_loading: allTools.length > 30,
}));
// Include the tool search tool (Anthropic provides two variants)
toolDefinitions.push({
type: 'tool_search_tool_bm25_20251119', // BM25 natural language search
});
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 4096,
tools: toolDefinitions,
messages: [{ role: 'user', content: 'Check the latest PR reviews on the main repo' }],
});
// Claude will:
// 1. See only the tool_search tool + non-deferred tools
// 2. Search for relevant tools using natural language
// 3. Get back 3-5 tool_reference blocks with full schemas
// 4. Call the discovered tools normally
The token impact is dramatic:
Mode                          Tokens    Context Preserved
All tools loaded              ~77,200   122,800 / 200K
Tool search (defer_loading)   ~8,700    191,300 / 200K
Two search variants exist:
tool_search_tool_regex_20251119 – Claude constructs regex patterns to search tool names and descriptions. Fast, precise.
tool_search_tool_bm25_20251119 – Claude uses natural language queries to search. Better for fuzzy matching.
Critical interaction with server instructions: when tool search is active, the LLM does not see individual tool schemas upfront. It sees only your server’s instructions field and the search tool. This makes instructions the primary signal for tool discovery – if your instructions don’t mention a capability, the model may never search for the tools that implement it.
Protocol-Side: Lazy Schema Loading (Proposed)
A proposal under discussion for the MCP specification pushes the same idea into the protocol itself. It has three parts:
Add a minimal flag to tools/list that returns only names, categories, and one-line summaries (~5K tokens for 106 tools instead of ~55K).
Add a new tools/get_schema method that fetches the full schema for a specific tool on demand (~400 tokens per tool).
Clients send the minimal list to the LLM. When the LLM wants to use a tool, the client fetches its full schema and adds it to context.
Estimated savings: 91% token reduction for the initial tool payload (54,604 tokens to 4,899 tokens for 106 tools). This is a proposal, not yet part of the specification, but it signals the direction the protocol is heading.
A related proposal, SEP-1576, adds $ref deduplication for shared parameter types across tools. If ten tools all take a customer_id parameter with the same schema, the definition is included once and referenced by all ten.
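SEP-1576 is still a proposal, so the exact wire format is undecided. With standard JSON Schema $ref semantics, though, a deduplicated tools/list payload could look roughly like this (illustrative only):
{
  "definitions": {
    "CustomerId": { "type": "string", "description": "Customer ID (format: CUS-XXXX)" }
  },
  "tools": [
    {
      "name": "get_tickets",
      "inputSchema": {
        "type": "object",
        "properties": { "customer_id": { "$ref": "#/definitions/CustomerId" } }
      }
    },
    {
      "name": "get_orders",
      "inputSchema": {
        "type": "object",
        "properties": { "customer_id": { "$ref": "#/definitions/CustomerId" } }
      }
    }
  ]
}
The CustomerId definition is serialized once instead of once per tool, so the savings grow with the number of tools sharing the type.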
Host-Side: Claude Code Context Budget
Claude Code provides a /context command that shows the exact token breakdown per component. A typical over-provisioned setup looks like this:
System prompt: 3,100 tokens ( 1.5%)
System tools: 12,400 tokens ( 6.2%)
MCP tools: 82,000 tokens (41.0%) <-- THE PROBLEM
Conversation: 45,000 tokens (22.5%)
Reserved output: 45,000 tokens (22.5%)
Free space: 12,500 tokens ( 6.3%) <-- NOTHING LEFT
After applying the optimizations from this lesson (lean descriptions, tool consolidation, defer_loading):
System prompt: 3,100 tokens ( 1.5%)
System tools: 12,400 tokens ( 6.2%)
MCP tools: 5,700 tokens ( 2.8%) <-- FIXED
Conversation: 45,000 tokens (22.5%)
Reserved output: 45,000 tokens (22.5%)
Free space: 88,800 tokens (44.4%) <-- ROOM TO WORK
Claude Code v2.1.7+ automatically triggers MCP Tool Search when tool descriptions exceed 10% of context. If you’ve done the server-side optimization, this threshold is rarely hit. If you haven’t, Claude Code compensates by searching on demand – but at the cost of an extra round trip per tool discovery.
Optimization Checklist
Apply these in order of impact:
Trim descriptions to 1-2 sentences – the biggest single win. Tool names carry semantic weight; the description just needs to disambiguate.
Flatten nested parameter objects – scalar params cost a fraction of the tokens and are easier for the model to fill correctly.
Replace long enums with server-side validation – a plain string plus a clear validation error is cheaper than 20 enum values in every context.
Consolidate overlapping tools – one tool with a mode or provider parameter beats several near-duplicates.
Enable defer_loading for large tool sets – on the Anthropic API, mark infrequently used tools as deferred and let tool search load them on demand.
Track the protocol proposals – lazy schema loading and $ref deduplication will eventually make much of this automatic.
You have already learned the six core MCP primitives: tools, resources, prompts, sampling, elicitation, and roots. But there is a seventh mechanism that sits above all of them, delivered once during the initialisation handshake, that most MCP developers never implement – and that omission measurably degrades how well an LLM uses their server. That mechanism is server instructions.
The instructions field in the MCP InitializeResult is a plain string that the server returns to the client during the handshake. The client injects it (typically into the system prompt) so the LLM reads it before it sees any tool schemas, resource lists, or user messages. It is the server’s chance to say: “here is the user manual for my tools – which ones to call first, how they relate to each other, what the constraints are, and what you should never do.”
Server instructions flow: server returns instructions in InitializeResult, client injects them into the LLM’s system prompt before any user messages.
Why Individual Tool Descriptions Are Not Enough
Each tool has a description field that explains what it does. But when an LLM gets tools from multiple MCP servers – a GitHub server, a Slack server, a database server, a monitoring server – it needs cross-cutting knowledge that no single tool description can carry. Which tools depend on each other? What order should they be called in? What are the rate limits across the whole server? Which tool should the agent call first to orient itself?
Without instructions, the LLM has to guess these relationships from tool names and descriptions alone. For strong models like Claude Sonnet 4, the guess is often right. For weaker models, the success rate drops dramatically. Instructions close that gap.
“Because server instructions may be injected into the system prompt, they should be written with caution and diligence. No instructions are better than poorly written instructions.” – Ola Servo, MCP Core Maintainer, “Using Server Instructions”
The InitializeResult Schema
The instructions field is part of the InitializeResult that the server returns in step 2 of the handshake. It is optional, and most servers do not set it. Here is the relevant schema from the MCP 2025-11-25 specification:
// From the MCP specification (2025-11-25)
// InitializeResult is the server's response to the client's initialize request
{
"protocolVersion": "2025-11-25",
"serverInfo": {
"name": "my-server",
"version": "1.0.0",
"title": "My Server", // Human-readable display name (new in 2025-06-18)
"description": "Short description" // Optional (new in 2025-11-25)
},
"capabilities": {
"tools": { "listChanged": true },
"resources": { "subscribe": true }
},
"instructions": "Call authenticate first. Then use search_* tools for queries (prefer over list_* to avoid context overflow). Batch operations: max 10 items per call."
}
Setting Instructions in the MCP SDK
In the @modelcontextprotocol/sdk, the instructions field is set in the McpServer constructor. It is part of ServerOptions and gets passed through to the InitializeResult automatically.
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';
const server = new McpServer({
name: 'product-catalog',
version: '1.0.0',
instructions: [
'Product Catalog MCP Server.',
'',
'Start with search_products for any product query (prefer over list_products to avoid context overflow).',
'For bulk operations, use batch_update (max 50 items per call).',
'All prices are in USD cents. Divide by 100 for display.',
'Rate limit: 100 requests/minute per session.',
].join('\n'),
});
// The instructions string is now part of every InitializeResult this server sends
Using .join('\n') on an array of strings keeps long instructions readable in your source code while producing a clean multi-line string for the LLM.
What Good Instructions Cover
The MCP blog and community have converged on five categories that instructions should address (a combined example follows the list):
Cross-tool relationships – “Always call authenticate before any fetch_* tools.” This is the single most valuable thing instructions can do: tell the LLM about dependencies between tools that are invisible from their individual descriptions.
Operational patterns – “Use batch_fetch for multiple items. Check rate_limit_status before bulk operations. Results are cached for 5 minutes.” These are the patterns a human would learn after a week of using the API.
Constraints and limitations – “File operations limited to workspace directory. Rate limit: 100 requests/minute. Maximum payload: 1MB.” Hard limits the model needs to know to avoid errors.
Performance guidance – “Prefer search_* over list_* tools when possible. Process large datasets in batches of 5-10 items.” This prevents the model from making expensive calls that blow up the context window.
Entry point – “Start with get_status to understand the current state before making changes.” Tells the model which tool to call first.
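Pulled together, a compact instructions string that touches all five categories might look like this. The server, its tools, and every value below are hypothetical, purely for illustration:
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
const server = new McpServer({
  name: 'monitoring',
  version: '1.0.0',
  instructions: [
    'Monitoring MCP Server.',
    'Start with get_status to see the current alert state.',       // entry point
    'Call authenticate before any fetch_* or ack_* tools.',        // cross-tool relationships
    'Use batch_fetch for multiple metrics. Results cached 5 min.', // operational patterns
    'Rate limit: 100 requests/minute. Maximum payload: 1MB.',      // constraints
    'Prefer search_alerts over list_alerts on large deployments.', // performance guidance
  ].join('\n'),
});
Five lines, five categories, well under 100 tokens.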
What Instructions Should NOT Contain
Tool descriptions – those belong in tool.description. Duplicating them in instructions wastes tokens.
Marketing or superiority claims – “This is the best server for…” is noise the LLM cannot use.
General behavioral instructions – “Be helpful and concise” is not the server’s job. That belongs in the host’s system prompt.
A manual – instructions should be concise and actionable, not a wall of text. Every token in instructions is a token the LLM reads on every turn.
Real-World Example: GitHub MCP Server
The most well-documented real-world implementation of server instructions is GitHub’s official MCP server (PR #1091, merged September 2025). It uses a pattern worth studying: toolset-based dynamic instructions.
Instead of a single static string, the server generates instructions dynamically based on which toolsets are enabled for the current session:
// Pseudocode of GitHub MCP Server's approach (originally in Go)
// Adapted to JavaScript to match this course
function generateInstructions(enabledToolsets) {
const sections = [];
// Base instruction: always present regardless of which toolsets are active
sections.push(
'GitHub API responses can overflow context windows. Strategy: ' +
'1) Always prefer search_* tools over list_* tools when possible, ' +
'2) Process large datasets in batches of 5-10 items, ' +
'3) For summarization tasks, fetch minimal data first, then drill down.'
);
if (enabledToolsets.includes('pull_requests')) {
sections.push(
'PR review workflow: Always use create_pending_pull_request_review, ' +
'then add_comment_to_pending_review for line-specific comments, ' +
'then submit_pending_pull_request_review. Never use single-step create_and_submit.'
);
}
if (enabledToolsets.includes('issues')) {
sections.push(
'When updating issues, always fetch the current state first with get_issue ' +
'to avoid overwriting recent changes by other contributors.'
);
}
return sections.join(' ');
}
This pattern has three design decisions worth copying:
Always-present base instruction – context management guidance applies regardless of which tools are active.
Conditional sections – only relevant guidance is included. If the PR toolset is disabled, the PR workflow instruction is not sent. This keeps the token cost proportional to the active feature set.
Environment variable escape hatch – setting GITHUB_MCP_DISABLE_INSTRUCTIONS=1 suppresses all instructions for testing.
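Wiring the escape hatch is a one-liner. A sketch, reusing the generateInstructions function from the adapted pseudocode above:
// Suppress all instructions when the escape-hatch env var is set (e.g. for A/B testing)
const instructions = process.env.GITHUB_MCP_DISABLE_INSTRUCTIONS === '1'
  ? undefined
  : generateInstructions(enabledToolsets);
const server = new McpServer({ name: 'github', version: '1.0.0', instructions });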
Measured Impact: +25% Workflow Adherence
The GitHub team ran a controlled evaluation of 40 sessions in VSCode comparing model behavior with and without the PR review workflow instruction. The task: correctly follow the three-step pending review workflow instead of using a single-step shortcut.
Model             With Instructions   Without Instructions   Delta
GPT-5-Mini        80%                 20%                    +60pp
Claude Sonnet 4   90%                 100%                   -10pp
Overall           85%                 60%                    +25pp
The data tells a clear story: strong models (Claude Sonnet 4) naturally gravitate toward the correct workflow even without instructions. Weaker models (GPT-5-Mini) need explicit guidance. Since you cannot control which model your MCP client’s host is running, instructions are insurance that your server works well regardless of model capability.
Client Support and Injection Mechanism
The MCP specification does not mandate how clients use the instructions string. It says the field exists; what the client does with it is implementation-defined. In practice, most clients inject it into the LLM’s system prompt. As of late 2025, these clients support server instructions:
Claude Code – injects instructions into system prompt. Respects them consistently.
VSCode (Copilot Chat) – injects instructions. Used in the GitHub evaluation above.
Goose – injects instructions into system prompt.
Cursor – MCP support shipped in v1.6 (September 2025). Instructions handling may vary.
Because injection is not guaranteed, instructions should enhance, not replace, good tool descriptions. If a client ignores instructions, each tool should still be usable from its own description and schema alone. Instructions add the cross-cutting context that individual descriptions cannot carry.
Instructions as the Endorsed Grouping Mechanism
A common request from MCP server developers is tool grouping or namespacing – a way to tell the LLM “these five tools belong together.” The MCP specification does not have a formal grouping primitive. Instead, the endorsed mechanism is the instructions field.
“Lots of people want tool bundling / grouping / namespaces to guide servers how to use tools together. We should make instructions more obvious and have examples for how to use it.” – Felix Weinberger, MCP contributor, Python SDK Issue #1464
This means if you want to group your tools into logical sets, you do it in instructions:
const server = new McpServer({
name: 'analytics-server',
version: '2.0.0',
instructions: [
'Analytics MCP Server - two tool groups:',
'',
'QUERYING: Use run_query for SQL, get_dashboard for pre-built views,',
'export_csv for bulk data. Always run_query before export_csv.',
'',
'ADMIN: Use create_dashboard to build new views, set_alert for thresholds.',
'Admin tools require prior authentication via the OAuth flow.',
].join('\n'),
});
A Complete Server With Instructions
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { z } from 'zod';
const server = new McpServer({
name: 'customer-support',
version: '1.0.0',
instructions: [
'Customer Support MCP Server.',
'',
'Workflow: 1) lookup_customer by email or ID, 2) get_tickets for that customer,',
'3) respond_to_ticket or escalate_ticket based on severity.',
'',
'Never call respond_to_ticket without first reading the ticket via get_tickets.',
'Escalation threshold: severity >= 3 or customer tier = "enterprise".',
'Rate limit: 60 requests/minute. Batch lookups with lookup_customers (max 20).',
].join('\n'),
});
server.tool(
'lookup_customer',
'Find a customer by email address or customer ID. Returns customer profile with tier and contact info.',
{
email: z.string().email().optional().describe('Customer email address'),
customer_id: z.string().optional().describe('Customer ID (format: CUS-XXXX)'),
},
async ({ email, customer_id }) => {
if (!email && !customer_id) {
return { isError: true, content: [{ type: 'text', text: 'Provide email or customer_id.' }] };
}
const customer = await db.findCustomer({ email, customer_id });
if (!customer) {
return { isError: true, content: [{ type: 'text', text: 'Customer not found.' }] };
}
return { content: [{ type: 'text', text: JSON.stringify(customer) }] };
}
);
server.tool(
'get_tickets',
'List support tickets for a customer. Returns ticket ID, subject, severity (1-5), status, and last update.',
{
customer_id: z.string().describe('Customer ID (format: CUS-XXXX)'),
status: z.enum(['open', 'pending', 'closed']).optional().default('open')
.describe('Filter by ticket status'),
},
{
annotations: { readOnlyHint: true, openWorldHint: false },
},
async ({ customer_id, status }) => {
const tickets = await db.getTickets(customer_id, status);
return { content: [{ type: 'text', text: JSON.stringify(tickets) }] };
}
);
server.tool(
'respond_to_ticket',
'Send a response to a support ticket. The response is visible to the customer.',
{
ticket_id: z.string().describe('Ticket ID (format: TKT-XXXX)'),
message: z.string().min(1).max(5000).describe('Response message to send to the customer'),
},
{
annotations: { destructiveHint: false, readOnlyHint: false, openWorldHint: true },
},
async ({ ticket_id, message }) => {
await db.addTicketResponse(ticket_id, message);
return { content: [{ type: 'text', text: `Response sent to ${ticket_id}.` }] };
}
);
server.tool(
'escalate_ticket',
'Escalate a ticket to a human agent. Use when severity >= 3 or customer tier is enterprise.',
{
ticket_id: z.string().describe('Ticket ID (format: TKT-XXXX)'),
reason: z.string().describe('Why this ticket needs human attention'),
},
{
annotations: { destructiveHint: false, readOnlyHint: false, openWorldHint: true },
},
async ({ ticket_id, reason }) => {
await db.escalateTicket(ticket_id, reason);
return { content: [{ type: 'text', text: `Ticket ${ticket_id} escalated. Reason: ${reason}` }] };
}
);
const transport = new StdioServerTransport();
await server.connect(transport);
Notice how the instructions string tells the LLM the workflow order (lookup, then get tickets, then respond or escalate), the escalation rule (severity >= 3 or enterprise tier), and the rate limit. None of these facts belong in any single tool’s description – they are cross-cutting concerns that only instructions can carry.
Instructions and Tool Search
As MCP servers grow to dozens of tools, clients like Claude Code adopt tool search mechanisms (covered in detail in a later lesson). When tool search is active, the LLM does not see all tool schemas upfront – it sees only the instructions and a search interface. The instructions become the primary signal the model uses to decide which tools to search for.
This makes instructions even more critical for large servers: if your instructions do not mention a capability, the model may never discover the tools that implement it.
What to Check Right Now
Add instructions to your server – even a two-sentence string describing the workflow order and the main constraint is better than no instructions at all.
Keep it under 200 words – instructions are read on every LLM turn. Every word costs tokens across every interaction.
Test with a weaker model – your instructions matter most for models that cannot infer tool relationships from names alone. Test with GPT-4o-mini or a smaller model to verify the instructions actually help.
Do not duplicate tool descriptions – instructions describe relationships and constraints. Individual tool capabilities belong in tool.description.
Use the MCP Inspector – run npx @modelcontextprotocol/inspector node your-server.js and verify that the instructions appear in the InitializeResult.
This final capstone assembles everything from the course into a complete MCP platform: a registry for server discovery, an API gateway for authentication and routing, a collection of domain-specific MCP servers, and a web interface where teams can explore available tools, run agent queries, and review audit logs. When you deploy this platform, you have the infrastructure that enterprise teams need to build and manage AI-powered workflows on MCP.
The complete MCP platform: registry, gateway, domain servers, and a management web interface.
Platform Architecture Overview
Component              Purpose                                   Lesson Reference
MCP Registry           Server discovery and health tracking      Lesson 44
API Gateway            Auth (OAuth), rate limiting, routing      Lessons 31, 41
Domain MCP Servers     Business tools (CRM, docs, analytics)     Parts I-III
Multi-Provider Agent   Route queries to OpenAI/Claude/Gemini     Lessons 28-30
Audit Service          Structured logs, compliance reporting     Lesson 35
Observability Stack    Prometheus + Grafana + OpenTelemetry      Lesson 42
Management UI          Tool explorer, query interface, logs      This lesson
Every row in this table maps to a lesson you have already completed. The capstone’s job is not to teach new concepts but to show how they compose into a real system. In production, these components run as separate services that communicate over HTTP and message queues, so a failure in analytics does not bring down the gateway or registry.
Platform Bootstrap Script
// platform/bootstrap.js
// Register all MCP servers with the registry on startup
const REGISTRY_URL = process.env.REGISTRY_URL ?? 'http://localhost:4000';
const MCP_SERVERS = [
{
id: 'products',
name: 'Product Catalog Server',
description: 'Search, browse, and manage product catalog',
url: process.env.PRODUCTS_SERVER_URL,
tags: ['products', 'catalog', 'inventory'],
auth: { type: 'bearer' },
healthUrl: `${process.env.PRODUCTS_SERVER_URL}/health`,
},
{
id: 'analytics',
name: 'Analytics Server',
description: 'Business metrics, trends, and reports',
url: process.env.ANALYTICS_SERVER_URL,
tags: ['analytics', 'metrics', 'reports'],
auth: { type: 'bearer' },
healthUrl: `${process.env.ANALYTICS_SERVER_URL}/health`,
},
// ... more servers
];
async function registerAll() {
  for (const server of MCP_SERVERS) {
    const res = await fetch(`${REGISTRY_URL}/servers`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(server),
    });
    // Fail loudly: a silently unregistered server is invisible to the whole platform
    if (!res.ok) throw new Error(`Failed to register ${server.name}: HTTP ${res.status}`);
    console.log(`Registered: ${server.name}`);
  }
}
await registerAll();
Registry-driven discovery is what makes this platform extensible. When a new team wants to expose their internal API as an MCP server, they register it here and it becomes automatically available to the agent and the management UI. No code changes, no redeployment of the gateway – just a single POST to the registry endpoint.
Management API
// platform/management-api.js
// REST API for the management UI
import express from 'express';
// Helpers built in earlier lessons (module paths are illustrative):
// discovery client (Lesson 44), agent factory (Lessons 28-30), audit store (Lesson 35)
import { McpDiscoveryClient } from './discovery-client.js';
import { createAgent, getUserScope } from './agent-factory.js';
import { auditDb } from './audit-db.js';
const REGISTRY_URL = process.env.REGISTRY_URL ?? 'http://localhost:4000';
const app = express();
app.use(express.json());
// List all registered MCP servers with health
app.get('/api/platform/servers', async (req, res) => {
const response = await fetch(`${REGISTRY_URL}/status`);
res.json(await response.json());
});
// List all tools from all healthy servers
app.get('/api/platform/tools', async (req, res) => {
const discovery = new McpDiscoveryClient(REGISTRY_URL);
await discovery.connect();
const tools = await discovery.getAllTools();
res.json({ tools, count: tools.length });
});
// Execute an agent query
app.post('/api/platform/query', async (req, res) => {
  const { question, provider = 'auto', userId } = req.body;
  // Rate limit, auth check, then:
  const agent = await createAgent({ scope: getUserScope(userId), preferredProvider: provider });
  try {
    const answer = await agent.run(question);
    res.json({ answer });
  } finally {
    await agent.close(); // always release the MCP session, even if the agent throws
  }
});
// Get audit logs for a user
app.get('/api/platform/audit', async (req, res) => {
const { userId, from, to, limit = 50 } = req.query;
const logs = await auditDb.query({ userId, from, to, limit });
res.json({ logs });
});
app.listen(5000, () => console.log('Management API on :5000'));
Component interaction: the discovery client queries the registry, builds the tool set, and routes through the agent.
One risk in a distributed platform like this: if the registry goes down, no new agent sessions can discover tools. The management API’s /tools endpoint depends on a live registry connection. In production, cache the last-known server list in the gateway so it can continue serving requests even during a brief registry outage.
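A minimal sketch of that fallback cache, assuming the registry's /status endpoint used above (module and variable names are illustrative):
// registry-cache.js - serve the last-known list when the registry is briefly down
let lastKnownServers = null;
export async function getServers(registryUrl) {
  try {
    const res = await fetch(`${registryUrl}/status`);
    if (!res.ok) throw new Error(`Registry returned ${res.status}`);
    lastKnownServers = await res.json();
    return lastKnownServers;
  } catch (err) {
    if (lastKnownServers) {
      console.warn(`Registry unreachable (${err.message}), serving cached list`);
      return lastKnownServers; // stale, but better than failing every request
    }
    throw err; // no cache yet - fail loudly
  }
}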
The audit endpoint at /api/platform/audit is what compliance teams will query most frequently. It lets managers review what their team asked the AI, which tools it called, and whether any requests failed. Without this, AI assistants become a black box that security teams will rightly refuse to approve.
Eight services in a single Compose file. This is a realistic local development setup, but for production you would break these into separate deployment units – the gateway and domain servers behind a load balancer, Prometheus and Grafana in a dedicated monitoring namespace, and the registry behind its own high-availability cluster.
What You Have Built
Across all 53 lessons and 5 capstone projects you have built:
MCP servers using every primitive: tools, resources, prompts, sampling, elicitation, roots
Clients for all three major LLM providers: OpenAI, Claude, and Gemini
Production infrastructure: Docker, Kubernetes, Nginx, Redis
A complete enterprise platform: registry, gateway, domain servers, management UI
MCP is the connective tissue of the AI application stack. You now know it from protocol fundamentals to enterprise deployment. Go build something important.
This capstone builds the most complete MCP application in the course: an enterprise AI assistant with OAuth 2.0 authentication, RBAC tool access control, full audit logging, rate limiting, and a multi-provider backend. It brings together patterns from every major part of the course into a single deployable system. Deploy it and you have a production-ready enterprise AI assistant that your security team can audit and your compliance team can sign off on.
Request lifecycle: every request goes through 6 stages before the agent runs.
The six-stage pipeline (authenticate, rate limit, resolve roles, validate input, run agent, audit) is the same request lifecycle used by production API gateways at companies like Stripe and Shopify. Each stage can reject the request independently, and the audit log captures the outcome regardless of success or failure. This is what compliance teams actually review during security audits.
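The full implementation spans several modules, but the spine of the pipeline reduces to an Express middleware chain. A sketch – every middleware and helper name below is a placeholder for a component built in earlier lessons:
// Six-stage request pipeline as middleware (illustrative skeleton)
app.post('/assistant/query',
  authenticate,   // 1. Verify the OAuth bearer token
  rateLimit,      // 2. Reject if the user is over quota (Redis-backed)
  resolveRoles,   // 3. Map the authenticated user to RBAC scopes
  validateInput,  // 4. Schema-check the request body
  async (req, res) => {
    const agent = await createAgent({ scope: req.scopes });
    try {
      const answer = await agent.run(req.body.question);      // 5. Run the agent
      await audit.log({ user: req.user.id, outcome: 'ok' });  // 6. Audit, success or not
      res.json({ answer });
    } catch (err) {
      await audit.log({ user: req.user.id, outcome: 'error', error: err.message });
      res.status(500).json({ error: 'Agent run failed' });
    } finally {
      await agent.close(); // fresh agent per request, always closed
    }
  }
);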
Notice that the agent is created fresh per request and closed in the finally block. This prevents one user’s MCP session state from leaking into another user’s query. It costs a bit more in connection overhead, but the isolation guarantee is worth it for a multi-tenant system.
A misconfigured RBAC map is one of the most dangerous bugs in this system. If you accidentally give the employee role actions:write scope, every employee can trigger alerts and modify tickets through the AI assistant. Always test your scope mapping with unit tests, and consider adding a “dry run” mode that logs what a user would be allowed to do without actually executing anything.
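A sketch of the scope map plus the dry-run mode suggested above (role and scope names are illustrative):
// rbac.js - role -> allowed scopes
const ROLE_SCOPES = {
  employee: ['data:read'],
  support:  ['data:read', 'tickets:write'],
  admin:    ['data:read', 'tickets:write', 'actions:write'],
};
export function canCall(role, requiredScope) {
  return (ROLE_SCOPES[role] ?? []).includes(requiredScope);
}
// Dry-run mode: log what a role WOULD be allowed to do, without executing anything
export function dryRun(role, tools) {
  for (const tool of tools) {
    console.log(`${role} -> ${tool.name}: ${canCall(role, tool.scope) ? 'ALLOWED' : 'denied'}`);
  }
}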
The multi-provider router gives you vendor resilience. If OpenAI has an outage, you can fall back to Claude or Gemini without changing any application code. In practice, teams also use this pattern for cost optimization – routing simple queries to cheaper models and complex analytical questions to more capable ones.
The Docker Compose file gives you a single docker compose up to launch the entire stack locally. Redis handles both rate limiting state and session caching. For production, you would swap the single Redis container for a managed service (like AWS ElastiCache or GCP Memorystore) and add TLS termination in front of the gateway.
Real-world AI assistants need to integrate many APIs: a CRM for customer data, a ticketing system for support requests, a payment processor for billing status, a calendar for scheduling. Each of these becomes an MCP server, and the multi-provider abstraction layer from Lesson 29 routes queries to the right provider. This capstone builds a multi-API integration hub that unifies five real-world APIs behind a single MCP interface, with tool routing, error handling, and a unified context window.
Five MCP servers, one agent: the hub aggregates tools from all servers and routes calls automatically.
The key architectural decision here is one agent, many servers. Each API gets its own MCP server process, which means they are isolated – a crash in the payments server does not take down the CRM. It also means you can develop, test, and deploy each server independently, exactly like microservices.
The Multi-Server Agent
// agent/hub-agent.js
import OpenAI from 'openai';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
const SERVER_CONFIGS = [
{ id: 'crm', command: 'node', args: ['./servers/crm-server.js'] },
{ id: 'tickets', command: 'node', args: ['./servers/tickets-server.js'] },
{ id: 'payments', command: 'node', args: ['./servers/payments-server.js'] },
{ id: 'calendar', command: 'node', args: ['./servers/calendar-server.js'] },
{ id: 'analytics', command: 'node', args: ['./servers/analytics-server.js'] },
];
export async function createHubAgent() {
const openai = new OpenAI();
const connections = new Map();
const allTools = [];
// Connect to all servers in parallel
await Promise.all(SERVER_CONFIGS.map(async config => {
const transport = new StdioClientTransport({ command: config.command, args: config.args, env: process.env });
const client = new Client({ name: 'hub-agent', version: '1.0.0' });
await client.connect(transport);
connections.set(config.id, client);
const { tools } = await client.listTools();
for (const tool of tools) {
allTools.push({
serverId: config.id,
tool,
openaiFormat: {
type: 'function',
function: { name: tool.name, description: `[${config.id}] ${tool.description}`, parameters: tool.inputSchema, strict: true },
},
});
}
}));
console.log(`Hub connected to ${connections.size} servers, ${allTools.length} tools total`);
// Find which server owns a tool
const toolIndex = new Map(allTools.map(t => [t.tool.name, t]));
return {
async query(userMessage) {
const messages = [
{
role: 'system',
content: `You are a comprehensive business assistant with access to CRM, ticketing, payments, calendar, and analytics systems.
Tools are prefixed with their system: [crm], [tickets], [payments], [calendar], [analytics].
When answering questions, use tools from multiple systems as needed to give a complete answer.
Always check multiple related systems when investigating customer issues.`,
},
{ role: 'user', content: userMessage },
];
const openaiTools = allTools.map(t => t.openaiFormat);
let turns = 0;
while (true) {
const response = await openai.chat.completions.create({
model: 'gpt-4o', messages, tools: openaiTools, tool_choice: 'auto',
parallel_tool_calls: true,
});
const choice = response.choices[0];
const msg = choice.message;
messages.push(msg);
// finish_reason lives on the choice, not on the message
if (choice.finish_reason !== 'tool_calls') return msg.content;
if (++turns > 15) throw new Error('Max turns exceeded');
const results = await Promise.all(msg.tool_calls.map(async tc => {
const entry = toolIndex.get(tc.function.name);
if (!entry) {
return { role: 'tool', tool_call_id: tc.id, content: `Tool '${tc.function.name}' not found` };
}
const client = connections.get(entry.serverId);
const args = JSON.parse(tc.function.arguments);
const result = await client.callTool({ name: tc.function.name, arguments: args });
const text = result.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
return { role: 'tool', tool_call_id: tc.id, content: text };
}));
messages.push(...results);
}
},
async close() {
await Promise.all([...connections.values()].map(c => c.close()));
},
};
}
Parallel tool calling: GPT-4o queries CRM, tickets, and payments simultaneously for a complete customer view.
The parallel_tool_calls: true flag is critical for performance. Without it, the model would call CRM, wait for the response, then call tickets, wait again, then call payments. With parallel calls, all three fire simultaneously and the total latency is the slowest server, not the sum of all servers. For customer-facing support bots, this can cut response time from 6 seconds to 2.
One thing that can go wrong here: tool name collisions. If both the CRM server and the tickets server expose a tool called search, the toolIndex map will silently overwrite one with the other. The description prefix ([crm], [tickets]) helps the model distinguish them, but the routing map needs unique names. Namespace your tool names (like crm_search, tickets_search) to avoid this.
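A collision-safe version of the aggregation step namespaces every tool with its server id up front, so the model and the router agree on one unique name. A sketch – serverToolLists stands in for the per-server results gathered during connection:
// Namespace at aggregation time to prevent silent Map overwrites
const toolIndex = new Map();
for (const { serverId, tools } of serverToolLists) {
  for (const tool of tools) {
    const namespaced = `${serverId}_${tool.name}`; // e.g. crm_search, tickets_search
    if (toolIndex.has(namespaced)) {
      throw new Error(`Tool name collision even after namespacing: ${namespaced}`);
    }
    toolIndex.set(namespaced, { serverId, originalName: tool.name, tool });
  }
}
// At execution time, map the namespaced name back to the owning server:
// const { serverId, originalName } = toolIndex.get(tc.function.name);
// await connections.get(serverId).callTool({ name: originalName, arguments: args });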
const agent = await createHubAgent();
const answer = await agent.query(
'Customer john.smith@acme.com says their subscription renewal failed last week. ' +
'What is their account status, do they have any open support tickets, ' +
'and what does their payment history look like?'
);
// Agent will call: search_customers, get_subscription, list_tickets, get_payment_history
// in parallel, then synthesize a complete answer
console.log(answer);
await agent.close();
This hub pattern is how enterprise support platforms like Zendesk and Intercom are building their AI assistants. A single user question like “why was this customer charged twice?” requires data from billing, CRM, and ticketing systems simultaneously. Without MCP’s standardized tool interface, you would need custom integration code for every API combination.
This capstone builds a filesystem agent powered by Claude 3.7 Sonnet. The agent can read files, search codebases, analyze code structure, and refactor files under user supervision. It applies the security patterns from Part VIII: roots for filesystem boundaries, tool safety for path validation, and confirmation-based elicitation for destructive file writes. The result is a safe, auditable codebase assistant that you can trust with your actual project files.
Filesystem agent: Claude plans file operations, MCP server executes them within roots-defined boundaries.
Four filesystem tools with layered safety: roots validation, path sanitization, and elicitation for writes.
The layered validation here is worth studying. The Zod schema rejects path traversal (..) at the input level, validatePath enforces the roots boundary, and the write_file tool adds elicitation as a final gate. Each layer catches different attack vectors: malicious input, logic bugs, and unintended overwrites. Removing any single layer would leave a real gap.
If no roots are configured, every operation fails immediately. This is a deliberate fail-closed design. In production, you never want a misconfiguration to silently grant full filesystem access – it is far safer to break loudly than to expose /etc/passwd because someone forgot to set the project root.
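A minimal sketch of that middle layer – the roots boundary check – under the fail-closed design just described. The function shape is illustrative; the real server from Part VIII does more:
import path from 'node:path';
// Reject everything if no roots are configured; otherwise require the resolved
// path to sit inside one of the allowed root directories.
export function validatePath(requested, roots) {
  if (!roots || roots.length === 0) {
    throw new Error('No roots configured - failing closed');
  }
  const resolved = path.resolve(requested);
  const allowed = roots.some(
    (root) => resolved === root || resolved.startsWith(root + path.sep)
  );
  if (!allowed) throw new Error(`Path outside allowed roots: ${resolved}`);
  return resolved;
}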
The Claude Filesystem Agent
// agent/fs-agent.js
import Anthropic from '@anthropic-ai/sdk';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
const anthropic = new Anthropic();
export async function createFilesystemAgent(projectRoot) {
const transport = new StdioClientTransport({
command: 'node',
args: ['./servers/fs-server.js'],
env: { ...process.env },
});
const mcp = new Client(
  { name: 'fs-agent', version: '1.0.0' },
  { capabilities: { roots: { listChanged: true } } } // Declare roots support (second, options argument)
);
await mcp.connect(transport);
// Set the allowed root to the project directory
// (roots are set by the client, enforced by the server)
console.log(`Filesystem agent initialized. Root: ${projectRoot}`);
const { tools: mcpTools } = await mcp.listTools();
const tools = mcpTools.map(t => ({
name: t.name, description: t.description, input_schema: t.inputSchema,
}));
return {
async ask(question) {
const messages = [{ role: 'user', content: question }];
let turns = 0;
while (true) {
const response = await anthropic.messages.create({
model: 'claude-3-7-sonnet-20250219',
max_tokens: 4096,
system: `You are a codebase assistant. The project root is ${projectRoot}.
Use read_file to examine files, list_directory to explore structure, search_files to find code.
Only use write_file when explicitly asked to modify files.`,
messages,
tools,
});
messages.push({ role: 'assistant', content: response.content });
if (response.stop_reason !== 'tool_use') {
return response.content.filter(b => b.type === 'text').map(b => b.text).join('');
}
if (++turns > 15) throw new Error('Max turns exceeded');
const toolResults = await Promise.all(
response.content.filter(b => b.type === 'tool_use').map(async block => {
const result = await mcp.callTool({ name: block.name, arguments: block.input });
const text = result.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
return { type: 'tool_result', tool_use_id: block.id, content: text };
})
);
messages.push({ role: 'user', content: toolResults });
}
},
async close() { await mcp.close(); },
};
}
This agent pattern is the same one powering tools like Cursor, Windsurf, and Claude Code. A model reads your files, understands the structure, and proposes edits – but the human confirms destructive writes. The elicitation step in write_file is what separates a helpful assistant from a dangerous one.
One subtle risk: the search_files tool returns up to 50 matches, and large codebases could easily produce hundreds. If the model receives all 50 results in a single tool response, that burns a significant chunk of the context window. Consider adding pagination or relevance ranking if you deploy this against a large repository.
What to Extend
Add a run_tests tool that executes node --test and returns the output – the agent can then read failing test files and suggest fixes.
This capstone project builds a complete, production-ready PostgreSQL query agent using OpenAI GPT-4o and MCP. By the end you will have a fully functional system where a user can ask questions in natural language and the agent translates them to safe, parameterized SQL queries, executes them against a real PostgreSQL database, formats the results, and explains its reasoning. The project incorporates lessons from throughout the course: schema validation, tool safety, audit logging, retry logic, and graceful shutdown.
The database query agent: user asks a question, GPT-4o plans SQL queries, MCP tools execute them safely.
Four tools: schema discovery (list, describe), safe query execution, and row counting for query planning.
In practice, this four-tool design is intentional: it mirrors how a careful human analyst works. Rather than handing the model a single “run any SQL” tool, you force it through a discovery workflow – list tables, inspect columns, then query. This staged approach dramatically reduces hallucinated column names and malformed joins because the model sees the real schema before writing SQL.
Watch the safety check in execute_query closely. The keyword blocklist approach is simple but brittle – a query like SELECT * FROM updates would be rejected because “UPDATE” appears in the table name. In a production system, you would use a proper SQL parser or run queries through a read-only database user instead of string matching.
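Enforcing read-only at the database layer takes a few lines with node-postgres – a sketch; pair it with a dedicated Postgres role that only has SELECT grants:
import pg from 'pg';
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
// Every connection from this pool rejects writes at the database level,
// regardless of what the query text looks like.
pool.on('connect', (client) => {
  client.query('SET default_transaction_read_only = on');
});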
The OpenAI Query Agent
// agent/query-agent.js
import OpenAI from 'openai';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
const openai = new OpenAI();
export async function createQueryAgent() {
const transport = new StdioClientTransport({
command: 'node',
args: ['./servers/db-server.js'],
env: { ...process.env },
});
const mcp = new Client({ name: 'query-agent', version: '1.0.0' });
await mcp.connect(transport);
const { tools: mcpTools } = await mcp.listTools();
const tools = mcpTools.map(t => ({
type: 'function',
function: { name: t.name, description: t.description, parameters: t.inputSchema, strict: true },
}));
const SYSTEM_PROMPT = `You are a precise database analyst.
You have access to a PostgreSQL database. To answer questions:
1. First call list_tables to see available tables
2. Call describe_table for tables relevant to the question
3. Plan a safe SELECT query (use parameters for any user values)
4. Call execute_query with the query and parameters
5. Present results clearly with a brief interpretation
Always use parameterized queries. Never build SQL by string concatenation.
If a question cannot be answered with a SELECT, say so clearly.`;
return {
async query(userQuestion) {
const messages = [
{ role: 'system', content: SYSTEM_PROMPT },
{ role: 'user', content: userQuestion },
];
let turns = 0;
while (true) {
const response = await openai.chat.completions.create({
model: 'gpt-4o', messages, tools, tool_choice: 'auto',
});
const choice = response.choices[0];
const msg = choice.message;
messages.push(msg);
// finish_reason lives on the choice, not on the message
if (choice.finish_reason !== 'tool_calls') {
  return msg.content;
}
if (++turns > 10) throw new Error('Max turns exceeded');
const results = await Promise.all(msg.tool_calls.map(async tc => {
const args = JSON.parse(tc.function.arguments);
const result = await mcp.callTool({ name: tc.function.name, arguments: args });
const text = result.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
return { role: 'tool', tool_call_id: tc.id, content: text };
}));
messages.push(...results);
}
},
async close() { await mcp.close(); },
};
}
The agent loop here follows the same pattern you have seen throughout the course, but notice the turn cap of 10. Without it, a confusing question could cause the model to loop indefinitely – calling tools, misinterpreting results, and calling more tools. In a billing-sensitive environment, a runaway loop like that translates directly into unexpected API costs.
Teams commonly deploy this exact pattern as an internal analytics bot on Slack or Teams. A support engineer asks “how many orders shipped last week from warehouse 3?” and gets an answer in seconds, without needing SQL skills or database access credentials. The read-only constraint means the bot is safe to hand to non-technical staff.
What to Extend
Add the audit logging middleware from Lesson 35 to log every execute_query call with the SQL, user, and result count.
Add a sample_rows tool that returns 3 rows from any table – helps the model understand data format before writing queries.
Connect it to your real production database with a read-only service account.
The MCP SDK ships with two built-in transports: stdio and Streamable HTTP. These cover the vast majority of use cases. But sometimes you need something different: an in-process transport for testing, a WebSocket transport for browser environments, an IPC transport for Electron apps, or a transport that encrypts the JSON-RPC stream at the application layer. The SDK’s transport interface is deliberately minimal, making it straightforward to implement your own. This lesson covers the interface, two reference implementations, and practical extension points.
The Transport interface is three methods: start, send, and close. Any communication channel can become an MCP transport.
The Transport Interface
// The MCP SDK Transport interface (TypeScript definition for reference)
// interface Transport {
// start(): Promise<void>;
// send(message: JSONRPCMessage): Promise<void>;
// close(): Promise<void>;
// onmessage?: (message: JSONRPCMessage) => void;
// onerror?: (error: Error) => void;
// onclose?: () => void;
// }
// In JavaScript, implement the same shape:
class CustomTransport {
onmessage = null; // Called when a message is received
onerror = null; // Called on transport errors
onclose = null; // Called when the transport closes
async start() {
// Initialize the underlying communication channel
}
async send(message) {
// Send a JSONRPCMessage object
}
async close() {
// Clean up the channel
}
}
The interface is intentionally minimal: three async methods and three event callbacks. This simplicity is the point. Any communication channel that can send and receive JSON objects – WebSockets, Unix domain sockets, shared memory, even a pair of browser MessageChannels – can become an MCP transport by implementing these six members.
In-Process Transport for Testing
An in-process transport connects a client directly to a server in the same Node.js process. Essential for integration tests without spawning subprocesses:
// in-process-transport.js
export function createInProcessTransport() {
let clientTransport, serverTransport;
clientTransport = {
onmessage: null, onerror: null, onclose: null,
async start() {},
async send(msg) {
// Route to server
if (serverTransport.onmessage) serverTransport.onmessage(msg);
},
async close() {
if (clientTransport.onclose) clientTransport.onclose();
if (serverTransport.onclose) serverTransport.onclose();
},
};
serverTransport = {
onmessage: null, onerror: null, onclose: null,
async start() {},
async send(msg) {
// Route to client
if (clientTransport.onmessage) clientTransport.onmessage(msg);
},
async close() {
if (clientTransport.onclose) clientTransport.onclose();
if (serverTransport.onclose) serverTransport.onclose();
},
};
return { clientTransport, serverTransport };
}
// Usage in tests:
import { test } from 'node:test';
import assert from 'node:assert';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { createInProcessTransport } from './in-process-transport.js';
test('in-process round trip', async (t) => {
const { clientTransport, serverTransport } = createInProcessTransport();
const server = buildServer(); // assumed helper that assembles an McpServer with test tools
const client = new Client({ name: 'test', version: '1.0.0' });
await server.connect(serverTransport);
await client.connect(clientTransport);
const { tools } = await client.listTools();
assert.ok(tools.length > 0);
await client.close();
});
This in-process transport eliminates the main pain point of MCP integration tests: subprocess management. No ports to allocate, no processes to spawn and kill, no race conditions between server startup and client connection. Tests using this pattern typically run 10-50x faster than their subprocess equivalents.
In-process transport: no network, no subprocess, instant round trip – ideal for unit and integration testing.
WebSocket Transport
npm install ws
// websocket-transport.js - client side
import WebSocket from 'ws';
export class WebSocketClientTransport {
#url;
#ws = null;
onmessage = null;
onerror = null;
onclose = null;
constructor(url) {
this.#url = url;
}
async start() {
return new Promise((resolve, reject) => {
this.#ws = new WebSocket(this.#url);
this.#ws.once('open', resolve);
this.#ws.once('error', reject);
this.#ws.on('message', (data) => {
try {
const msg = JSON.parse(data.toString());
if (this.onmessage) this.onmessage(msg);
} catch (err) {
if (this.onerror) this.onerror(err);
}
});
this.#ws.on('close', () => {
if (this.onclose) this.onclose();
});
this.#ws.on('error', (err) => {
if (this.onerror) this.onerror(err);
});
});
}
async send(message) {
this.#ws.send(JSON.stringify(message));
}
async close() {
this.#ws?.close();
}
}
// WebSocket server transport
export class WebSocketServerTransport {
#socket;
onmessage = null;
onerror = null;
onclose = null;
constructor(socket) {
this.#socket = socket;
socket.on('message', (data) => {
try {
const msg = JSON.parse(data.toString());
if (this.onmessage) this.onmessage(msg);
} catch (err) {
if (this.onerror) this.onerror(err);
}
});
socket.on('close', () => {
if (this.onclose) this.onclose();
});
}
async start() {}
async send(message) {
this.#socket.send(JSON.stringify(message));
}
async close() {
this.#socket.close();
}
}
// Server side: wrap ws.WebSocketServer
import { WebSocketServer } from 'ws';
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
const wss = new WebSocketServer({ port: 9000 });
wss.on('connection', async (socket) => {
const transport = new WebSocketServerTransport(socket);
const server = buildMcpServer();
await server.connect(transport);
});
WebSocket transport is the natural choice when your MCP client runs in a browser. Unlike Streamable HTTP, which requires the client to open new connections for each request, a WebSocket keeps a single persistent bidirectional channel open. The trade-off is that WebSocket connections are harder to load-balance (no standard sticky-session mechanism) and are not part of the official MCP spec, so you take on compatibility risk.
Protocol Extensions: Custom Methods
Beyond custom transports, MCP’s JSON-RPC foundation lets you add entirely new methods outside the spec. Prefixing them with your company namespace (like com.mycompany/) avoids collisions with future spec additions. This is useful for operational tooling – metrics, health checks, debug endpoints – that your internal clients need but that do not belong in the standard tool/resource model.
// MCP allows custom methods beyond the spec - they are prefixed with your namespace
// Use this for proprietary extensions that are specific to your deployment
import { z } from 'zod';
// Server side: handle a custom method
// The SDK's setRequestHandler expects a Zod schema whose method field is a literal
const GetServerMetricsRequestSchema = z.object({
method: z.literal('com.mycompany/getServerMetrics'),
params: z.object({}).passthrough().optional(),
});
server.server.setRequestHandler(GetServerMetricsRequestSchema, async () => {
return {
uptime: process.uptime(),
activeSessions: sessionStore.size,
memoryMB: Math.round(process.memoryUsage().heapUsed / 1024 / 1024),
};
});
// Client side: call a custom method, validating the result with a permissive schema
const MetricsResultSchema = z.object({}).passthrough();
const metrics = await client.request(
{ method: 'com.mycompany/getServerMetrics', params: {} },
MetricsResultSchema
);
One thing to watch out for with custom methods: they are invisible to standard MCP clients. If you add com.mycompany/getServerMetrics, only clients you control will know it exists. Standard MCP clients will not discover or call these methods via listTools, since they are not tools. Use them for internal operational purposes, not for functionality you expect third-party clients to use.
The extensions Capability Field
New in Draft – This feature is in the Draft spec and may be finalised in a future revision.
The Draft specification adds an extensions field to both ClientCapabilities and ServerCapabilities. This provides a standardised place to advertise optional protocol extensions beyond the core spec, replacing the ad-hoc approach of custom methods and namespaced capabilities.
// Server declaring support for a custom extension during initialization
{
capabilities: {
tools: {},
resources: {},
extensions: {
'com.mycompany/streaming-progress': {
version: '1.0.0',
},
'com.mycompany/team-collaboration': {
version: '2.1.0',
},
},
},
}
// Client checking for extension support
const serverCaps = client.getServerCapabilities();
if (serverCaps?.extensions?.['com.mycompany/streaming-progress']) {
// Enable the streaming progress UI
}
The extensions field gives custom methods a discoverable surface. Instead of blindly calling com.mycompany/getServerMetrics and hoping it exists, a client can check capabilities.extensions during initialisation and adapt its behaviour. Namespace your extensions with a reverse-domain prefix (like Java packages) to avoid collisions with future spec additions or other vendors.
What to Build Next
Replace subprocess spawning in your integration tests with the in-process transport. Measure the test speedup.
If you have a browser-based MCP client, implement the WebSocket transport and test it against your existing MCP server with a WebSocket adapter.