Lesson 28 of 55: Building a Production Gemini Client for MCP Agents

The previous three Gemini lessons gave you the building blocks: function calling, multimodal inputs, and Vertex AI deployment. This lesson assembles them into a production-grade client library you can drop into any Node.js MCP application. It covers the patterns that only show up after your agent has processed its first ten thousand requests: token budget management, graceful quota handling, automatic retry with jitter, structured response parsing, and observability hooks.

A production Gemini MCP client needs four layers: retry logic, budget enforcement, circuit breaking, and telemetry.

The Base Client Class

// gemini-mcp-client.js
import { GoogleGenerativeAI } from '@google/generative-ai';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

export class GeminiMcpClient {
  #genai;
  #mcp;
  #model;
  #geminiTools;
  #config;

  constructor(config = {}) {
    this.#config = {
      model: config.model ?? 'gemini-2.0-flash',
      maxTokens: config.maxTokens ?? 8192,
      maxTurns: config.maxTurns ?? 10,
      maxRetries: config.maxRetries ?? 3,
      tokenBudget: config.tokenBudget ?? 100_000,
      onTokenUsage: config.onTokenUsage ?? null,
      onToolCall: config.onToolCall ?? null,
      ...config,
    };
    this.#genai = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
    this.#mcp = new Client({ name: 'gemini-prod-host', version: '1.0.0' });
  }

  async connect(serverCommand, serverArgs = []) {
    const transport = new StdioClientTransport({ command: serverCommand, args: serverArgs });
    await this.#mcp.connect(transport);
    const { tools } = await this.#mcp.listTools();
    this.#geminiTools = [{
      functionDeclarations: tools.map(t => ({
        name: t.name,
        description: t.description,
        parameters: t.inputSchema,
      })),
    }];
    this.#model = this.#genai.getGenerativeModel({
      model: this.#config.model,
      tools: this.#geminiTools,
      generationConfig: { maxOutputTokens: this.#config.maxTokens },
    });
  }

  async run(userMessage) {
    const chat = this.#model.startChat();
    return this.#runLoop(chat, userMessage);
  }

  async close() {
    await this.#mcp.close();
  }
}

Wrapping the Gemini SDK and the MCP client in one class gives you a single place to enforce every production concern: retries, budgets, timeouts, and telemetry. Without that seam, those concerns leak across your entire codebase and become impossible to test or change consistently.

Retry Logic with Exponential Backoff and Jitter

  // Inside GeminiMcpClient class

  async #sendWithRetry(chat, content) {
    const { maxRetries } = this.#config;
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        return await chat.sendMessage(content);
      } catch (err) {
        const isQuota = err.status === 429 || err.message?.includes('RESOURCE_EXHAUSTED');
        const isServer = err.status >= 500;
        if ((!isQuota && !isServer) || attempt === maxRetries) throw err;

        const base = isQuota ? 5000 : 1000;
        const jitter = Math.random() * 1000;
        const delay = Math.min(base * Math.pow(2, attempt - 1) + jitter, 60_000);
        console.error(`[gemini] Attempt ${attempt} failed (${err.status ?? err.message}), retrying in ${Math.round(delay)}ms`);
        await new Promise(r => setTimeout(r, delay));
      }
    }
  }
Quota errors (429) get a 5-second base delay. Server errors (5xx) use a 1-second base. Jitter prevents thundering herd.

Retry logic is where most production Gemini integrations first break. Without jitter, multiple agent instances hitting quota limits at the same time will retry in lockstep, creating a thundering herd that keeps triggering 429 errors. The randomized delay breaks this cycle.

Token Budget Enforcement

  // Inside GeminiMcpClient class
  #totalTokensUsed = 0;

  #checkBudget(usage) {
    if (!usage) return;
    const total = usage.totalTokenCount ?? 0;
    this.#totalTokensUsed += total;
    if (this.#config.onTokenUsage) {
      this.#config.onTokenUsage({ total, cumulative: this.#totalTokensUsed });
    }
    if (this.#totalTokensUsed > this.#config.tokenBudget) {
      throw new Error(`Token budget exceeded: ${this.#totalTokensUsed} > ${this.#config.tokenBudget}`);
    }
  }

Token budgets prevent a common production failure: an agent enters a verbose loop, generates thousands of tokens per turn, and blows through your daily budget in minutes. A hard ceiling on cumulative usage catches this early, before a single runaway conversation drains your account.

The Full Run Loop

  async #runLoop(chat, userMessage) {
    let response = await this.#sendWithRetry(chat, userMessage);
    this.#checkBudget(response.response.usageMetadata);
    let candidate = response.response.candidates[0];
    let turns = 0;

    // Optional chaining guards against responses blocked before any content was produced
    while (candidate.content?.parts?.some(p => p.functionCall)) {
      if (++turns > this.#config.maxTurns) {
        throw new Error(`Max turns exceeded (${this.#config.maxTurns})`);
      }

      const calls = candidate.content.parts.filter(p => p.functionCall);
      const results = await Promise.all(calls.map(part => this.#executeToolCall(part.functionCall)));

      response = await this.#sendWithRetry(chat, results);
      this.#checkBudget(response.response.usageMetadata);
      candidate = response.response.candidates[0];
    }

    if (candidate.finishReason === 'SAFETY') {
      throw new Error('Response blocked by safety filters');
    }

    return candidate.content.parts.filter(p => p.text).map(p => p.text).join('');
  }

  async #executeToolCall(fc) {
    const start = Date.now();
    if (this.#config.onToolCall) {
      this.#config.onToolCall({ name: fc.name, args: fc.args, phase: 'start' });
    }
    try {
      const result = await this.#mcp.callTool({ name: fc.name, arguments: fc.args });
      const text = result.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
      if (this.#config.onToolCall) {
        this.#config.onToolCall({ name: fc.name, durationMs: Date.now() - start, phase: 'done' });
      }
      return { functionResponse: { name: fc.name, response: { result: text } } };
    } catch (err) {
      if (this.#config.onToolCall) {
        this.#config.onToolCall({ name: fc.name, error: err.message, phase: 'error' });
      }
      return { functionResponse: { name: fc.name, response: { error: err.message } } };
    }
  }

Using the Production Client

import { GeminiMcpClient } from './gemini-mcp-client.js';

const client = new GeminiMcpClient({
  model: 'gemini-2.0-flash',
  tokenBudget: 50_000,
  maxTurns: 8,
  onTokenUsage: ({ total, cumulative }) => {
    console.error(`[tokens] +${total} total=${cumulative}`);
  },
  onToolCall: ({ name, durationMs, phase, error }) => {
    if (phase === 'done') console.error(`[tool:${name}] ${durationMs}ms`);
    if (phase === 'error') console.error(`[tool:${name}] ERROR: ${error}`);
  },
});

await client.connect('node', ['./servers/analytics-server.js']);

const answer = await client.run('What were the top 5 products by revenue last month?');
console.log(answer);

await client.close();

The callback hooks (onTokenUsage, onToolCall) are the foundation of your observability stack. In production, you would pipe these events to a metrics service like Datadog or Cloud Monitoring rather than console.error, giving you dashboards for tool latency, token burn rate, and error frequency.
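
As a sketch of what that wiring might look like, here is a hypothetical in-memory aggregator for the two hooks. The class name and its shape are illustrative, not part of any SDK; in production you would flush its snapshot to your metrics backend on an interval instead of logging every event.

```javascript
// Hypothetical in-memory aggregator for the onTokenUsage / onToolCall hooks.
// Flush snapshot() to Datadog, Cloud Monitoring, etc. on a timer.
class UsageMetrics {
  constructor() {
    this.tokensTotal = 0;
    this.toolCalls = new Map(); // name -> { count, errors, totalMs }
  }

  onTokenUsage({ total }) {
    this.tokensTotal += total;
  }

  onToolCall({ name, durationMs = 0, phase }) {
    const entry = this.toolCalls.get(name) ?? { count: 0, errors: 0, totalMs: 0 };
    if (phase === 'done') {
      entry.count += 1;
      entry.totalMs += durationMs;
    } else if (phase === 'error') {
      entry.errors += 1;
    }
    this.toolCalls.set(name, entry);
  }

  snapshot() {
    return {
      tokensTotal: this.tokensTotal,
      tools: Object.fromEntries(
        [...this.toolCalls].map(([name, e]) => [
          name,
          { ...e, avgMs: e.count ? e.totalMs / e.count : 0 },
        ])
      ),
    };
  }
}
```

Pass `metrics.onTokenUsage.bind(metrics)` and `metrics.onToolCall.bind(metrics)` into the client config, and the client code above never needs to know which metrics backend you chose.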

Streaming Responses for Long Outputs

  // Add streaming support to the client. Note: this path streams text only;
  // it does not run the tool calling loop, so use run() for tool-enabled turns.
  async runStream(userMessage, onChunk) {
    const chat = this.#model.startChat();
    const stream = await chat.sendMessageStream(userMessage);

    for await (const chunk of stream.stream) {
      const text = chunk.candidates?.[0]?.content?.parts
        ?.filter(p => p.text)
        ?.map(p => p.text)
        ?.join('') ?? '';
      if (text && onChunk) onChunk(text);
    }

    const final = await stream.response;
    this.#checkBudget(final.usageMetadata);
    return final.candidates[0].content.parts.filter(p => p.text).map(p => p.text).join('');
  }

Monitoring Quota Usage

// Track requests-per-minute to avoid hitting quota limits proactively
class RateLimiter {
  #requests = [];
  #windowMs;
  #maxPerWindow;

  constructor(maxPerMinute = 60) {
    this.#windowMs = 60_000;
    this.#maxPerWindow = maxPerMinute;
  }

  async throttle() {
    const now = Date.now();
    this.#requests = this.#requests.filter(t => t > now - this.#windowMs);
    if (this.#requests.length >= this.#maxPerWindow) {
      const oldest = this.#requests[0];
      const wait = this.#windowMs - (now - oldest) + 100;
      await new Promise(r => setTimeout(r, wait));
    }
    this.#requests.push(Date.now());
  }
}

const limiter = new RateLimiter(55);  // 55 RPM leaves headroom
await limiter.throttle();
const answer = await client.run(userMessage);

Gemini vs OpenAI vs Claude: Production Comparison

| Aspect | Gemini 2.0 Flash | GPT-4o | Claude 3.5 Sonnet |
| --- | --- | --- | --- |
| Parallel tool calls | Yes, aggressive | Yes, optional | Limited (beta) |
| Context window | 1M tokens (Flash/Pro) | 128K tokens | 200K tokens |
| Multimodal in same call | Yes (text, image, PDF, audio) | Yes (text, image) | Yes (text, image) |
| Prompt caching | Context caching (Vertex) | Automatic (>1K tokens) | Explicit cache_control |
| Schema conversion from MCP | Pass through (JSON Schema) | Wrap in function object | Pass through (JSON Schema) |
| Stateful session object | Chat object | Manual messages array or Responses API | Manual messages array |

This comparison is a snapshot in time. Provider capabilities and pricing shift quarterly. The production-safe approach is to log your actual usage data – tokens, latency, error rates, cost per task type – and revisit your model choices every few months based on real numbers rather than marketing materials.
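
A minimal sketch of such a usage ledger, aggregating cost and latency per task type. The per-million-token prices are placeholder parameters you must fill in with your provider's current rates; nothing here reflects real pricing.

```javascript
// Sketch of a per-task usage ledger. Prices are caller-supplied placeholders
// (per million tokens) -- look up real rates, do not hardcode them.
function makeUsageLedger(pricePerMTokIn, pricePerMTokOut) {
  const byTask = new Map();
  return {
    record(taskType, promptTokens, outputTokens, latencyMs) {
      const e = byTask.get(taskType) ?? { calls: 0, cost: 0, totalLatencyMs: 0 };
      e.calls += 1;
      e.cost += (promptTokens * pricePerMTokIn + outputTokens * pricePerMTokOut) / 1_000_000;
      e.totalLatencyMs += latencyMs;
      byTask.set(taskType, e);
      return e;
    },
    report() {
      // Derive average latency per task type for dashboarding
      return Object.fromEntries(
        [...byTask].map(([t, e]) => [t, { ...e, avgLatencyMs: e.totalLatencyMs / e.calls }])
      );
    },
  };
}
```

Review the report monthly and you have the real numbers the paragraph above recommends, per task type rather than per provider marketing page.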

Failure Modes to Harden Against

  • RESOURCE_EXHAUSTED (429): Use a 5-second base delay with jitter. Gemini’s quota windows are per minute – log RPM metrics to catch spikes before they become errors.
  • Infinite tool loops: Gemini can get into cycles where it calls the same tool repeatedly with slightly different args. The maxTurns guard is essential. Log the tool name + args on each call to detect cycles early.
  • Large context accumulation: Gemini’s Chat session adds every turn to history. For multi-hour agent sessions, this can balloon token costs. Implement a sliding window or summarization strategy at 50K tokens.
  • Safety filter false positives: Check finishReason === 'SAFETY' and handle it distinctly from other errors – do not silently return an empty string to the user.
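
The sliding-window idea from the third bullet can be sketched as a pure helper. The 4-characters-per-token estimate is a rough heuristic, not the model's real tokenizer, and the `{ role, text }` history shape is an assumption for illustration:

```javascript
// Rough token estimate: ~4 characters per token (heuristic, not exact).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// history: array of { role, text } turns, oldest first. Keeps the first turn
// (usually the task framing) plus the most recent turns that fit the budget.
function trimHistory(history, maxTokens = 50_000) {
  if (history.length <= 1) return history;
  const [first, ...rest] = history;
  let budget = maxTokens - estimateTokens(first.text);
  const kept = [];
  // Walk backwards from the newest turn, keeping turns until the budget runs out
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].text);
    if (cost > budget) break;
    budget -= cost;
    kept.unshift(rest[i]);
  }
  return [first, ...kept];
}
```

For higher fidelity, swap the heuristic for the SDK's countTokens call; the trimming logic stays the same.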

What to Build Next

  • Extract GeminiMcpClient into a reusable npm package with a clean API and write a node:test suite against it using a mock MCP server.
  • Deploy it to Cloud Run with Vertex AI credentials, Gemini 2.0 Flash, and a Cloud Monitoring dashboard tracking token usage, tool call latency, and error rates.

nJoy πŸ˜‰

Lesson 27 of 55: Google AI Studio, Vertex AI, and MCP Servers for Enterprises

Running Gemini through the free Google AI Studio API is fine for prototypes, but enterprise deployments require what Vertex AI provides: VPC-SC network boundaries, CMEK encryption, IAM-based access control, regional data residency, no prompts used for model training, and SLA-backed uptime. If your MCP server handles customer data, PII, or proprietary IP, Vertex AI is the correct target environment. This lesson covers the transition from AI Studio to Vertex AI and the MCP-specific patterns that differ between the two.

Vertex AI enterprise diagram with VPC security IAM CMEK regional data residency connected to MCP server dark
Vertex AI provides the enterprise security posture that production MCP deployments require.

AI Studio vs Vertex AI: The Key Differences

| Feature | AI Studio | Vertex AI |
| --- | --- | --- |
| Auth | API key | GCP Service Account / ADC |
| Network isolation | Public internet | VPC-SC, Private Service Connect |
| Data used for training | May be used | Never used |
| Encryption | Google-managed | Google-managed or CMEK |
| Regional control | Limited | Full (europe-west1, us-east4, etc.) |
| SLA | No SLA | 99.9% SLA |
| Pricing model | Pay per token | Pay per token + provisioned throughput option |

For many teams, the “data used for training” row is the deciding factor. If your MCP tools process customer records, health data, or financial transactions, the guarantee that Vertex AI never trains on your prompts or responses is often a compliance requirement, not just a preference.

Setting Up Vertex AI in Node.js

Vertex AI uses Application Default Credentials (ADC) instead of API keys. In development, authenticate with gcloud auth application-default login. In production, attach a service account to your Compute Engine instance or Cloud Run service.

npm install @google-cloud/vertexai @modelcontextprotocol/sdk

import { VertexAI } from '@google-cloud/vertexai';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const vertex = new VertexAI({
  project: process.env.GCP_PROJECT_ID,
  location: process.env.GCP_REGION ?? 'us-central1',
});

// Connect to MCP server
const transport = new StdioClientTransport({ command: 'node', args: ['./servers/data-server.js'] });
const mcp = new Client({ name: 'vertex-host', version: '1.0.0' });
await mcp.connect(transport);
const { tools: mcpTools } = await mcp.listTools();

// Convert MCP tools to Vertex AI FunctionDeclarations
const vertexTools = [{
  functionDeclarations: mcpTools.map(t => ({
    name: t.name,
    description: t.description,
    parameters: t.inputSchema,
  })),
}];

const model = vertex.preview.getGenerativeModel({
  model: 'gemini-2.0-flash-001',  // Vertex uses versioned model names
  tools: vertexTools,
});

The key difference from AI Studio is that you never handle API keys directly. ADC resolves credentials from the environment – a service account JSON file locally, or Workload Identity on GKE. This eliminates an entire class of secret-management bugs that plague API key-based deployments.

Vertex AI authentication via ADC: no API keys in your code, credentials come from the GCP environment.

The Tool Calling Loop on Vertex AI

async function runVertexMcpLoop(userMessage) {
  const chat = model.startChat();
  let response = await chat.sendMessage(userMessage);
  let candidate = response.response.candidates[0];

  while (candidate.content.parts.some(p => p.functionCall)) {
    const calls = candidate.content.parts.filter(p => p.functionCall);
    const results = await Promise.all(
      calls.map(async part => {
        const fc = part.functionCall;
        const mcpResult = await mcp.callTool({ name: fc.name, arguments: fc.args });
        const text = mcpResult.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
        return {
          functionResponse: {
            name: fc.name,
            response: { result: text },
          },
        };
      })
    );
    response = await chat.sendMessage(results);
    candidate = response.response.candidates[0];
  }

  return candidate.content.parts.filter(p => p.text).map(p => p.text).join('');
}

The tool calling loop is identical to the AI Studio version. The only differences are the SDK (@google-cloud/vertexai), the auth mechanism (ADC), and the model names (versioned rather than aliased).

This identical loop structure is a deliberate design choice by Google. Teams can prototype with AI Studio (free tier, API key) and then move to Vertex AI (production, IAM) by changing only the SDK import and initialization. Your MCP integration code, tool schemas, and business logic remain untouched.

Grounding with Google Search on Vertex AI

Vertex AI offers Grounding with Google Search – a built-in tool that adds real-time web search to Gemini responses. You can combine this with your custom MCP tools:

const modelWithGrounding = vertex.preview.getGenerativeModel({
  model: 'gemini-2.0-flash-001',
  tools: [
    { googleSearchRetrieval: {} },  // Enable Google Search grounding
    { functionDeclarations: mcpTools.map(t => ({  // And your MCP tools
      name: t.name, description: t.description, parameters: t.inputSchema,
    })) },
  ],
});

Grounding with Google Search is especially powerful for MCP agents that need both internal and external data. Your MCP tools handle proprietary databases and internal APIs, while Google Search fills in real-time public information – stock prices, weather, news, regulatory updates – without building additional tools for each source.

Deploying Your MCP Server on Cloud Run

A common Vertex AI pattern: your MCP server runs as a container on Cloud Run (using Streamable HTTP transport), and your Node.js host service makes MCP calls to it over HTTPS. This pairs well with Vertex AI because both services can use the same VPC connector:

// Cloud Run MCP server URL (deployed with your service)
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamable-http.js';

const transport = new StreamableHTTPClientTransport(
  new URL(process.env.MCP_SERVER_URL)  // e.g., https://my-mcp-server-xyz.run.app/mcp
);

const mcp = new Client({ name: 'vertex-cloud-host', version: '1.0.0' });
await mcp.connect(transport);

Provisioned Throughput for Predictable Latency

Vertex AI’s Provisioned Throughput option pre-allocates model capacity, eliminating the latency spikes that come from shared infrastructure. For MCP agents processing high-value business transactions (order processing, financial analysis, customer support), this is worth the cost:

// Configure provisioned throughput model endpoint
const model = vertex.preview.getGenerativeModel({
  model: 'projects/PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID',  // Provisioned endpoint
  tools: vertexTools,
});

“Vertex AI provides enterprise-grade security and privacy controls, ensuring your data is never used to train Google’s models and stays within your chosen regions.” – Google Cloud, Vertex AI Data Governance

Service Account Least-Privilege Setup

# Create a service account for your MCP host
gcloud iam service-accounts create mcp-vertex-host \
  --display-name="MCP Vertex AI Host"

# Grant only the roles required for Gemini API calls
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:mcp-vertex-host@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

# For Cloud Run, set the service account at deploy time
gcloud run deploy my-mcp-host \
  --service-account=mcp-vertex-host@PROJECT_ID.iam.gserviceaccount.com \
  --region=europe-west1

In a real deployment, you will likely have two service accounts: one for the MCP host (needs aiplatform.user to call Gemini) and one for the MCP server on Cloud Run (needs access to your databases, APIs, and storage). Separating these accounts limits the blast radius if either service is compromised.

Failure Modes Specific to Vertex AI

  • Quota limits per region: Vertex AI quotas are per-region, not global. If you hit limits in us-central1, consider distributing across regions with a simple fallback.
  • ADC credential expiry: Service account tokens expire after 1 hour. The @google-cloud/vertexai SDK handles refresh automatically, but ensure the underlying credential source (Workload Identity, attached service account) is correctly configured.
  • VPC-SC policy blocking API calls: If your MCP server is behind a VPC Service Controls perimeter, ensure aiplatform.googleapis.com is in the allowed services list.
  • Model names are versioned: Unlike AI Studio’s gemini-2.0-flash alias, Vertex uses stable names like gemini-2.0-flash-001. Pin to a version in production to avoid unexpected breaking changes.
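
The "simple fallback" from the first bullet can be sketched as a generic helper. `makeCall` is a hypothetical callback you supply; it receives a region name and performs the actual Vertex AI request (for example by constructing a VertexAI client for that location):

```javascript
// Try each region in order, falling back only when the error looks like a
// regional quota problem (429 / RESOURCE_EXHAUSTED). Other errors rethrow
// immediately -- a bad request will fail identically in every region.
async function withRegionFallback(regions, makeCall) {
  let lastErr;
  for (const region of regions) {
    try {
      return await makeCall(region);
    } catch (err) {
      const isQuota = err.status === 429 || err.message?.includes('RESOURCE_EXHAUSTED');
      if (!isQuota) throw err;
      lastErr = err;
    }
  }
  throw lastErr;
}
```

Usage might look like `withRegionFallback(['us-central1', 'us-east4'], region => runInRegion(region, userMessage))`, where `runInRegion` is your own wrapper around client construction and the chat call.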

What to Build Next

  • Deploy a Cloud Run MCP server with Streamable HTTP transport and connect it to a Vertex AI host. Verify the full flow from user request to tool execution to response.
  • Set up a service account with least-privilege IAM and test that your MCP host can call Vertex AI and your Cloud Run MCP server without any extra roles.

nJoy πŸ˜‰

Lesson 26 of 55: Multimodal MCP – Images, PDFs, and Audio With Google Gemini

Gemini’s multimodal capabilities are not a bolt-on feature — they are a first-class part of the API. When you combine them with MCP, you unlock tool-calling patterns that no other provider can match today: an agent that reads a PDF invoice and simultaneously queries your accounting database; a vision pipeline that processes uploaded product photos and calls your inventory API; an audio transcription workflow that tags clips with taxonomy from your knowledge base. This lesson covers the full multimodal stack for MCP applications.

Gemini accepts images, PDFs, audio, and video alongside text and tool calls in the same request.

The Multimodal Parts System

Every Gemini request is built from an array of parts. A part can be text, inline data (base64), or a file URI from the Files API. This composability is what makes multimodal tool calling clean:

import { GoogleGenerativeAI } from '@google/generative-ai';
import fs from 'node:fs';
import path from 'node:path';

const genai = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

// Inline image (small files up to ~20MB)
function imageToInlinePart(filePath, mimeType = 'image/jpeg') {
  const data = fs.readFileSync(filePath).toString('base64');
  return { inlineData: { mimeType, data } };
}

// Inline PDF
function pdfToInlinePart(filePath) {
  const data = fs.readFileSync(filePath).toString('base64');
  return { inlineData: { mimeType: 'application/pdf', data } };
}

This parts-based architecture is why Gemini multimodal feels natural to work with. Rather than separate endpoints for vision and text, you compose a single array of parts, mixing modalities freely. When MCP tools are added on top, the model can reason across image content and tool results in the same conversation turn.

Image Analysis + MCP Tool Calls

A common pattern: the user uploads a product photo, the model identifies it visually, then calls an MCP tool to fetch live inventory and pricing data for that product:

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const transport = new StdioClientTransport({
  command: 'node',
  args: ['./servers/inventory-server.js'],
});
const mcp = new Client({ name: 'vision-agent', version: '1.0.0' });
await mcp.connect(transport);
const { tools: mcpTools } = await mcp.listTools();

const model = genai.getGenerativeModel({
  model: 'gemini-2.0-flash',
  tools: [{ functionDeclarations: mcpTools.map(t => ({ name: t.name, description: t.description, parameters: t.inputSchema })) }],
});

async function analyzeProductImage(imagePath) {
  const chat = model.startChat();
  const imagePart = imageToInlinePart(imagePath);

  let response = await chat.sendMessage([
    imagePart,
    { text: 'Identify this product and check its current inventory and pricing using the available tools.' },
  ]);

  let candidate = response.response.candidates[0];

  // Tool calling loop (same pattern as lesson 24)
  while (candidate.content.parts.some(p => p.functionCall)) {
    const calls = candidate.content.parts.filter(p => p.functionCall);
    const results = await Promise.all(calls.map(async part => {
      const fc = part.functionCall;
      const result = await mcp.callTool({ name: fc.name, arguments: fc.args });
      const text = result.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
      return { functionResponse: { name: fc.name, response: { result: text } } };
    }));
    response = await chat.sendMessage(results);
    candidate = response.response.candidates[0];
  }

  return candidate.content.parts.filter(p => p.text).map(p => p.text).join('');
}

const analysis = await analyzeProductImage('./uploads/product-photo.jpg');
console.log(analysis);

In practice, this pattern powers use cases like warehouse quality control (photograph a shelf, look up inventory), insurance claims processing (photograph damage, cross-reference policy), and retail product identification. The key insight is that the model handles the visual recognition while MCP tools provide the live data layer.

Invoice processing: Gemini reads the PDF, extracts line items, then calls MCP tools to verify each item against your ERP.

PDF Processing with MCP Enrichment

PDFs are particularly powerful. A 200-page contract, a scanned invoice, or a technical specification can be passed to Gemini as inline data. The model reads the entire document and can simultaneously call MCP tools to enrich or validate the extracted information:

async function processInvoice(pdfPath) {
  const model = genai.getGenerativeModel({
    model: 'gemini-2.5-pro-preview-03-25',  // Pro for complex document understanding
    tools: [{ functionDeclarations: mcpTools.map(t => ({ name: t.name, description: t.description, parameters: t.inputSchema })) }],
  });

  const chat = model.startChat();
  const pdfPart = pdfToInlinePart(pdfPath);

  let response = await chat.sendMessage([
    pdfPart,
    {
      text: `Extract all line items from this invoice (product name, SKU, quantity, unit price, total).
For each line item, use the verify_product tool to check if the SKU exists in our system.
Flag any discrepancies between the invoice price and our current pricing.
Return a structured JSON summary.`,
    },
  ]);

  let candidate = response.response.candidates[0];
  while (candidate.content.parts.some(p => p.functionCall)) {
    const calls = candidate.content.parts.filter(p => p.functionCall);
    const results = await Promise.all(calls.map(async part => {
      const fc = part.functionCall;
      const result = await mcp.callTool({ name: fc.name, arguments: fc.args });
      const text = result.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
      return { functionResponse: { name: fc.name, response: { result: text } } };
    }));
    response = await chat.sendMessage(results);
    candidate = response.response.candidates[0];
  }

  return candidate.content.parts.filter(p => p.text).map(p => p.text).join('');
}

Be careful with PDF size when using inline base64 encoding. A 50-page PDF might be 5-10MB, but base64 inflates that by roughly 33%. If your invoices or contracts regularly exceed 15MB, skip ahead to the Files API section below to avoid hitting request size limits.
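
That 33% inflation is easy to check before encoding: base64 turns every 3 raw bytes into 4 characters. A pre-flight guard like the sketch below (the 15MB threshold mirrors the advice above; it is not an official API limit) saves you from discovering the problem as a rejected request:

```javascript
// Base64 encodes every 3 bytes as 4 characters, so payloads grow by ~33%.
function base64InflatedBytes(rawBytes) {
  return Math.ceil(rawBytes / 3) * 4;
}

// Hypothetical guard: route large files to the Files API instead of inlining.
// The default threshold is conservative advice, not a documented hard limit.
function shouldUseFilesApi(rawBytes, inlineLimitBytes = 15 * 1024 * 1024) {
  return base64InflatedBytes(rawBytes) > inlineLimitBytes;
}
```

Call `shouldUseFilesApi(fs.statSync(pdfPath).size)` before choosing between `pdfToInlinePart` and an upload.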

The Files API for Large Files

Files over ~20MB (or videos) should use the Files API rather than inline base64. This also supports re-use across multiple requests without re-uploading:

import { GoogleAIFileManager } from '@google/generative-ai/server';

const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY);

// Upload once, use multiple times
const uploadedFile = await fileManager.uploadFile('./large-report.pdf', {
  mimeType: 'application/pdf',
  displayName: 'Q1 2026 Report',
});

console.log(`Uploaded: ${uploadedFile.file.uri}`);

// Wait for processing to complete
let fileState = await fileManager.getFile(uploadedFile.file.name);
while (fileState.state === 'PROCESSING') {
  await new Promise(r => setTimeout(r, 2000));
  fileState = await fileManager.getFile(uploadedFile.file.name);
}

if (fileState.state !== 'ACTIVE') {
  throw new Error(`File processing failed: ${fileState.state}`);
}

// Reference in any model call
const filePart = {
  fileData: {
    mimeType: fileState.mimeType,
    fileUri: fileState.uri,
  },
};

// Now use filePart in chat.sendMessage([filePart, { text: 'Summarize this report...' }])

The Files API also opens up video analysis. You can upload a product demo video, have Gemini analyze visual content frame by frame, and then call MCP tools to log findings or trigger workflows. Audio and video share the same upload-then-reference pattern shown above.

Audio Transcription + Tool Enrichment

async function processAudioWithEnrichment(audioPath) {
  const model = genai.getGenerativeModel({
    model: 'gemini-2.0-flash',
    tools: [{ functionDeclarations: mcpTools.map(t => ({ name: t.name, description: t.description, parameters: t.inputSchema })) }],
  });

  const audioData = fs.readFileSync(audioPath).toString('base64');
  const audioPart = { inlineData: { mimeType: 'audio/mpeg', data: audioData } };

  const chat = model.startChat();
  let response = await chat.sendMessage([
    audioPart,
    { text: 'Transcribe this audio. Then identify any product names or order numbers mentioned and look them up in our system.' },
  ]);

  let candidate = response.response.candidates[0];
  while (candidate.content.parts.some(p => p.functionCall)) {
    const calls = candidate.content.parts.filter(p => p.functionCall);
    const results = await Promise.all(calls.map(async part => {
      const fc = part.functionCall;
      const result = await mcp.callTool({ name: fc.name, arguments: fc.args });
      const text = result.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
      return { functionResponse: { name: fc.name, response: { result: text } } };
    }));
    response = await chat.sendMessage(results);
    candidate = response.response.candidates[0];
  }

  return candidate.content.parts.filter(p => p.text).map(p => p.text).join('');
}
Image, PDF, and audio inputs all follow the same parts-based pattern – the tool calling loop is identical.

Notice that all three modality examples – image, PDF, and audio – reuse the identical tool-calling loop from the function calling lesson. This is intentional. Multimodal inputs change what the model sees, not how it calls tools. Your MCP server code stays exactly the same regardless of input type.

Multimodal MCP Resources

MCP resources can also return binary content (type 'blob') – useful for image thumbnails, report PDFs, or audio clips. You can fetch these from a resource and pass them directly to Gemini:

// Fetch a binary resource from MCP and pass it to Gemini
const resource = await mcp.readResource({ uri: 'report://monthly/2026-03.pdf' });
const blobContent = resource.contents.find(c => c.mimeType === 'application/pdf');

if (blobContent) {
  const pdfPart = { inlineData: { mimeType: 'application/pdf', data: blobContent.blob } };
  const response = await chat.sendMessage([pdfPart, { text: 'Extract the key metrics from this report.' }]);
}

Audio Content Type in MCP

New in 2025-03-26

MCP added audio as a first-class content type alongside text and image in spec version 2025-03-26. Tool results and resource contents can now include audio blocks with base64-encoded data and a MIME type. This means an MCP server can return audio recordings, synthesised speech, or extracted audio clips directly as tool output – clients that support audio can play or process them without a separate download step.

// MCP tool returning audio content
server.tool('transcribe_meeting', '...', { recording_uri: z.string() }, async ({ recording_uri }) => {
  const audioBuffer = await downloadAudio(recording_uri);
  const transcript = await transcribe(audioBuffer);

  return {
    content: [
      { type: 'text', text: transcript },
      {
        type: 'audio',
        data: audioBuffer.toString('base64'),
        mimeType: 'audio/wav',
      },
    ],
  };
});

For Gemini specifically, you can pass MCP audio content directly into the inlineData format shown in the examples above. The audio type supports WAV, MP3, FLAC, OGG, and other standard MIME types. For files over ~10MB, use the Gemini Files API to upload first, then reference by file URI.
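
A minimal converter for that hand-off might look like this; it assumes the MCP content block shapes shown above (`{ type, data, mimeType }` for binary, `{ type: 'text', text }` for text) and is a sketch, not an SDK utility:

```javascript
// Convert an MCP content block into a Gemini part. Binary blocks map onto
// inlineData; text maps onto a plain text part. Anything else is rejected
// so unsupported types fail loudly instead of silently dropping data.
function mcpContentToGeminiPart(block) {
  if (block.type === 'text') {
    return { text: block.text };
  }
  if (block.type === 'audio' || block.type === 'image') {
    return { inlineData: { mimeType: block.mimeType, data: block.data } };
  }
  throw new Error(`Unsupported MCP content type: ${block.type}`);
}
```

With this in place, a tool result's content array can be mapped directly into the parts array of a follow-up `chat.sendMessage` call.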

Failure Modes to Watch

  • File too large for inline: Base64 encoding a 50MB video inline will hit request size limits. Use the Files API for anything over ~10MB to be safe.
  • Unsupported MIME types: Not all MIME types work with all models. Test image/webp and application/x-pdf variants – stick to image/jpeg, image/png, and application/pdf for broadest support.
  • Files API cleanup: Uploaded files persist for 48 hours. For GDPR/CCPA compliance, explicitly delete files after processing with fileManager.deleteFile(name).
  • Audio length limits: Inline audio has a limit of about 20MB; use the Files API for longer recordings. Processing 1 hour of audio uses roughly 1,750 tokens per minute.
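
Those audio numbers translate into a quick budget check. This back-of-envelope helper uses the ~1,750 tokens-per-minute figure from the list above; verify against current Gemini documentation before relying on it for billing:

```javascript
// Rough audio token estimate from the ~1,750 tokens/minute figure above.
// Hypothetical helper - check current Gemini pricing docs before relying on it.
const AUDIO_TOKENS_PER_MINUTE = 1750;

function estimateAudioTokens(durationMinutes) {
  return Math.ceil(durationMinutes * AUDIO_TOKENS_PER_MINUTE);
}

// One hour of audio: 60 * 1750 = 105,000 tokens before any text prompt
```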

What to Build Next

  • Build an invoice processing MCP agent: takes a scanned PDF, extracts line items, calls a lookup_product MCP tool for each SKU, and outputs a reconciled JSON report.
  • Add a get_image resource to an MCP server that returns product photos as blob content – have Gemini analyze them and then call your tag_product tool.

nJoy πŸ˜‰

Lesson 25 of 55: Gemini 2.0 and 2.5 Pro + MCP – Function Calling at Scale

Google’s Gemini 2.0 Flash and 2.5 Pro bring a distinct approach to function calling that differs meaningfully from both OpenAI and Claude. Understanding those differences — particularly around parallel tool execution, the functionDeclarations schema, and how Gemini handles tool results — will save you hours of debugging when you first wire up an MCP server to the Gemini API.

Gemini 2.0 Flash and 2.5 Pro function calling with MCP server diagram dark teal
Gemini function calling bridges the Gemini API to MCP tools via a schema conversion layer.

The Gemini Function Calling Model

Gemini’s tool-calling API is part of its @google/generative-ai package (or the newer @google/genai unified SDK). The key primitives are:

  • FunctionDeclaration – describes a function with a name, description, and JSON Schema parameters
  • FunctionCall – the model’s request to invoke a function (name + args as a plain object)
  • FunctionResponse – your code’s reply to a function call (name + response object)
  • Tool – a wrapper around an array of FunctionDeclaration objects

Gemini can issue multiple FunctionCall parts in a single response, meaning it supports parallel tool calling natively. This is a significant performance advantage when your agent can execute tools concurrently.
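
As a sketch, the four primitives look like this on the wire (plain-object shapes, not the SDK's exported types; the product-search names are illustrative):

```javascript
// FunctionDeclaration: what you send to the model
const declaration = {
  name: 'search_products',
  description: 'Search the product catalogue',
  parameters: {
    type: 'object',
    properties: { query: { type: 'string' } },
    required: ['query'],
  },
};

// Tool: a wrapper around an array of declarations
const tool = { functionDeclarations: [declaration] };

// FunctionCall: what the model sends back (inside a content part)
const call = { name: 'search_products', args: { query: 'laptop' } };

// FunctionResponse: what your code sends in reply, echoing the call's name
const reply = { functionResponse: { name: call.name, response: { result: '3 laptops found' } } };
```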

Installing the SDK

npm install @google/generative-ai @modelcontextprotocol/sdk zod

Use Node.js 22+ with "type": "module" in your package.json. Store your API key as GEMINI_API_KEY in a .env file and load it with node --env-file=.env.

Converting MCP Tools to Gemini FunctionDeclarations

The MCP SDK returns tools as JSON Schema objects. Gemini’s FunctionDeclaration also uses JSON Schema for parameters, but the wrapper format differs. The conversion is straightforward:

import { GoogleGenerativeAI } from '@google/generative-ai';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const genai = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

// Connect to an MCP server
const transport = new StdioClientTransport({
  command: 'node',
  args: ['./servers/product-server.js'],
});

const mcp = new Client({ name: 'gemini-host', version: '1.0.0' });
await mcp.connect(transport);

// List MCP tools
const { tools: mcpTools } = await mcp.listTools();

// Convert to Gemini FunctionDeclarations
function mcpToolToGeminiDeclaration(tool) {
  return {
    name: tool.name,
    description: tool.description,
    parameters: tool.inputSchema,  // MCP already uses JSON Schema - pass through directly
  };
}

const geminiTools = [
  {
    functionDeclarations: mcpTools.map(mcpToolToGeminiDeclaration),
  },
];

This near-zero conversion cost is one of Gemini’s practical advantages over OpenAI. Because both MCP and Gemini use raw JSON Schema for parameters, you avoid the nested function wrapper that OpenAI requires. In a multi-provider setup, fewer transformations mean fewer bugs.

MCP JSON Schema parameters being converted to Gemini FunctionDeclaration format comparison dark diagram
MCP’s JSON Schema for tool parameters maps directly to Gemini’s parameters field – no deep transformation needed.
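
The pass-through works for simple schemas, but some MCP servers emit JSON Schema keywords (such as $schema or additionalProperties) that Gemini's schema subset can reject with a 400. A defensive strip pass is cheap insurance; the blocklist here is a starting assumption, not an exhaustive list:

```javascript
// Recursively drop JSON Schema keywords outside Gemini's accepted subset.
// The blocklist is an assumption - extend it if the API rejects other keys.
const UNSUPPORTED_KEYS = new Set(['$schema', 'additionalProperties']);

function sanitizeSchema(schema) {
  if (Array.isArray(schema)) return schema.map(sanitizeSchema);
  if (schema === null || typeof schema !== 'object') return schema;
  const clean = {};
  for (const [key, value] of Object.entries(schema)) {
    if (UNSUPPORTED_KEYS.has(key)) continue;
    clean[key] = sanitizeSchema(value);
  }
  return clean;
}
```

If you hit schema errors, apply it in the converter: parameters: sanitizeSchema(tool.inputSchema).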

The Full Tool Calling Loop

async function runGeminiMcpLoop(userMessage) {
  const model = genai.getGenerativeModel({
    model: 'gemini-2.0-flash',
    tools: geminiTools,
  });

  const chat = model.startChat();

  let response = await chat.sendMessage(userMessage);
  let candidate = response.response.candidates[0];

  while (candidate.finishReason === 'STOP' && hasFunctionCalls(candidate)) {
    const functionCalls = candidate.content.parts.filter(p => p.functionCall);

    // Execute all function calls in parallel (Gemini can issue multiple at once)
    const results = await Promise.all(
      functionCalls.map(async (part) => {
        const fc = part.functionCall;
        const mcpResult = await mcp.callTool({
          name: fc.name,
          arguments: fc.args,
        });
        const text = mcpResult.content
          .filter(c => c.type === 'text')
          .map(c => c.text)
          .join('\n');
        return {
          functionResponse: {
            name: fc.name,
            response: { result: text },
          },
        };
      })
    );

    // Send all function responses back in a single turn
    response = await chat.sendMessage(results);
    candidate = response.response.candidates[0];
  }

  return candidate.content.parts
    .filter(p => p.text)
    .map(p => p.text)
    .join('');
}

function hasFunctionCalls(candidate) {
  return candidate.content.parts.some(p => p.functionCall);
}

Note the key difference from OpenAI and Claude: Gemini uses a Chat session object (model.startChat()) with chat.sendMessage() instead of stateless messages arrays. The chat object maintains conversation history internally.

In production, this loop is the heartbeat of your agent. Every real-world Gemini MCP integration – from customer support bots to internal data dashboards – runs some version of this pattern. Getting it right here means the rest of your application can treat tool calling as a solved problem.

Gemini 2.5 Pro: Longer Context and Better Reasoning

Switch from gemini-2.0-flash to gemini-2.5-pro-preview-03-25 (or the latest stable alias) for tasks requiring deeper reasoning:

const model = genai.getGenerativeModel({
  model: 'gemini-2.5-pro-preview-03-25',  // 1M token context window
  tools: geminiTools,
  generationConfig: {
    temperature: 0.7,
    maxOutputTokens: 8192,
  },
});

Gemini 2.5 Pro’s 1-million-token context window makes it ideal for MCP agents that need to analyze large datasets, entire codebases, or long document collections through resource-fetching tools.

With the model tier covered, the next question is speed. Even the best model is slow if it makes tool calls sequentially when it could run them in parallel. This is where Gemini’s default parallel calling behavior becomes especially valuable.

Parallel Tool Execution: The Gemini Advantage

When Gemini issues multiple function calls in one response, it means it has determined those calls can be satisfied independently. This enables true parallel execution at your application layer:

// Gemini may respond with multiple FunctionCall parts simultaneously
// Example: searching multiple databases at once
// candidate.content.parts = [
//   { functionCall: { name: 'search_products', args: { query: 'laptop' } } },
//   { functionCall: { name: 'get_inventory', args: { category: 'electronics' } } },
//   { functionCall: { name: 'get_pricing', args: { tier: 'enterprise' } } },
// ]

// Your Promise.all() handles them concurrently - real parallelism
const results = await Promise.all(functionCalls.map(callMcpTool));

Compare this to OpenAI (which also supports parallel calls) and Claude (which sequences calls unless you explicitly enable parallel tool use in the beta). Gemini’s default is to use parallelism aggressively when it makes sense.
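
One caveat: an unbounded Promise.all can overwhelm a single stdio MCP server if the model issues many calls at once. A dependency-free concurrency cap, sketched here as a hypothetical mapWithLimit helper:

```javascript
// Run async tasks over `items` with at most `limit` in flight at a time.
// Single-threaded JS makes `next++` safe: it runs synchronously before await.
async function mapWithLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  const count = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: count }, worker));
  return results;
}
```

Drop it in place of the bare Promise.all: const results = await mapWithLimit(functionCalls, 4, callMcpTool);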

Handling Errors in Tool Responses

async function callMcpToolSafe(fc, mcpClient) {
  try {
    const result = await mcpClient.callTool({
      name: fc.name,
      arguments: fc.args,
    });
    if (result.isError) {
      return {
        functionResponse: {
          name: fc.name,
          response: { error: result.content[0]?.text ?? 'Tool returned an error' },
        },
      };
    }
    const text = result.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
    return {
      functionResponse: {
        name: fc.name,
        response: { result: text },
      },
    };
  } catch (err) {
    return {
      functionResponse: {
        name: fc.name,
        response: { error: `Execution failed: ${err.message}` },
      },
    };
  }
}

Getting error handling right is critical because MCP tools are external processes that can fail for reasons completely outside your control: a database connection drops, an API rate-limits you, or the tool process crashes. Wrapping every call in a safe executor ensures a single broken tool surfaces as an error result instead of crashing your agent loop; add a per-call timeout as well, so a tool that hangs cannot stall the loop indefinitely.
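
The safe executor catches failures, but a tool that never returns still stalls the loop. Wrapping each call in a timeout (withTimeout is a hypothetical helper; 30 seconds is an arbitrary choice) turns a hang into a catchable error:

```javascript
// Reject if a promise does not settle within timeoutMs - keeps the loop moving
function withTimeout(promise, timeoutMs, label = 'tool call') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  // Whichever settles first wins; always clear the timer so the process can exit
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Inside callMcpToolSafe, wrap the call: await withTimeout(mcpClient.callTool({ name: fc.name, arguments: fc.args }), 30_000, fc.name) - the existing catch branch then handles timeouts like any other failure.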

“Function calling lets you connect Gemini models to external tools and APIs. Rather than processing all data internally, the model generates structured function calls that your application executes.” – Google AI for Developers, Function Calling Guide

With the core integration, model selection, parallel execution, and error handling covered, it is worth cataloging the specific ways things break in practice. These failure modes are drawn from real Gemini MCP deployments, not theoretical edge cases.

Common Failure Modes

  • Null parameters schema: If a tool has no parameters, pass parameters: { type: 'object', properties: {} } – omitting the field entirely causes a 400 error.
  • Nested arrays in schemas: Gemini is stricter about nested array schemas than OpenAI. Test each tool schema independently with a simple test call before integrating.
  • Chat session state: The Chat object holds history in memory. For multi-user applications, create a new startChat() per session – do not share a chat instance across users.
  • finishReason misread: Always check candidate.finishReason. A value of 'SAFETY' or 'RECITATION' means the response was blocked – handle these as errors rather than silently continuing.
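
The finishReason check from the last bullet can be enforced with a small guard, sketched here (assertUsableCandidate is a hypothetical helper you would call after each sendMessage):

```javascript
// Fail fast on blocked responses instead of silently continuing the loop.
// SAFETY and RECITATION are the reasons called out above; extend the set
// with any other blocked reasons your deployment encounters.
const BLOCKED_REASONS = new Set(['SAFETY', 'RECITATION']);

function assertUsableCandidate(candidate) {
  if (!candidate) throw new Error('Gemini returned no candidates');
  if (BLOCKED_REASONS.has(candidate.finishReason)) {
    throw new Error(`Response blocked: finishReason=${candidate.finishReason}`);
  }
  return candidate;
}
```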

What to Build Next

  • Swap gemini-2.0-flash for gemini-2.5-pro and pass a 500KB document as a user message – observe how the model leverages full-context reasoning alongside tool calls.
  • Add a maximumTurns guard to your tool loop to prevent infinite agent loops.
  • Log response.response.usageMetadata.totalTokenCount to measure the cost of each agent run.
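
The maximumTurns guard suggested above can be a few lines; makeTurnGuard is a hypothetical helper you would create before the while loop and call at the top of each iteration:

```javascript
// Cap agent loop iterations so a confused model cannot loop forever
function makeTurnGuard(maxTurns = 10) {
  let turns = 0;
  return function checkTurn() {
    if (++turns > maxTurns) {
      throw new Error(`Agent exceeded ${maxTurns} tool-calling turns - aborting`);
    }
    return turns;
  };
}
```

Usage inside runGeminiMcpLoop: const checkTurn = makeTurnGuard(10); then checkTurn(); as the first statement of the while body.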

nJoy πŸ˜‰

Lesson 24 of 55: Production Patterns for Claude + MCP (Caching, Retries, Tool Guards)

Three months of production experience with Claude + MCP teaches you things that no documentation covers. The retry patterns that actually work. The system prompts that reduce hallucinated tool calls. The caching strategies that cut your bill in half. The error classes you will encounter and the ones that silently corrupt output. This lesson consolidates those hard-won patterns into a reference you can apply directly.

Production Claude MCP patterns overview diagram showing caching retry budget observability blocks dark
Production Claude + MCP: the patterns that separate reliable systems from ones that fail at 2am.

Prompt Caching for Cost Reduction

Anthropic’s prompt caching feature caches portions of the input prompt that do not change between requests. For MCP applications, the tool definitions and system prompt are perfect candidates – they are typically the same for every user in a session. Caching them can reduce costs by 50-90% on repeated calls.

// Enable prompt caching for stable content
const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 4096,
  system: [
    {
      type: 'text',
      text: `You are a helpful assistant with access to our product database and order management system.
Always verify product availability before confirming orders.
Format all prices in USD.`,
      cache_control: { type: 'ephemeral' },  // Cache this system prompt
    },
  ],
  tools: claudeTools.map((t, i) => ({
    ...t,
    // Cache tool definitions - they rarely change
    ...(i === claudeTools.length - 1
      ? { cache_control: { type: 'ephemeral' } }  // Cache after last tool definition
      : {}),
  })),
  messages,
});

// Check cache performance in usage stats
const usage = response.usage;
console.error(`Cache: ${usage.cache_read_input_tokens} hit, ${usage.cache_creation_input_tokens} created`);

In practice, the 5-minute rolling cache window means that prompt caching works best for active sessions with frequent back-and-forth. For batch jobs or infrequent requests, the cache expires between calls and you will not see meaningful savings.

“Prompt caching enables you to cache portions of your prompt. Cached data is stored server-side for a rolling 5-minute period, after which it expires. Cache hits save 90% of input token costs for the cached portion.” – Anthropic Documentation, Prompt Caching

System Prompt Patterns That Work

Claude responds better to system prompts that describe the persona, define tool usage rules, specify output format, and set boundaries – in that order. Vague system prompts produce vague tool use.

const PRODUCTION_SYSTEM_PROMPT = `You are a precise product research assistant for TechStore.

TOOL USAGE RULES:
1. Always call search_products before making any recommendations
2. For price comparisons, call get_product_price for each product separately
3. If a product has less than 3 reviews, note "limited reviews" in your response
4. Never recommend products that are out of stock (use check_availability first)
5. If tools return errors, explain what you could not verify rather than guessing

OUTPUT FORMAT:
- Lead with the recommendation, then supporting evidence
- Include price, rating, and availability for each recommended product
- Use bullet points for product comparisons
- End with "Note: Stock and prices verified at [current timestamp]"

BOUNDARIES:
- You can only recommend products from our catalogue
- Do not speculate about products not in the search results
- If the user asks for something outside our catalogue, say so clearly`;

A well-structured system prompt is the single highest-leverage improvement you can make to a Claude + MCP integration. Vague prompts like “be helpful and use tools when needed” produce erratic tool usage. Specific rules about when to call which tool, in what order, and how to handle errors reduce hallucinated tool calls dramatically.

Anthropic prompt caching diagram showing system prompt and tool definitions cached versus messages uncached cost reduction
Prompt caching: static content (system prompt, tool definitions) cached at 90% discount; dynamic messages are not cached.

Production Error Taxonomy

// Claude API errors and how to handle them

// 429 - Rate limit: retry with exponential backoff
// 529 - Overloaded: retry with longer backoff (Anthropic load)
// 400 - Bad request: check tool schema, messages format, max_tokens
// 401 - Auth error: check ANTHROPIC_API_KEY
// 413 - Request too large: trim context or summarize conversation history

// Non-error patterns to watch:
// stop_reason === 'max_tokens' - response was cut off, increase max_tokens
// stop_reason === 'end_turn' but no text - model may be stuck, check context

async function callClaudeWithRetry(params, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await anthropic.messages.create(params);
    } catch (err) {
      const shouldRetry = err.status === 429 || err.status === 529 || err.status >= 500;
      if (!shouldRetry || attempt === maxRetries) throw err;

      const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
      const retryAfter = err.headers?.['retry-after']
        ? parseInt(err.headers['retry-after']) * 1000
        : delay;

      console.error(`[claude] Attempt ${attempt} failed (${err.status}), retrying in ${retryAfter}ms`);
      await new Promise(r => setTimeout(r, retryAfter));
    }
  }
}

Rate limiting (429) is the error you will hit most often during development and load testing. Anthropic’s rate limits are per-organization, so one runaway script can block your entire team. Always implement retry logic before you start scaling, not after.
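
One refinement: the backoff in callClaudeWithRetry is deterministic, so many clients that fail together retry together. Adding full jitter (a random delay up to the exponential ceiling) spreads them out; jitteredDelay here is a hypothetical drop-in for the delay calculation:

```javascript
// Full-jitter backoff: a random delay in [0, min(capMs, baseMs * 2^attempt))
function jitteredDelay(attempt, baseMs = 1000, capMs = 30_000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}
```

In the retry loop, replace the Math.min(1000 * Math.pow(2, attempt), 30000) expression with jitteredDelay(attempt); the retry-after header should still take precedence when present.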

Context Management for Long Conversations

// Summarise old conversation history when approaching context limits
// Claude 3.5 Sonnet context: 200K tokens (allows very long conversations)
// But cost grows linearly with context - summarize for efficiency

async function summariseHistory(messages, anthropicClient) {
  const summaryRequest = await anthropicClient.messages.create({
    model: 'claude-3-5-haiku-20241022',  // Use cheaper model for summarisation
    max_tokens: 500,
    messages: [
      ...messages,
      { role: 'user', content: 'Summarise our conversation so far in 3 bullet points, preserving all key facts found via tool calls.' },
    ],
  });
  return summaryRequest.content[0].text;
}

// In your main conversation loop, check token usage:
if (response.usage.input_tokens > 50000) {
  const summary = await summariseHistory(messages, anthropic);
  messages = [{ role: 'user', content: `Previous conversation summary:\n${summary}` }];
}

Context summarization is a cost optimization, not just a technical constraint. Even if your conversation fits within Claude’s 200K context window, sending 100K tokens per request is expensive. Summarizing early keeps your per-request cost predictable and your latency consistent.

Failure Mode: Model Outputting Tool Calls That Do Not Exist

// Claude occasionally hallucinates tool names, especially if tool descriptions are vague
// Guard against this at the execution layer
const toolNames = new Set(mcpTools.map(t => t.name));

for (const toolUse of toolUseBlocks) {
  if (!toolNames.has(toolUse.name)) {
    console.error(`[warn] Claude called non-existent tool: ${toolUse.name}`);
    toolResults.push({
      type: 'tool_result',
      tool_use_id: toolUse.id,
      content: [{ type: 'text', text: `Tool '${toolUse.name}' does not exist. Available tools: ${[...toolNames].join(', ')}` }],
      is_error: true,
    });
    continue;
  }
  // ... execute valid tool
}

Hallucinated tool names are surprisingly common when your server exposes many tools with similar names, or when tool descriptions are ambiguous. The validation pattern above is cheap insurance: a few lines of code that prevent cascading failures from a single bad tool call.

What to Check Right Now

  • Enable prompt caching – add cache_control: { type: 'ephemeral' } to your system prompt and the last tool definition. Check the usage.cache_read_input_tokens to measure savings.
  • Add a tool existence check – validate every tool name Claude returns before attempting to execute it via MCP. Hallucinated tool calls happen in production.
  • Monitor stop reasons – log every stop_reason. A high rate of max_tokens stops means you need to increase max_tokens or summarize context sooner.
  • Measure prompt cache hit rates – aim for >70% cache hit rate in sustained conversations. Low hit rates mean your “static” content is actually varying between calls.
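
For the cache hit rate, one reasonable metric is cached reads as a share of all cache-eligible input tokens. The field names match the Anthropic usage object shown earlier; the formula itself is a choice, not an official definition:

```javascript
// Hit rate = tokens read from cache / (tokens read + tokens freshly cached).
// Returns 0 when no cache-eligible tokens were seen at all.
function cacheHitRate(usage) {
  const read = usage.cache_read_input_tokens ?? 0;
  const created = usage.cache_creation_input_tokens ?? 0;
  const total = read + created;
  return total === 0 ? 0 : read / total;
}
```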

nJoy πŸ˜‰

Lesson 23 of 55: Claude Code, Agent Skills, and MCP for Autonomous Coding

Claude Code is Anthropic’s autonomous coding agent – and it is built on MCP. When Claude Code reads files, runs tests, executes commands, and browses documentation, it does all of this through MCP servers. The architecture is not a coincidence: it is Anthropic demonstrating exactly how a production-grade autonomous agent should integrate with external systems. Understanding how Claude Code uses MCP is one of the fastest ways to understand how you should build your own agents.

Claude Code architecture diagram showing MCP servers for filesystem bash tools web browser connected to Claude agent dark
Claude Code is an MCP host: it connects to filesystem, bash, and browser servers, and orchestrates them with Claude 3.5.

Claude Code’s MCP Architecture

Claude Code (the CLI tool, claude) operates as an MCP host. When it starts, it connects to a set of built-in MCP servers that provide its core capabilities: computer-use (screen reading/clicking), bash (shell command execution), and files (filesystem read/write). You can extend Claude Code with your own custom MCP servers, making it immediately capable of working with your specific project tools.

# ~/.claude/config.json - Extend Claude Code with your MCP servers
{
  "mcpServers": {
    "my-project-tools": {
      "command": "node",
      "args": ["./tools/mcp-server.js"],
      "env": {
        "DATABASE_URL": "postgresql://localhost/mydb"
      }
    },
    "github-tools": {
      "command": "npx",
      "args": ["-y", "@anthropic-ai/mcp-github"]
    }
  }
}

Once configured, Claude Code can use your custom server’s tools as naturally as it uses its built-in bash or filesystem tools. Your create_github_issue tool becomes as usable as Bash(git commit).

Building Agent Skills for Claude Code

The most powerful Claude Code extension pattern is the “agent skill” – a specialised MCP server that encapsulates a complex workflow as a single callable tool. Instead of Claude figuring out the 20-step process to deploy a microservice, you encode those steps in a deploy_service tool that handles all the complexity.

// deploy-skill-server.js
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { z } from 'zod';
import { execFile, execFileSync } from 'node:child_process';
import { promisify } from 'node:util';

const exec = promisify(execFile);
const server = new McpServer({ name: 'deploy-skills', version: '1.0.0' });

server.tool(
  'deploy_service',
  `Deploy a microservice to Kubernetes. Handles build, push, and rollout.
  Returns the deployment status and the new pod count.`,
  {
    service_name: z.string().describe('Name of the service to deploy'),
    image_tag: z.string().describe('Docker image tag to deploy'),
    namespace: z.string().default('production').describe('Kubernetes namespace'),
    replicas: z.number().int().min(1).max(10).default(2),
  },
  async ({ service_name, image_tag, namespace, replicas }) => {
    const steps = [];

    // Step 1: Build
    await exec('docker', [
      'build', '-t', `${service_name}:${image_tag}`, './services/' + service_name
    ]);
    steps.push('Build: OK');

    // Step 2: Push
    await exec('docker', ['push', `myregistry/${service_name}:${image_tag}`]);
    steps.push('Push: OK');

    // Step 3: Deploy - pipe the manifest to kubectl via stdin.
    // execFileSync supports the `input` option; the promisified execFile does not.
    // generateK8sManifest(...) is assumed to be defined elsewhere in this file.
    const manifest = generateK8sManifest(service_name, image_tag, namespace, replicas);
    execFileSync('kubectl', ['apply', '-f', '-'], { input: manifest });
    steps.push(`Deploy: OK (${replicas} replicas)`);

    // Step 4: Wait for rollout
    await exec('kubectl', ['rollout', 'status', `deployment/${service_name}`, '-n', namespace]);
    steps.push('Rollout: Complete');

    return {
      content: [{ type: 'text', text: steps.join('\n') + `\n\nService ${service_name} deployed successfully.` }],
    };
  }
);

const transport = new StdioServerTransport();
await server.connect(transport);

Claude Code agent skills diagram showing complex deployment workflow encapsulated as single MCP tool dark
Agent skills: encode complex multi-step workflows as atomic MCP tools. Claude calls one tool instead of orchestrating ten.

Permission Modes in Claude Code

Claude Code has a permission system that controls what actions it can take without asking for confirmation. MCP tools are subject to the same permission model. You can configure Claude Code to auto-approve specific tools, require confirmation for destructive operations, or run in fully supervised mode.

# .claude/settings.json (project-level)
# allow: all git commands, read-only tools from my server, reading any file
# deny: deploy tools (always confirm first), recursive deletes (never auto-approve)
# Note: the file itself must be plain JSON - keep comments outside it
{
  "permissions": {
    "allow": [
      "Bash(git *)",
      "mcp:my-project-tools:read_*",
      "Read(**)"
    ],
    "deny": [
      "mcp:my-project-tools:deploy_*",
      "Bash(rm -rf *)"
    ]
  }
}

“Claude Code is designed to be an autonomous coding agent that can understand and work on complex codebases. It uses a set of built-in tools and can be extended with custom MCP servers to access domain-specific capabilities.” – Anthropic Documentation, Claude Code

Failure Modes with Claude Code MCP Extensions

Case 1: Tools That Are Too Granular

// BAD: Too granular - Claude has to call many tools in sequence and may make mistakes
server.tool('set_k8s_namespace', '...', { ns: z.string() }, handler);
server.tool('set_k8s_image', '...', { image: z.string() }, handler);
server.tool('apply_k8s_manifest', '...', { manifest: z.string() }, handler);
server.tool('watch_k8s_rollout', '...', { deployment: z.string() }, handler);

// BETTER: One atomic skill tool that handles the whole workflow
server.tool('deploy_service', 'Deploy a service to k8s...', { service: z.string(), ... }, handler);

Case 2: Forgetting to Handle Long-Running Operations

// Build + deploy can take minutes
// Don't timeout. Stream progress via notifications or use progress indicators
// Claude Code will wait, but it needs feedback to know the tool is running

server.tool('build_and_deploy', '...', { service: z.string() }, async ({ service }) => {
  // Send progress
  process.stderr.write(`[build] Starting build for ${service}...\n`);
  await buildService(service);  // May take 2-10 minutes
  process.stderr.write(`[deploy] Deploying ${service}...\n`);
  await deployService(service);
  return { content: [{ type: 'text', text: 'Done.' }] };
});

What to Check Right Now

  • Install Claude Code – npm install -g @anthropic-ai/claude-code. Then run claude in a project directory to see it in action.
  • Add your MCP server to Claude Code config – add it to ~/.claude/config.json or .claude/config.json (project-level). Then run claude and ask it to use your tool.
  • Design tools as atomic workflows – each tool should complete one meaningful unit of work end-to-end. Avoid exposing low-level implementation details as separate tools.
  • Review the permission system – set appropriate allow and deny rules for your project. Deny destructive tools by default and require explicit confirmation.

nJoy πŸ˜‰

Lesson 22 of 55: Claude Extended Thinking Mode With MCP Tools

Claude 3.7 Sonnet introduced extended thinking – a mode where the model spends additional compute on internal reasoning before producing its response. When combined with MCP tools, extended thinking transforms how the model approaches complex multi-step tasks: instead of immediately deciding to call a tool, Claude reasons through what it knows, what it needs, which tools would help, and what order to call them in. The result is dramatically fewer redundant tool calls and significantly better decisions on ambiguous tasks.

Claude extended thinking diagram showing internal reasoning loop before tool calls in dark minimal design
Extended thinking: Claude reasons internally before deciding to call tools – reducing noise, improving decisions.

Enabling Extended Thinking

Extended thinking is enabled by adding the thinking block to the API request. You control the “budget” – the maximum number of tokens Claude can use for internal reasoning. A higher budget allows deeper reasoning but adds latency and cost.

import Anthropic from '@anthropic-ai/sdk';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const mcpClient = new Client({ name: 'thinking-host', version: '1.0.0' }, { capabilities: {} });
await mcpClient.connect(new StdioClientTransport({ command: 'node', args: ['server.js'] }));
const { tools: mcpTools } = await mcpClient.listTools();
const claudeTools = mcpTools.map(t => ({ name: t.name, description: t.description, input_schema: t.inputSchema }));

async function runWithExtendedThinking(userMessage, thinkingBudget = 8000) {
  const messages = [{ role: 'user', content: userMessage }];

  while (true) {
    const response = await anthropic.messages.create({
      model: 'claude-3-7-sonnet-20250219',
      max_tokens: 16000,  // Must be > thinking budget
      thinking: {
        type: 'enabled',
        budget_tokens: thinkingBudget,  // Min: 1024, no hard max
      },
      tools: claudeTools,
      messages,
    });

    // Response may contain thinking blocks - they appear before text/tool_use
    const thinkingBlocks = response.content.filter(b => b.type === 'thinking');
    const textBlocks = response.content.filter(b => b.type === 'text');
    const toolUseBlocks = response.content.filter(b => b.type === 'tool_use');

    if (process.env.SHOW_THINKING) {
      for (const tb of thinkingBlocks) {
        console.error('\n[thinking]', tb.thinking.slice(0, 500) + '...');
      }
    }

    messages.push({ role: 'assistant', content: response.content });

    if (response.stop_reason === 'tool_use') {
      const toolResults = await Promise.all(
        toolUseBlocks.map(async (toolUse) => {
          const result = await mcpClient.callTool({
            name: toolUse.name,
            arguments: toolUse.input,
          });
          return { type: 'tool_result', tool_use_id: toolUse.id, content: result.content };
        })
      );
      messages.push({ role: 'user', content: toolResults });
    } else {
      return textBlocks.map(b => b.text).join('');
    }
  }
}

// Complex task: extended thinking shines here
const result = await runWithExtendedThinking(
  `I need to buy a laptop for machine learning research. 
   My budget is $2000. I prefer AMD GPUs but would consider NVIDIA. 
   It must have at least 32GB RAM expandable to 64GB, 
   and I work across Windows and Linux so driver support matters.
   Research and recommend the top 3 options.`,
  12000  // Higher budget for complex task
);

console.log(result);
await mcpClient.close();
Claude thinking budget dial showing trade-off between reasoning depth latency cost with task complexity axis dark
Choosing the thinking budget: simple tasks need 1,000-4,000 tokens; complex research tasks benefit from 8,000+.

When to Use Extended Thinking with MCP

Extended thinking is not free – it adds significant latency (often 10-30 seconds for high budgets) and substantial token cost. Use it selectively:

  • Use it for: complex research requiring 5+ tool calls, tasks requiring careful tradeoff analysis, situations where tool call order significantly affects outcome quality
  • Skip it for: simple lookups, single-tool tasks, time-sensitive queries, high-volume low-latency applications

// Adaptive thinking budget based on task complexity
function getThinkingBudget(task) {
  const wordCount = task.split(/\s+/).length;
  const hasComparisons = /compare|vs|versus|between|best|recommend/.test(task.toLowerCase());
  const hasMultipleRequirements = task.split(/\b(?:and|also|additionally|plus)\b/i).length > 2;

  if (hasComparisons && hasMultipleRequirements) return 10000;
  if (hasComparisons || hasMultipleRequirements) return 5000;
  if (wordCount > 50) return 3000;
  return 0;  // No thinking for simple tasks
}

const budget = getThinkingBudget(userInput);
if (budget > 0) {
  return runWithExtendedThinking(userInput, budget);
} else {
  return runWithClaude(userInput); // Standard tool calling
}

“Extended thinking causes Claude to reason more thoroughly about tasks before responding, which can substantially improve performance on complex tasks. Thinking tokens are not cached and must be included in the context window when continuing a conversation.” – Anthropic Documentation, Extended Thinking

Failure Modes with Extended Thinking

Case 1: Setting max_tokens Less Than Thinking Budget

// WRONG: budget_tokens (8000) exceeds max_tokens (4096)
const response = await anthropic.messages.create({
  max_tokens: 4096,
  thinking: { type: 'enabled', budget_tokens: 8000 }, // 8000 > 4096 - API error!
});

// CORRECT: max_tokens must be greater than budget_tokens
const response = await anthropic.messages.create({
  max_tokens: 16000,
  thinking: { type: 'enabled', budget_tokens: 8000 }, // Valid: 16000 > 8000
});
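A small guard can make this constraint impossible to violate. This helper is an assumption (not part of Anthropic's SDK); it bumps max_tokens whenever the thinking budget would otherwise exceed it:

```javascript
// Hypothetical helper: guarantee max_tokens leaves headroom above budget_tokens
function withThinking(params, budgetTokens, headroomTokens = 2048) {
  return {
    ...params,
    max_tokens: Math.max(params.max_tokens ?? 0, budgetTokens + headroomTokens),
    thinking: { type: 'enabled', budget_tokens: budgetTokens },
  };
}
```

Wrapping every thinking-enabled call, e.g. `anthropic.messages.create(withThinking({ model, messages, max_tokens: 4096 }, 8000))`, then satisfies the constraint regardless of the caller's max_tokens.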

Case 2: Not Passing Thinking Blocks Back in Continuation

// When continuing a conversation with extended thinking enabled,
// thinking blocks from previous turns MUST be included in the messages array.
// The SDK handles this automatically if you push the full response.content.
messages.push({ role: 'assistant', content: response.content }); // Include ALL blocks including thinking

What to Check Right Now

  • Test with SHOW_THINKING=1 – run your agent with thinking visible. Reading the thinking output reveals what the model understood about the task and why it chose each tool.
  • Measure latency impact – log response time with and without extended thinking on the same tasks. Quantify the tradeoff for your use case before deploying at scale.
  • Start with budget 4000-8000 – this range gives substantially improved reasoning for most tasks without the extreme latency of budgets above 15,000.
  • Use claude-3-5-sonnet for anything where speed matters more than marginal accuracy – 3.5 Sonnet without thinking is typically faster and cheaper, and most simple tasks do not benefit from extended reasoning.
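For the latency measurement, a minimal harness is enough. timeIt is an illustrative helper, and the commented runners are stand-ins for your own functions:

```javascript
// Sketch: time the same task with and without extended thinking (fn is any async call)
async function timeIt(label, fn) {
  const t0 = Date.now();
  const result = await fn();
  console.error(`[latency] ${label}: ${Date.now() - t0}ms`);
  return result;
}

// Usage (hypothetical runners):
// await timeIt('no-thinking', () => runWithClaude(task));
// await timeIt('thinking-8k', () => runWithExtendedThinking(task, 8000));
```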

nJoy πŸ˜‰

Lesson 21 of 55: Claude 3.5 and 3.7 + MCP – Native Tool Calling Patterns

Claude’s tool use is the cleanest tool calling implementation among the major LLM providers. The API is symmetric: you send tools in the request, Claude returns tool_use blocks when it wants to call something, you run the tools, and you send back tool_result blocks. No function/tool naming confusion, no finish_reason gotchas – just a clear, typed message structure. This lesson builds the Claude + MCP integration from scratch, comparing it to the OpenAI pattern where they differ.

Claude tool use message format diagram showing tool-use block and tool-result block on dark background
Claude’s tool calling: tool_use blocks in assistant messages, tool_result blocks in user messages.

Claude Tool Use Format

Claude’s tool use has a fundamentally different message structure from OpenAI’s. The key difference: tool results go in a user message (not a separate role), nested inside a tool_result content block that references the tool use ID. This is more structured and less ambiguous than OpenAI’s approach.

// Claude tool calling message flow:

// 1. Request: tools defined, user message sent
// 2. Response: Claude returns tool_use block(s)
{
  role: 'assistant',
  content: [
    { type: 'text', text: 'Let me search for that.' },
    {
      type: 'tool_use',
      id: 'toolu_01XY',
      name: 'search_products',
      input: { query: 'wireless headphones', limit: 5 }
    }
  ]
}

// 3. You execute the tool through MCP
// 4. Send result back in a user message with tool_result block
{
  role: 'user',
  content: [{
    type: 'tool_result',
    tool_use_id: 'toolu_01XY',
    content: [{ type: 'text', text: 'Found 5 products: ...' }]
  }]
}

The Complete Claude + MCP Integration

import Anthropic from '@anthropic-ai/sdk';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const mcpClient = new Client({ name: 'claude-host', version: '1.0.0' }, { capabilities: {} });
await mcpClient.connect(new StdioClientTransport({ command: 'node', args: ['server.js'] }));

const { tools: mcpTools } = await mcpClient.listTools();

// Convert MCP tools to Anthropic format
const claudeTools = mcpTools.map(t => ({
  name: t.name,
  description: t.description,
  input_schema: t.inputSchema,  // Note: input_schema, not parameters
}));

async function runWithClaude(userMessage) {
  const messages = [{ role: 'user', content: userMessage }];

  while (true) {
    const response = await anthropic.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 4096,
      tools: claudeTools,
      messages,
    });

    // Append Claude's response to messages
    messages.push({ role: 'assistant', content: response.content });

    // If Claude stopped due to tool use, execute tools
    if (response.stop_reason === 'tool_use') {
      const toolUseBlocks = response.content.filter(b => b.type === 'tool_use');
      const toolResults = await Promise.all(
        toolUseBlocks.map(async (toolUse) => {
          console.error(`[tool] Calling: ${toolUse.name}`, toolUse.input);
          const result = await mcpClient.callTool({
            name: toolUse.name,
            arguments: toolUse.input,
          });

          return {
            type: 'tool_result',
            tool_use_id: toolUse.id,
            content: result.content,  // MCP text blocks pass through directly; image blocks need mapping to Claude's image format
          };
        })
      );

      // Tool results go in a user message
      messages.push({ role: 'user', content: toolResults });

    } else {
      // end_turn or other stop reason - extract final text
      const finalText = response.content
        .filter(b => b.type === 'text')
        .map(b => b.text)
        .join('');

      return finalText;
    }
  }
}

const result = await runWithClaude('Compare the best noise-cancelling headphones under $300');
console.log(result);
await mcpClient.close();
Side by side comparison of Claude tool use format versus OpenAI function calling format showing differences dark
Claude vs OpenAI tool calling: the message structure differs but the underlying logic is the same.

Claude 3.5 vs 3.7 Sonnet for Tool Use

Claude 3.5 Sonnet (20241022) is the current production choice for tool-heavy workloads: fast, reliable tool calls, good at following tool descriptions, and competitive pricing. Claude 3.7 Sonnet adds extended thinking (covered in the extended thinking lesson) and improved reasoning for complex multi-step tool chains, at higher latency and cost.

// For fast, reliable tool calling:
model: 'claude-3-5-sonnet-20241022'

// For complex reasoning + tool use:
model: 'claude-3-7-sonnet-20250219'  // Includes extended thinking

// Haiku for high-volume, simple tool tasks:
model: 'claude-3-5-haiku-20241022'
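Those three tiers can be folded into a routing helper. The flags and defaults here are assumptions, not an Anthropic recommendation:

```javascript
// Sketch: pick a Claude model by task profile (flags are illustrative)
function pickClaudeModel({ needsDeepReasoning = false, highVolume = false } = {}) {
  if (needsDeepReasoning) return 'claude-3-7-sonnet-20250219'; // extended thinking
  if (highVolume) return 'claude-3-5-haiku-20241022';          // cheap, fast, simple tools
  return 'claude-3-5-sonnet-20241022';                         // default for tool use
}
```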

“Claude is trained to use tools in the same way that humans do: by processing what it’s seen before and uses this context to craft appropriate tool calls or final responses. Tool use enables Claude to interact with external services and APIs in a structured way.” – Anthropic Documentation, Tool Use

Key Differences from OpenAI

Aspect | Claude (Anthropic) | GPT (OpenAI)
Tool result role | user | tool
Schema field | input_schema | parameters
Tool call detection | stop_reason === 'tool_use' | finish_reason === 'tool_calls'
Multiple tools | all results in one user message | each result is a separate tool message
Tool call args | toolUse.input (already parsed) | JSON.parse(toolCall.function.arguments)
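The multiple-tools row is the difference that bites most often when porting code. A sketch of the two shapes, using simplified stand-in result objects:

```javascript
// Sketch: bundle tool results per provider convention (input objects are simplified)
function toClaudeToolResults(results) {
  // Claude: ONE user message containing all tool_result blocks
  return {
    role: 'user',
    content: results.map(r => ({
      type: 'tool_result',
      tool_use_id: r.id,
      content: [{ type: 'text', text: r.text }],
    })),
  };
}

function toOpenAIToolResults(results) {
  // OpenAI: one `tool` message per result
  return results.map(r => ({ role: 'tool', tool_call_id: r.id, content: r.text }));
}
```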

Failure Modes with Claude Tool Use

Case 1: Putting Tool Results in an Assistant Message

// WRONG: Tool results in wrong role
messages.push({ role: 'assistant', content: toolResults }); // API error

// CORRECT: Tool results go in user role
messages.push({ role: 'user', content: toolResults });

Case 2: Forgetting that Claude’s input Is Already Parsed JSON

// WRONG: Trying to JSON.parse Claude's tool input
const args = JSON.parse(toolUse.input); // Error: toolUse.input is already an object

// CORRECT: Use directly - Claude's SDK already parses it
const args = toolUse.input; // Already an object like { query: "...", limit: 5 }
await mcpClient.callTool({ name: toolUse.name, arguments: args });

What to Check Right Now

  • Test with a multi-tool Claude response – ask a question that forces 2-3 tool calls in one response. Verify all tool use blocks are collected and all results are bundled into one user message.
  • Verify input_schema not parameters – this is the single most common copy-paste error when moving from OpenAI to Claude code. Search your code for parameters in Claude tool definitions.
  • Handle vision content in tool results – Claude can process image content blocks in tool results. If your MCP tools return images (base64), pass them through as { type: 'image', source: ... } in the tool result content array.
  • Set a system prompt – Claude responds well to clear system prompts. Define the assistant’s persona, task scope, and output format at the system level.
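For the vision bullet, a sketch of mapping MCP content blocks into a Claude tool_result content array (MCP image blocks carry base64 data plus mimeType; the function name is illustrative):

```javascript
// Sketch: map MCP content blocks to Claude-compatible tool_result content
function mcpToClaudeContent(blocks) {
  return blocks.map(b => {
    if (b.type === 'image') {
      // MCP image block: { type: 'image', data: <base64>, mimeType: 'image/png' }
      return { type: 'image', source: { type: 'base64', media_type: b.mimeType, data: b.data } };
    }
    return { type: 'text', text: b.text ?? '' };
  });
}
```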

nJoy πŸ˜‰

Lesson 20 of 55: Building a Production OpenAI Client for MCP Tool Loops

The gap between “demo that calls a tool” and “production client that handles 10,000 daily users” is everything we have not talked about yet: connection pooling, retry logic, cost control, token budget management, error classification, telemetry, and graceful degradation. This lesson builds a production-grade OpenAI MCP client library from scratch – the kind you would actually deploy in a company. Every pattern here comes from real production failure modes.

Production OpenAI MCP client architecture diagram showing connection pool retry logic telemetry dark
A production MCP client: connection management, retry, budget control, and telemetry baked in.

The Production Client Library

// mcp-openai-client.js - Production-grade MCP + OpenAI client

import OpenAI from 'openai';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const DEFAULT_CONFIG = {
  model: 'gpt-4o',
  maxTokens: 4096,
  maxIterations: 15,
  temperature: 0.1,
  retries: 3,
  retryDelay: 1000,  // ms
  budgetUSD: 0.50,   // Max cost per conversation
  timeoutMs: 120_000, // 2 minute timeout per conversation
};

// Token cost estimates (USD per 1M tokens, approximate)
const MODEL_COSTS = {
  'gpt-4o': { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
  'o3': { input: 15.00, output: 60.00 },
  'o3-mini': { input: 1.10, output: 4.40 },
};

export class McpOpenAIClient {
  constructor(mcpServerConfig, options = {}) {
    this.config = { ...DEFAULT_CONFIG, ...options };
    this.openai = new OpenAI({ apiKey: options.apiKey || process.env.OPENAI_API_KEY });
    this.mcpServerConfig = mcpServerConfig;
    this.mcpClient = null;
    this.tools = [];
    this.totalCostUSD = 0;
  }

  async connect() {
    this.mcpClient = new Client(
      { name: 'production-host', version: '1.0.0' },
      { capabilities: {} }
    );

    const transport = new StdioClientTransport(this.mcpServerConfig);
    await this.mcpClient.connect(transport);

    const { tools } = await this.mcpClient.listTools();
    this.tools = tools.map(t => ({
      type: 'function',
      function: { name: t.name, description: t.description, parameters: t.inputSchema },
    }));

    console.error(`[mcp] Connected - ${this.tools.length} tools available`);
  }

  async disconnect() {
    await this.mcpClient?.close();
  }

  estimateCostUSD(inputTokens, outputTokens, model) {
    const costs = MODEL_COSTS[model] || MODEL_COSTS['gpt-4o'];
    return (inputTokens / 1_000_000) * costs.input + (outputTokens / 1_000_000) * costs.output;
  }

  async executeWithRetry(fn, maxRetries = this.config.retries) {
    let lastError;
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        return await fn();
      } catch (err) {
        lastError = err;
        const isRetryable = err.status === 429 || err.status === 500 || err.status === 502 || err.status === 503;
        if (!isRetryable || attempt === maxRetries) throw err;

        const delay = this.config.retryDelay * Math.pow(2, attempt - 1); // Exponential backoff
        console.error(`[openai] Attempt ${attempt} failed: ${err.message}. Retrying in ${delay}ms`);
        await new Promise(r => setTimeout(r, delay));
      }
    }
    throw lastError;
  }

  async run(userMessage, systemPrompt = null) {
    const startTime = Date.now();
    const messages = [];

    if (systemPrompt) messages.push({ role: 'system', content: systemPrompt });
    messages.push({ role: 'user', content: userMessage });

    let iteration = 0;
    let totalInputTokens = 0;
    let totalOutputTokens = 0;

    while (true) {
      if (++iteration > this.config.maxIterations) {
        throw new Error(`Exceeded max iterations (${this.config.maxIterations})`);
      }

      if (Date.now() - startTime > this.config.timeoutMs) {
        throw new Error(`Conversation timeout after ${this.config.timeoutMs}ms`);
      }

      if (this.totalCostUSD > this.config.budgetUSD) {
        throw new Error(`Budget exceeded: $${this.totalCostUSD.toFixed(4)} > $${this.config.budgetUSD}`);
      }

      const response = await this.executeWithRetry(() =>
        this.openai.chat.completions.create({
          model: this.config.model,
          messages,
          tools: this.tools.length > 0 ? this.tools : undefined,
          max_tokens: this.config.maxTokens,
          temperature: this.config.temperature,
        })
      );

      const usage = response.usage;
      totalInputTokens += usage?.prompt_tokens || 0;
      totalOutputTokens += usage?.completion_tokens || 0;
      const turnCost = this.estimateCostUSD(
        usage?.prompt_tokens || 0,
        usage?.completion_tokens || 0,
        this.config.model
      );
      this.totalCostUSD += turnCost;

      const choice = response.choices[0];
      const message = choice.message;
      messages.push(message);

      if (choice.finish_reason !== 'tool_calls') {
        const elapsedMs = Date.now() - startTime;
        console.error(`[stats] iterations=${iteration} tokens=${totalInputTokens}+${totalOutputTokens} cost=$${this.totalCostUSD.toFixed(4)} elapsed=${elapsedMs}ms`);
        return {
          content: message.content,
          iterations: iteration,
          totalCostUSD: this.totalCostUSD,
          tokens: { input: totalInputTokens, output: totalOutputTokens },
          elapsedMs,
        };
      }

      // Execute tool calls
      const toolResults = await Promise.all(
        message.tool_calls.map(async (tc) => {
          let args;
          try {
            args = JSON.parse(tc.function.arguments);
          } catch {
            return { role: 'tool', tool_call_id: tc.id, content: 'Error: Invalid tool arguments JSON' };
          }

          try {
            const result = await this.mcpClient.callTool({ name: tc.function.name, arguments: args });
            const text = result.content.filter(c => c.type === 'text').map(c => c.text).join('\n');
            const errorFlag = result.isError ? '[TOOL ERROR] ' : '';
            return { role: 'tool', tool_call_id: tc.id, content: errorFlag + text };
          } catch (err) {
            console.error(`[tool] ${tc.function.name} error: ${err.message}`);
            return { role: 'tool', tool_call_id: tc.id, content: `Tool execution failed: ${err.message}` };
          }
        })
      );

      messages.push(...toolResults);
    }
  }
}
OpenAI cost tracking dashboard showing per-model token costs budget control and usage metrics dark
Cost tracking: estimate cost per turn, accumulate per conversation, enforce budget limits before they hit your bill.

Usage Pattern

import { McpOpenAIClient } from './mcp-openai-client.js';

const client = new McpOpenAIClient(
  { command: 'node', args: ['server.js'] },
  {
    model: 'gpt-4o-mini',
    budgetUSD: 0.10,
    maxIterations: 8,
    timeoutMs: 60_000,
  }
);

await client.connect();

const result = await client.run(
  'Find me a good Python book for beginners under $40',
  'You are a helpful book recommendation assistant.'
);

console.log('Answer:', result.content);
console.log('Cost:', `$${result.totalCostUSD.toFixed(4)}`);
console.log('Iterations:', result.iterations);

await client.disconnect();

“For production deployments, implement exponential backoff for rate limit errors (429). The OpenAI API will return Retry-After headers for rate limits – respect these values.” – OpenAI Documentation, Error Codes

Failure Modes in Production

Case 1: No Budget Control

// A single misbehaving agent with no budget cap can cost hundreds of dollars
// Always set a budgetUSD limit per conversation
// Always set a maxIterations limit per conversation
// Log and alert when conversations exceed 80% of budget
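One way to wire up the 80% alert from the note above. checkBudget and its onAlert hook are illustrative names, not part of the client class:

```javascript
// Sketch: enforce the hard cap and alert at 80% of budget
function checkBudget(totalCostUSD, budgetUSD, onAlert = msg => console.error(msg)) {
  if (totalCostUSD > budgetUSD) {
    throw new Error(`Budget exceeded: $${totalCostUSD.toFixed(4)} > $${budgetUSD}`);
  }
  if (totalCostUSD > budgetUSD * 0.8) {
    onAlert(`[budget] ${((totalCostUSD / budgetUSD) * 100).toFixed(0)}% of $${budgetUSD} used`);
  }
}
```

Call it at the top of each loop iteration, in place of the bare budget check in run().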

Case 2: Catching All Errors and Retrying Blindly

// Some errors should NOT be retried - e.g. 400 Bad Request (invalid schema)
// 429 = retry (rate limit)
// 500/503 = retry (server error)
// 400 = do NOT retry (your code is wrong)
// 401/403 = do NOT retry (authentication issue)
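The same classification, sketched as helpers with full jitter and Retry-After support. The status codes come from the rules above; treat the exact parameter names as assumptions:

```javascript
// Sketch: classify retryable statuses and compute a jittered backoff delay
function isRetryable(status) {
  return status === 429 || status === 500 || status === 502 || status === 503;
}

function backoffDelayMs(attempt, { baseMs = 1000, capMs = 60_000, retryAfterSec = null } = {}) {
  if (retryAfterSec != null) return retryAfterSec * 1000; // Respect Retry-After on 429s
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * ceiling); // Full jitter avoids thundering herds
}
```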

What to Check Right Now

  • Set per-conversation budgets – $0.10 is a reasonable starting point for most workflows. Adjust based on your model and expected tool call count.
  • Implement exponential backoff – the pattern shown above (doubling delay on each retry) is the industry standard. Start at 1000ms, cap at 60000ms.
  • Log every tool call – production debugging without tool call logs is nearly impossible. Log tool name, arguments, result length, and execution time for every call.
  • Monitor iteration counts – if average iterations are above 8, your tool descriptions or system prompt may be unclear. Investigate and improve before scaling.
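The iteration-count check can be automated with a small monitor. The class name and threshold default are illustrative:

```javascript
// Sketch: rolling average of iterations per conversation, alerting past a threshold
class IterationMonitor {
  constructor(threshold = 8) {
    this.threshold = threshold;
    this.counts = [];
  }
  record(iterations) {
    this.counts.push(iterations);
    const avg = this.counts.reduce((a, b) => a + b, 0) / this.counts.length;
    if (avg > this.threshold) {
      console.error(`[monitor] avg iterations ${avg.toFixed(1)} > ${this.threshold} - review tool descriptions`);
    }
    return avg;
  }
}
```

Feed it result.iterations from each run() call and watch the average rather than single outliers.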

nJoy πŸ˜‰

Lesson 19 of 55: OpenAI Responses API and Agents SDK With MCP

OpenAI released the Responses API and the Agents SDK as a unified approach to building agentic workflows. These are not just new API endpoints – they represent OpenAI’s opinionated view of how production agents should be structured. The Responses API replaces the Chat Completions API for agentic use cases. The Agents SDK wraps it with built-in MCP support, tool orchestration, and a pipeline abstraction that handles the looping automatically. This lesson shows you both layers and where MCP plugs in.

OpenAI Responses API and Agents SDK architecture diagram with MCP tool integration dark
The Agents SDK wraps the Responses API with built-in MCP support and automatic tool orchestration.

The Responses API

The Responses API (openai.responses.create()) is designed for stateful, multi-turn agentic sessions. Unlike Chat Completions, which requires you to manage conversation history manually, the Responses API maintains state server-side via a response ID. You reference previous responses by ID, and the API handles context management, including tool call history.

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// First turn - creates a new response
const response = await openai.responses.create({
  model: 'gpt-4o',
  input: 'Search for the best laptops under $1000',
  tools: openAITools,  // Note: Responses API function tools are flattened ({ type: 'function', name, description, parameters }), not nested under a function key as in Chat Completions
});

const responseId = response.id;  // Save this for continuations

// Continue the conversation using the response ID (no need to re-send history)
const followUp = await openai.responses.create({
  model: 'gpt-4o',
  input: 'Now filter to only Dell and Lenovo models',
  previous_response_id: responseId,  // References prior context
  tools: openAITools,
});

“The Responses API is designed specifically for agentic workflows. It maintains conversation state server-side, supports native tool execution, and provides a unified interface for building multi-step AI tasks.” – OpenAI API Reference, Responses

The Agents SDK with MCP

The OpenAI Agents SDK (@openai/agents) provides a higher-level abstraction with native MCP support. Instead of writing the tool calling loop yourself, the SDK handles it automatically. You define an agent with tools and an instruction, and the SDK orchestrates the full pipeline.

import { Agent, run, MCPServerStdio } from '@openai/agents';

// Connect to your MCP server via the SDK's native MCP support
const mcpServer = new MCPServerStdio({
  name: 'my-tools',
  fullCommand: 'node ./my-mcp-server.js',
});

await mcpServer.connect();

// Create an agent with the MCP server's tools
const agent = new Agent({
  name: 'Research Assistant',
  instructions: `You are a research assistant with access to product search and comparison tools.
    Always search for at least 3 options before recommending.
    Format your final recommendation as a clear list with prices.`,
  mcpServers: [mcpServer],  // The SDK discovers and wires up the server's tools automatically
  model: 'gpt-4o',
});

// Run the agent - the SDK handles the tool calling loop
const result = await run(agent, 'Find the best wireless headphones under $200');
console.log('Final answer:', result.finalOutput);

// Clean up
await mcpServer.close();
OpenAI Agents SDK pipeline diagram showing Agent definition running with tools and MCP server integration dark
The Agents SDK pipeline: define agent + tools, run with input, SDK handles orchestration automatically.

Handoffs: Multi-Agent Patterns with the SDK

import { Agent, run, handoff } from '@openai/agents';

const searchAgent = new Agent({
  name: 'Search Specialist',
  instructions: 'You specialise in searching and retrieving product data.',
  tools: searchMcpTools,
  model: 'gpt-4o-mini',  // Cheaper model for search
});

const analysisAgent = new Agent({
  name: 'Analysis Specialist',
  instructions: 'You specialise in comparing and recommending products based on data.',
  tools: analysisMcpTools,
  model: 'gpt-4o',        // Smarter model for complex reasoning
  handoffs: [handoff(searchAgent, { toolDescriptionOverride: 'Use the search specialist when you need more data' })],
});

const result = await run(analysisAgent, 'Compare the top 5 gaming laptops');
console.log(result.finalOutput);

Failure Modes with the Responses API and Agents SDK

Case 1: Not Handling Tool Call Errors in the Responses API

// The Responses API may return partial results if a tool fails
// Always check response.status and handle incomplete states
const response = await openai.responses.create({ ... });

if (response.status === 'incomplete') {
  console.error('Response incomplete:', response.incomplete_details);
  // Handle: retry, use partial output, or escalate
}

Case 2: State Leakage Between Responses API Sessions

// previous_response_id links responses in a chain
// If you reuse an ID from a different user's session, state leaks
// Always scope response IDs to the authenticated user's session store

const userSession = sessions.get(userId);
const response = await openai.responses.create({
  previous_response_id: userSession.lastResponseId || undefined,
  ...
});
userSession.lastResponseId = response.id;

What to Check Right Now

  • Try the Agents SDK first – if you are building a new agent, start with the Agents SDK. The automatic tool loop saves significant boilerplate.
  • Use the Responses API for long sessions – for multi-turn conversations with many tool calls, the Responses API’s server-side state management avoids sending large context windows repeatedly.
  • Test handoff behaviour – if using multi-agent handoffs, test the edge case where the receiving agent decides it does not need to hand off again and loops back incorrectly.
  • Check the Agents SDK version – the SDK is actively developed. Pin the version in package.json and read the changelog when upgrading: npm install @openai/agents.

nJoy πŸ˜‰