Scaling MCP

A single MCP server instance handling one user works fine. The same server handling 500 concurrent users during peak hours is a different problem entirely. This lesson covers the four levers for scaling MCP infrastructure: horizontal scaling with session affinity, rate limiting that protects both your server and upstream LLM APIs, response caching for expensive tool calls, and load balancing configurations that handle MCP’s stateful session requirements correctly.

Horizontal MCP scaling requires session affinity, a shared session store, and a rate limiter at the gateway.

Horizontal Scaling with Shared Session State

MCP Streamable HTTP sessions are stateful. If a client’s POST goes to server A but the next SSE connection goes to server B, the session state is lost. Two solutions:

Option 1: Sticky sessions (simpler) – Configure your load balancer to route all requests from the same client to the same server instance. Works but creates uneven load distribution.

Option 2: Shared session store (recommended) – Store session state in Redis and allow any server instance to handle any request.

import crypto from 'node:crypto';
import express from 'express';
import { createClient } from 'redis';
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamable-http.js';

const app = express();
app.use(express.json());

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

// Local in-memory map of active transports, keyed by session ID.
// Redis tracks which sessions exist; the transport objects themselves
// are not serializable, so each lives on the instance that created it.
const sessions = new Map();

app.post('/mcp', async (req, res) => {
  const sessionId = req.headers['mcp-session-id'];

  // Try the local cache first, then Redis
  let transport = sessions.get(sessionId);
  if (!transport && sessionId) {
    const stored = await redis.get(`mcp:session:${sessionId}:active`);
    if (stored) {
      // The session exists but lives on another instance. Transport state is
      // not practically serializable, so return 404 and let the client
      // re-initialize - or use sticky sessions at the LB to avoid this path.
      res.status(404).json({ error: 'session_not_found' });
      return;
    }
  }

  if (!transport) {
    transport = new StreamableHTTPServerTransport({
      sessionIdGenerator: () => crypto.randomUUID(),
      onsessioninitialized: async (sid) => {
        sessions.set(sid, transport);
        // Mark session as active in Redis (for health tracking)
        await redis.setEx(`mcp:session:${sid}:active`, 3600, '1');
      },
    });
    const server = buildMcpServer();
    await server.connect(transport);
  }

  await transport.handleRequest(req, res, req.body);
});
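One gap in the handler above: entries in the `sessions` Map are never removed, so a long-running instance slowly leaks closed transports. A small sketch of the cleanup (`trackSessionLifecycle` is a hypothetical helper, assuming the `sessions` Map and `redis` client from the snippet above):

```javascript
// Hypothetical helper: delete a session from the local Map and from Redis
// when its transport closes, so neither grows without bound.
function trackSessionLifecycle(transport, sid, sessions, redis) {
  transport.onclose = () => {
    sessions.delete(sid);
    // Fire-and-forget: the Redis key also carries a TTL as a safety net
    redis.del(`mcp:session:${sid}:active`).catch(() => {});
  };
}
```

Call it from `onsessioninitialized`, right after `sessions.set(sid, transport)`.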

Rate Limiting at the Gateway

import { RateLimiterRedis } from 'rate-limiter-flexible';

// Per-user rate limit: 60 requests per minute
const rateLimiter = new RateLimiterRedis({
  storeClient: redis,
  keyPrefix: 'mcp-rl',
  points: 60,        // Number of requests
  duration: 60,      // Per 60 seconds
  blockDuration: 60, // Block for 60 seconds after limit hit
});

// Per-IP rate limit for unauthenticated paths
const ipRateLimiter = new RateLimiterRedis({
  storeClient: redis,
  keyPrefix: 'mcp-ip-rl',
  points: 100,
  duration: 60,
});

async function rateLimit(req, res, next) {
  const authed = Boolean(req.auth?.sub);
  const limiter = authed ? rateLimiter : ipRateLimiter;
  const key = authed ? req.auth.sub : req.ip;
  try {
    await limiter.consume(key);
    next();
  } catch (rl) {
    // rate-limiter-flexible rejects with a plain Error on store failure,
    // and with a RateLimiterRes (msBeforeNext, etc.) on an actual breach
    if (rl instanceof Error) return next(rl);
    const retryAfter = Math.ceil(rl.msBeforeNext / 1000);
    res.setHeader('Retry-After', String(retryAfter));
    res.setHeader('X-RateLimit-Limit', String(authed ? 60 : 100));
    res.setHeader('X-RateLimit-Remaining', '0');
    res.status(429).json({ error: 'too_many_requests', retryAfter });
  }
}

app.use('/mcp', rateLimit);
Redis-backed sliding-window rate limiter: 60 req/min per user, returning 429 with a Retry-After header on breach.
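On the client side, a 429 with Retry-After should trigger a pause, not a failure. A minimal sketch (`fetchWithRetry` is a hypothetical helper; `doFetch` stands in for whatever function issues the MCP request):

```javascript
// Hypothetical retry wrapper: re-issues a request that was rejected with 429,
// sleeping for the number of seconds the gateway put in Retry-After.
async function fetchWithRetry(doFetch, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    const res = await doFetch();
    if (res.status !== 429 || attempt >= maxRetries) return res;
    const waitSec = Number(res.headers['retry-after'] ?? 1);
    await new Promise((resolve) => setTimeout(resolve, waitSec * 1000));
  }
}
```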

Tool Result Caching

// Cache expensive or read-heavy tool call results in Redis
class ToolResultCache {
  #redis;
  #ttls;

  constructor(redis, ttls = {}) {
    this.#redis = redis;
    this.#ttls = {
      // Default TTLs per tool (seconds)
      get_product: 300,      // 5 min - product data changes rarely
      search_products: 60,   // 1 min - search results change more
      get_inventory: 10,     // 10 sec - inventory changes frequently
      get_user: 600,         // 10 min - user profile rarely changes
      ...ttls,
    };
  }

  key(toolName, args) {
    return `mcp:tool:${toolName}:${JSON.stringify(args)}`;
  }

  async get(toolName, args) {
    const cached = await this.#redis.get(this.key(toolName, args));
    return cached ? JSON.parse(cached) : null;
  }

  async set(toolName, args, result) {
    const ttl = this.#ttls[toolName];
    if (!ttl) return;  // Don't cache if no TTL defined
    await this.#redis.setEx(this.key(toolName, args), ttl, JSON.stringify(result));
  }

  async invalidate(pattern) {
    // KEYS blocks Redis while it scans; fine for small keyspaces, but prefer
    // SCAN (scanIterator in node-redis) at scale.
    const keys = await this.#redis.keys(`mcp:tool:${pattern}:*`);
    if (keys.length) await this.#redis.del(keys);
  }
}

const toolCache = new ToolResultCache(redis);

// Wrap MCP callTool with caching
async function callToolWithCache(mcp, name, args) {
  const cached = await toolCache.get(name, args);
  if (cached) {
    return cached;
  }
  const result = await mcp.callTool({ name, arguments: args });
  await toolCache.set(name, args, result);
  return result;
}
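One caveat with the `key()` method above: `JSON.stringify` is sensitive to property order, so `{ id: 1, locale: 'en' }` and `{ locale: 'en', id: 1 }` would cache the same logical call under two different keys. A stable variant (`stableKey` is a sketch, not part of the class above) sorts the argument keys first:

```javascript
// Sketch: order-independent cache key. Sorting the top-level keys before
// serializing means logically identical calls share one cache entry.
function stableKey(toolName, args) {
  const sorted = Object.keys(args).sort().reduce((acc, k) => {
    acc[k] = args[k];
    return acc;
  }, {});
  return `mcp:tool:${toolName}:${JSON.stringify(sorted)}`;
}
```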

Nginx Load Balancer Config with Sticky Sessions

upstream mcp_servers {
    ip_hash;  # Sticky sessions by client IP
    server mcp-server-1:3000;
    server mcp-server-2:3000;
    server mcp-server-3:3000;
    keepalive 64;
}

server {
    listen 443 ssl;
    server_name mcp.yourcompany.com;

    # SSE requires long-lived connections - increase timeouts
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
    proxy_connect_timeout 10s;

    # Required for SSE streaming
    proxy_buffering off;
    proxy_cache off;
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    chunked_transfer_encoding on;

    location /mcp {
        proxy_pass http://mcp_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location /health {
        proxy_pass http://mcp_servers;
        access_log off;
    }
}
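The `/health` location above assumes each Node instance exposes a health endpoint. A sketch of the logic (`healthStatus` is hypothetical; pass it a function that pings Redis, e.g. `() => redis.ping()`):

```javascript
// Hypothetical health check: the instance is healthy only if Redis answers,
// since sessions and rate limiting both depend on it. Returns a status object
// that an Express handler can translate into an HTTP response.
async function healthStatus(pingRedis, activeSessions) {
  try {
    await pingRedis();
    return { code: 200, body: { status: 'ok', sessions: activeSessions } };
  } catch {
    return { code: 503, body: { status: 'degraded', reason: 'redis_unreachable' } };
  }
}
```

Wired to Express this becomes `app.get('/health', async (req, res) => { const h = await healthStatus(() => redis.ping(), sessions.size); res.status(h.code).json(h.body); });` so the load balancer can drop instances that lose Redis.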

Scaling Decision Guide

  • Under 10 concurrent users: Single instance, no load balancer needed
  • 10-100 concurrent users: 2-3 instances with sticky sessions, Redis for rate limiting
  • 100-1000 concurrent users: 5-10 instances, Redis session store, tool result caching, dedicated rate limiting layer
  • 1000+ concurrent users: Kubernetes with horizontal pod autoscaling, Redis Cluster, API Gateway (Kong, APISIX) for rate limiting and auth

nJoy 😉
