Lesson 9 of 55: MCP Sampling - Server-Initiated LLM Calls and Recursive AI

Here is the mind-bending part of MCP: servers can ask the LLM for help. In the standard model, the flow is one-way – host calls LLM, LLM calls tool, tool runs on server, result goes back. Sampling reverses one arrow. It lets a server, while handling a request, ask the host’s LLM to generate text – and then use that generated text in its response. This is recursive AI, and it is what enables genuinely intelligent MCP servers that reason about their own actions.

[Figure: Sampling flow. The server requests an LLM inference from the client, enabling server-side reasoning loops.]

The Sampling Flow

Sampling works as follows: a server handling a tool call decides it needs to “think” before it can respond. It sends a sampling/createMessage request to the client. The client receives this, shows the pending sampling request to the user (or approves it automatically based on policy), then calls the actual LLM API, and returns the result to the server. The server uses the result to complete its work and returns the final tool response to the original caller.

The critical point: the server does not know which LLM the client is using. It just asks for “a language model response” and gets back generated text. This maintains provider-agnosticism even for server-side reasoning.
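On the wire, this exchange is an ordinary JSON-RPC request and response travelling in the reverse direction. The sketch below models the two messages as plain TypeScript objects. The field names follow the snippets in this lesson; the id value, the prompt text, and the 'my-model' string are made-up placeholders, and the types are trimmed to the fields used here rather than copied from the SDK.

```typescript
// Shapes of the sampling exchange, reduced to the fields used in this lesson.
type TextContent = { type: 'text'; text: string };

interface SamplingRequest {
  jsonrpc: '2.0';
  id: number;
  method: 'sampling/createMessage';
  params: {
    messages: { role: 'user' | 'assistant'; content: TextContent }[];
    maxTokens: number;
    temperature?: number;
  };
}

interface SamplingResponse {
  jsonrpc: '2.0';
  id: number; // echoes the request id
  result: {
    role: 'assistant';
    content: TextContent;
    model: string; // reported by the client; the server does not pick it
    stopReason?: string;
  };
}

// Server -> client: "please generate text for me"
const request: SamplingRequest = {
  jsonrpc: '2.0',
  id: 7, // placeholder id
  method: 'sampling/createMessage',
  params: {
    messages: [{ role: 'user', content: { type: 'text', text: 'Classify: "red shoes"' } }],
    maxTokens: 10,
  },
};

// Client -> server: the generated text, whichever provider produced it
const response: SamplingResponse = {
  jsonrpc: '2.0',
  id: request.id,
  result: {
    role: 'assistant',
    content: { type: 'text', text: 'products' },
    model: 'my-model', // placeholder name
    stopReason: 'endTurn',
  },
};
```

Nothing in the request names a provider, which is the provider-agnosticism described above: the only place a model name appears is in the response.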

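// Client configuration to enable sampling
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { CreateMessageRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import OpenAI from 'openai';

const client = new Client(
  { name: 'my-host', version: '1.0.0' },
  {
    capabilities: {
      sampling: {},  // Must declare this to receive sampling requests from servers
    },
  }
);

// Create the provider client once and reuse it (reads OPENAI_API_KEY from the environment)
const openai = new OpenAI();

// Client must handle incoming sampling requests
client.setRequestHandler(CreateMessageRequestSchema, async (request) => {
  const { messages, maxTokens, temperature } = request.params;

  // Here the host calls its actual LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: messages.map(m => ({
      role: m.role,
      // This example forwards text content only; other content types are sent as empty strings
      content: m.content.type === 'text' ? m.content.text : '',
    })),
    max_tokens: maxTokens ?? 1000,
    temperature: temperature ?? 0.7,  // ?? (not ||) so an explicit temperature of 0 is respected
  });

  const choice = response.choices[0];
  return {
    role: 'assistant',
    content: { type: 'text', text: choice.message.content ?? '' },
    model: response.model,  // Report the model that actually answered
    stopReason: choice.finish_reason === 'length' ? 'maxTokens' : 'endTurn',
  };
});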

Why this matters: without capabilities.sampling the server cannot request completions at all, and without a handler every sampling call fails the tool mid-flight. In a real project you would centralise LLM calls here so quotas, logging, and redaction policies stay in one place on the host.
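One way to keep those host-side policies in one place is a small pure function that runs before any LLM call. The sketch below is an assumption about how a host might do it, not part of the MCP SDK: the HostPolicy shape, the applyPolicy name, and the thresholds are all hypothetical. It caps maxTokens, redacts sensitive patterns, and decides whether the request can be auto-approved or must be shown to the user.

```typescript
// Hypothetical host-side policy applied before any LLM call is made.
interface SamplingParams {
  messages: { role: 'user' | 'assistant'; content: { type: 'text'; text: string } }[];
  maxTokens: number;
}

interface HostPolicy {
  maxTokensCap: number;          // hard ceiling regardless of what the server asks for
  autoApproveUpToTokens: number; // larger requests need explicit user approval
  redact: RegExp[];              // patterns stripped before text leaves the host
}

interface Decision {
  action: 'auto-approve' | 'ask-user';
  params: SamplingParams;
}

function applyPolicy(params: SamplingParams, policy: HostPolicy): Decision {
  const sanitized: SamplingParams = {
    maxTokens: Math.min(params.maxTokens, policy.maxTokensCap),
    messages: params.messages.map(m => ({
      role: m.role,
      content: {
        type: 'text' as const,
        // Apply every redaction pattern in turn
        text: policy.redact.reduce((t, re) => t.replace(re, '[REDACTED]'), m.content.text),
      },
    })),
  };
  const action = sanitized.maxTokens <= policy.autoApproveUpToTokens ? 'auto-approve' : 'ask-user';
  return { action, params: sanitized };
}

// Usage: an oversized request containing a 16-digit number
const decision = applyPolicy(
  {
    messages: [{ role: 'user', content: { type: 'text', text: 'Card 4111111111111111 declined' } }],
    maxTokens: 5000,
  },
  { maxTokensCap: 1000, autoApproveUpToTokens: 200, redact: [/\b\d{16}\b/g] },
);
// decision.action === 'ask-user', maxTokens is capped at 1000, and the number is redacted
```

Because the function has no side effects, the sampling handler can call it first and then either prompt the user or go straight to the LLM, and the policy itself stays easy to unit test.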

[Figure: A server using sampling to analyse data before returning a structured response.]

Server-Side Sampling Usage

On the server side, you request sampling by calling createMessage on the server instance; sampling itself is a capability that the client declares. Here is a server that uses sampling to classify user intent before deciding which database to query:

import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';

const server = new McpServer({ name: 'smart-search', version: '1.0.0' });

// McpServer wraps a lower-level Server object, which is what exposes createMessage()
const serverInstance = server.server;

server.tool(
  'intelligent_search',
  'Searches across databases, routing the query based on intent',
  { query: z.string().describe('The search query') },
  async ({ query }) => {
    // Use sampling to classify the query intent
    const classification = await serverInstance.createMessage({
      messages: [{
        role: 'user',
        content: {
          type: 'text',
          text: `Classify this search query into one of: products, users, orders, docs.\nQuery: "${query}"\nRespond with only the category name.`,
        },
      }],
      maxTokens: 10,
    });

    // The result may be non-text content, so check the type before reading .text
    const category = classification.content.type === 'text'
      ? classification.content.text.trim().toLowerCase()
      : '';

    // Route to the appropriate search function
    // (searchProducts, searchUsers, searchOrders, and searchDocs are application-defined helpers)
    let results;
    switch (category) {
      case 'products': results = await searchProducts(query); break;
      case 'users': results = await searchUsers(query); break;
      case 'orders': results = await searchOrders(query); break;
      default: results = await searchDocs(query);  // Safe fallback for unexpected labels
    }

    return { content: [{ type: 'text', text: JSON.stringify(results) }] };
  }
);

In a real project you would treat the classification step as a bounded, cheap call (low maxTokens, strict prompt) and keep routing logic easy to unit test. If the model returns an unexpected label, fall back to a safe default path instead of failing the whole tool.
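The fallback advice is easiest to follow when label parsing lives in its own function. The sketch below is one possible shape; the parseCategory name and the choice of 'docs' as the default are assumptions carried over from the routing example, not anything the SDK provides.

```typescript
const CATEGORIES = ['products', 'users', 'orders', 'docs'] as const;
type Category = (typeof CATEGORIES)[number];

// Normalise raw model output to a known category, or return the fallback.
function parseCategory(raw: string, fallback: Category = 'docs'): Category {
  // Drop whitespace, punctuation, and casing differences such as "Products.\n"
  const cleaned = raw.trim().toLowerCase().replace(/[^a-z]/g, '');
  const match = CATEGORIES.find(c => c === cleaned);
  return match ?? fallback;
}

const a = parseCategory(' Products.\n');          // 'products'
const b = parseCategory('I think it is orders');  // 'docs' (not an exact label, so fall back)
const c = parseCategory('');                      // 'docs'
```

The function deliberately rejects chatty answers instead of searching them for a keyword: if the model ignores the "respond with only the category name" instruction, the safe default path is taken.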

“Sampling allows servers to request LLM completions through the client, enabling sophisticated agentic behaviors while maintaining security through human oversight. The client retains control over which model is used and what requests are permitted.” – MCP Documentation, Sampling

Sampling Parameters

The sampling/createMessage request supports model preferences and sampling parameters. These are preferences, not requirements – the client may choose to ignore them if they conflict with its policy or available models.

const response = await serverInstance.createMessage({
  messages: [{ role: 'user', content: { type: 'text', text: 'Summarise in one sentence.' } }],
  maxTokens: 100,
  temperature: 0.3,           // Lower = more deterministic
  modelPreferences: {
    hints: [{ name: 'claude-3-5-haiku' }], // Preferred model - client may ignore
    costPriority: 0.8,         // 0-1: prefer cheaper models
    speedPriority: 0.9,        // 0-1: prefer faster models
    intelligencePriority: 0.2, // 0-1: prefer smarter models
  },
  systemPrompt: 'You are a concise summariser.',
});

Those preferences are negotiation, not a guarantee: the host may pin a single approved model or ignore cost and speed hints for compliance. Use them to express intent, then document what your client actually honours so server authors know what to expect.
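To make the negotiation concrete, here is a minimal sketch of how a client might turn modelPreferences into a concrete model. Everything about the catalogue is an assumption: the ModelInfo shape, the 0-1 ratings, and the model ids are invented for illustration, and a real host may weigh things differently or ignore the preferences entirely. Hints are matched as substrings, in order, before the priorities are consulted.

```typescript
interface ModelPreferences {
  hints?: { name?: string }[];
  costPriority?: number;
  speedPriority?: number;
  intelligencePriority?: number;
}

// Hypothetical catalogue entry: each trait is rated 0-1 by the host operator.
interface ModelInfo {
  id: string;
  cheapness: number;
  speed: number;
  intelligence: number;
}

function selectModel(prefs: ModelPreferences, approved: ModelInfo[]): ModelInfo {
  if (approved.length === 0) throw new Error('No approved models configured');

  // 1. Honour the first hint that matches an approved model (substring match).
  for (const hint of prefs.hints ?? []) {
    const match = approved.find(m => hint.name !== undefined && m.id.includes(hint.name));
    if (match) return match;
  }

  // 2. Otherwise pick the highest weighted score across the three priorities.
  const score = (m: ModelInfo) =>
    (prefs.costPriority ?? 0) * m.cheapness +
    (prefs.speedPriority ?? 0) * m.speed +
    (prefs.intelligencePriority ?? 0) * m.intelligence;
  return approved.reduce((best, m) => (score(m) > score(best) ? m : best));
}

// Usage with two made-up approved models
const approved: ModelInfo[] = [
  { id: 'fast-small', cheapness: 0.9, speed: 0.9, intelligence: 0.3 },
  { id: 'big-smart', cheapness: 0.2, speed: 0.3, intelligence: 0.95 },
];

const chosen = selectModel(
  {
    hints: [{ name: 'claude-3-5-haiku' }], // no approved model matches this hint
    costPriority: 0.8,
    speedPriority: 0.9,
    intelligencePriority: 0.2,
  },
  approved,
);
// chosen.id === 'fast-small' because the hint is unavailable and cost/speed dominate
```

Filtering to an approved list before scoring is what lets a host pin models for compliance while still giving server hints some effect.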

Failure Modes with Sampling

Case 1: Using Sampling for Every Decision

Sampling adds latency and cost. Using it for decisions that deterministic code can make (string matching, a regex, a simple lookup) is wasteful. Reserve sampling for genuinely ambiguous situations where LLM understanding adds real value.

// WASTEFUL: Sampling for something a regex handles
const llmVerdict = await serverInstance.createMessage({
  messages: [{ role: 'user', content: { type: 'text', text: `Is "${input}" an email address? Yes or No.` } }],
  maxTokens: 5,
});

// BETTER: Just use a regex (no round trip, no tokens, same answer every time)
const isEmail = /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(input);

Why this matters: every sampling round trip adds latency and billed tokens. In a real project you would profile hot tools and replace LLM branches with deterministic code wherever the spec is stable.
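A common middle ground is to try deterministic rules first and sample only when none of them matches. The sketch below assumes a hypothetical sample callback standing in for a sampling call, and the two rules (an ORD- prefix and an email pattern) are illustrative rather than exhaustive.

```typescript
type Route = 'products' | 'users' | 'orders' | 'docs';

// Cheap deterministic rules, checked in order.
const RULES: { pattern: RegExp; route: Route }[] = [
  { pattern: /^ORD-\d+$/i, route: 'orders' },
  { pattern: /^[^@\s]+@[^@\s]+\.[^@\s]+$/, route: 'users' },
];

// `sample` is only invoked when no rule matches, so unambiguous queries cost nothing.
async function routeQuery(
  query: string,
  sample: (query: string) => Promise<Route>,
): Promise<{ route: Route; usedSampling: boolean }> {
  const rule = RULES.find(r => r.pattern.test(query.trim()));
  if (rule) return { route: rule.route, usedSampling: false };
  return { route: await sample(query), usedSampling: true };
}
```

Returning usedSampling alongside the route makes it straightforward to log how often the LLM path is actually taken, which is the number to watch when profiling a hot tool.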

Case 2: Infinite Sampling Loops

If a server uses sampling and the LLM response triggers another tool call that uses sampling again, you can create infinite loops. Always set a maximum recursion depth and terminate if exceeded.

// Guard against recursion depth
async function toolHandler({ query }, context, depth = 0) {
  if (depth > 3) {
    return { isError: true, content: [{ type: 'text', text: 'Max reasoning depth exceeded.' }] };
  }
  const classification = await serverInstance.createMessage({ ... });
  if (needsMoreInfo(classification)) {
    return toolHandler({ query: refineQuery(query) }, context, depth + 1);
  }
  return finalResponse(classification);
}
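The same guard can be written as a loop, which keeps the depth counter out of the handler's signature. In the sketch below, step is a hypothetical stand-in for one sampling round trip plus the needsMoreInfo and refineQuery decisions; the limit mirrors the snippet above, allowing depths 0 through 3 before giving up.

```typescript
type Verdict =
  | { done: true; answer: string }
  | { done: false; refinedQuery: string };

// Runs `step` until it reports completion or the depth limit is exhausted.
async function reasonWithLimit(
  query: string,
  step: (query: string) => Promise<Verdict>,
  maxDepth = 3,
): Promise<{ isError: boolean; text: string; rounds: number }> {
  let current = query;
  for (let depth = 0; depth <= maxDepth; depth++) {
    const verdict = await step(current);
    if (verdict.done) {
      return { isError: false, text: verdict.answer, rounds: depth + 1 };
    }
    current = verdict.refinedQuery; // try again with the refined query
  }
  // Every allowed round asked for more information: stop instead of looping forever
  return { isError: true, text: 'Max reasoning depth exceeded.', rounds: maxDepth + 1 };
}
```

With the default limit, a step that never finishes is called four times and then the function returns the error result, so the worst-case cost of a single tool call is known in advance.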

Tool Calling in Sampling Requests

New in 2025-11-25

Starting with spec version 2025-11-25, servers can include tools and toolChoice parameters in a sampling/createMessage request. This lets the server constrain which tools the LLM may call during the sampling turn. Without this, the LLM during sampling would either have no tools at all or the full tool set – there was no way for the server to scope the tools available during a recursive inference.

// Server: sampling request with constrained tool set
const response = await serverInstance.createMessage({
  messages: [{
    role: 'user',
    content: {
      type: 'text',
      text: 'Look up the current status of order ORD-12345 and summarise it.',
    },
  }],
  maxTokens: 500,
  tools: [
    {
      name: 'get_order_status',
      description: 'Look up the current status of an order by ID',
      inputSchema: {
        type: 'object',
        properties: {
          orderId: { type: 'string', description: 'The order ID' },
        },
        required: ['orderId'],
      },
    },
  ],
  toolChoice: { mode: 'auto' },  // mode: 'auto' | 'required' | 'none'
});

The tools array defines the tool definitions available during this specific sampling turn. The toolChoice parameter controls how the LLM uses them: { mode: 'auto' } lets the model decide, { mode: 'required' } makes it call at least one tool, and { mode: 'none' } disables tool use entirely. This is useful when a server needs the LLM to do a lookup-then-reason task: you provide only the lookup tool, the LLM calls it, gets the data, and writes a summary.

The client does not execute these tools itself. When the model decides to call one, the client returns an assistant message containing the tool-use request (with stopReason 'toolUse') to the server. The server runs the tool, appends the result to the conversation, and sends a follow-up sampling/createMessage request; this repeats until the model produces a final answer. The client still controls model access and user approval on every turn, while tool execution stays with the server that defined the tools.

What to Check Right Now

Declare sampling on your client – if you want servers to be able to use sampling, your client must declare capabilities: { sampling: {} }. Without this, sampling requests from servers will be rejected.
Implement a sampling handler – if you build a host application, implement the CreateMessageRequestSchema handler. Without a registered handler, every sampling request fails with an error, and the tool that issued it fails with it.
Show sampling requests to users – the spec emphasises human oversight. Production hosts should surface pending sampling requests to users and allow approval/rejection.
Cap sampling depth – any server that uses sampling recursively must have a maximum depth limit. Without it, one malformed query can run up unbounded costs.

nJoy 😉
