Choosing the Right Model: A Decision Framework for MCP Applications

Every team building MCP applications eventually faces the same question: which model should we use for this task? The wrong answer is “the most capable one”: that is how teams burn through their budget on GPT-4o for queries that GPT-4o mini could answer just as well. This lesson builds a systematic decision framework: a set of questions and criteria that map task characteristics to optimal model choices, plus the infrastructure to implement dynamic routing in production.

Figure: decision framework flowchart for selecting among OpenAI, Claude, and Gemini models based on task type.
Model selection is a routing problem: match task characteristics to the cheapest model that meets quality requirements.

The Five Dimensions of Model Selection

Five dimensions determine the optimal model choice for an MCP task:

  1. Reasoning depth: Is the task multi-step, does it require planning, or does it involve complex logic? Use Claude 3.7 or o3. Is it a simple lookup or classification? Use a mini/flash model.
  2. Context length: Does the task involve large documents, entire codebases, or long conversation history? Use Gemini 2.5 Pro (1M tokens) or Claude 3.7 (200K tokens). For standard tasks, a 128K context window is sufficient.
  3. Input modality: Does the task involve images, PDFs, or audio? Use Gemini (strongest multimodal support). For text-only tasks, any provider works.
  4. Output format: Does the task require guaranteed JSON schema output? Use OpenAI with zodResponseFormat. Free-form prose or code? Any provider.
  5. Volume and cost: Is this a high-throughput task called thousands of times per hour? Use Gemini 2.0 Flash ($0.075/1M input) or GPT-4o mini ($0.15/1M input) before considering more expensive models.
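
To make the volume-and-cost dimension concrete, the sketch below projects monthly input spend for one workload across three models, using the input prices quoted in this lesson. The workload (2,000 requests per hour, 800 input tokens each) and the helper name monthlyInputCost are made up for illustration, and output-token cost is ignored.

```javascript
// Input prices in USD per 1M tokens, as quoted in this lesson.
const INPUT_PRICE_PER_MILLION = {
  'gemini-2.0-flash': 0.075,
  'gpt-4o-mini': 0.15,
  'gpt-4o': 2.5,
};

// Hypothetical helper: projected input cost in USD for a 30-day month.
function monthlyInputCost(model, requestsPerHour, inputTokensPerRequest) {
  const price = INPUT_PRICE_PER_MILLION[model];
  if (price === undefined) throw new Error(`No price for model: ${model}`);
  const tokensPerMonth = requestsPerHour * 24 * 30 * inputTokensPerRequest;
  return (tokensPerMonth / 1_000_000) * price;
}

// Illustrative workload: 2,000 requests/hour at 800 input tokens each.
for (const model of Object.keys(INPUT_PRICE_PER_MILLION)) {
  const usd = monthlyInputCost(model, 2000, 800);
  console.log(`${model}: $${usd.toFixed(2)} per month`);
}
```

At this volume the gap between the cheapest and the most expensive option is a factor of about 33 on input tokens alone, which is why the volume question comes before any upgrade to a larger model.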

The Decision Framework

// Task routing decision table
// Use this as a starting point for your routing config

const ROUTING_RULES = [
  // Rule order matters - first match wins
  {
    name: 'multimodal',
    condition: (task) => task.hasImages || task.hasPDF || task.hasAudio,
    provider: 'gemini', model: 'gemini-2.0-flash',
    reason: 'Native multimodal support, cheapest multimodal option',
  },
  {
    name: 'large-context',
    condition: (task) => task.estimatedInputTokens > 100_000,
    provider: 'gemini', model: 'gemini-2.5-pro-preview-03-25',
    reason: '1M token context window, best for whole-document/codebase analysis',
  },
  {
    name: 'deep-reasoning',
    condition: (task) => task.requiresPlanning || task.complexity === 'high',
    provider: 'claude', model: 'claude-3-7-sonnet-20250219',
    reason: 'Extended thinking mode, best instruction following',
  },
  {
    name: 'structured-output',
    condition: (task) => task.requiresStrictJSON,
    provider: 'openai', model: 'gpt-4o',
    reason: 'zodResponseFormat guarantees JSON schema adherence',
  },
  {
    name: 'high-volume-simple',
    condition: (task) => task.volume > 1000 && task.complexity === 'low',
    provider: 'gemini', model: 'gemini-2.0-flash',
    reason: 'Cheapest per-token, sufficient for simple tasks at scale',
  },
  {
    name: 'default',
    condition: () => true,
    provider: 'openai', model: 'gpt-4o-mini',
    reason: 'Good balance of capability and cost for general tasks',
  },
];

export function selectModel(task) {
  const rule = ROUTING_RULES.find(r => r.condition(task));
  return { provider: rule.provider, model: rule.model, reason: rule.reason };
}

Figure: model selection matrix showing the task complexity vs. cost tradeoff, with provider recommendations in each quadrant.
The cost-capability matrix: route high-volume simple tasks to cheap models and complex reasoning to capable models.
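
Because the first matching rule wins, the position of a rule is part of its meaning. The sketch below uses a trimmed-down, three-rule copy of the table to show how a task that satisfies two conditions is routed, and what happens if the catch-all is moved to the front. The names RULES and firstMatch are hypothetical and exist only for this illustration.

```javascript
// Trimmed-down copy of the routing table (three rules only, for illustration).
const RULES = [
  { name: 'multimodal', condition: (t) => Boolean(t.hasImages), model: 'gemini-2.0-flash' },
  { name: 'deep-reasoning', condition: (t) => t.complexity === 'high', model: 'claude-3-7-sonnet-20250219' },
  { name: 'default', condition: () => true, model: 'gpt-4o-mini' },
];

// Return the first rule whose condition holds; fail loudly if none does.
function firstMatch(rules, task) {
  const rule = rules.find((r) => r.condition(task));
  if (!rule) throw new Error('No rule matched; add a catch-all as the last rule');
  return rule;
}

// This task satisfies both the multimodal and the deep-reasoning condition.
const both = { hasImages: true, complexity: 'high' };
console.log(firstMatch(RULES, both).name);                // multimodal
console.log(firstMatch([...RULES].reverse(), both).name); // default
```

If an image-heavy task that also needs deep reasoning should go to a stronger model, that has to be expressed as a more specific rule placed above the multimodal one.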

Estimating Task Complexity

// Simple heuristics for runtime complexity estimation
export function classifyTask(userMessage, context = {}) {
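  // Count non-empty whitespace-separated words (an empty message counts as 0)
  const words = userMessage.trim().split(/\s+/).filter(Boolean).length;
  // \b anchors keywords to word boundaries so "plan" does not match "explanation"
  const hasAnalyze = /\b(?:analyz|evaluate|compare|assess|plan|strategy)/i.test(userMessage);
  const hasSimple = /\b(?:list|find|get|show|what is|how many)\b/i.test(userMessage);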

  return {
    complexity: hasAnalyze ? 'high' : (hasSimple ? 'low' : 'medium'),
    estimatedInputTokens: Math.ceil(words * 1.3) + (context.historyTokens ?? 0),
    hasImages: context.hasImages ?? false,
    hasPDF: context.hasPDF ?? false,
    hasAudio: context.hasAudio ?? false,
    requiresStrictJSON: context.requiresStrictJSON ?? false,
    requiresPlanning: hasAnalyze,
    volume: context.requestsPerHour ?? 0,
  };
}
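
Keyword heuristics like these are sensitive to how the pattern is anchored. The sketch below compares a plain substring pattern with one anchored to the start of a word, using two keywords from the classifier. The sample messages are invented for illustration.

```javascript
// A plain alternation also fires inside unrelated words.
const loose = /plan|get/i;
// A leading \b anchors each keyword to the start of a word.
const anchored = /\b(?:plan|get)/i;

const samples = [
  'Give me an explanation of DNS', // "explanation" contains "plan"
  'Plan the migration',            // genuine planning request
  'Put the widgets together',      // "widgets" and "together" contain "get"
];

// Columns: loose match, anchored match, message.
for (const s of samples) {
  console.log(`${loose.test(s)}\t${anchored.test(s)}\t${s}`);
}
```

Only the second message is a real planning request, and only the anchored pattern singles it out. Misclassifying a simple question as high complexity is exactly the kind of error that sends cheap queries to an expensive model.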

Cascading Fallback Strategy

// Try the primary provider; fall back to a second one on quota errors only
export async function runWithFallback(task, providers) {
  // Note: the selected model is not passed to run(); each provider adapter
  // is assumed to choose its own model
  const { provider: primaryKey } = selectModel(task);
  const fallbackKey = primaryKey === 'gemini' ? 'openai' : 'gemini';

  let lastError;
  for (const key of [primaryKey, fallbackKey]) {
    const provider = providers[key];
    if (!provider) continue;
    try {
      return await provider.run(task.message, task.mcpClient);
    } catch (err) {
      const isQuota = err.status === 429 || err.message?.includes('RESOURCE_EXHAUSTED');
      if (!isQuota) throw err; // non-quota errors propagate immediately
      lastError = err;
      console.error(`[router] ${key} quota hit, trying next provider`);
    }
  }
  // Keep the last quota error attached for debugging
  throw new Error('All providers exhausted', { cause: lastError });
}
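
The fallback behaviour is easy to exercise without real API calls. The sketch below is a self-contained variant with mock providers: the Gemini stand-in always fails with a quota error and the OpenAI stand-in succeeds. The names tryInOrder and isQuotaError and the mock adapters are hypothetical; real adapters would wrap the provider SDKs.

```javascript
// Hypothetical stand-ins for real provider adapters.
const providers = {
  gemini: {
    run: async () => {
      const err = new Error('RESOURCE_EXHAUSTED');
      err.status = 429;
      throw err;
    },
  },
  openai: { run: async (message) => `openai handled: ${message}` },
};

// Quota errors are identified the same way as in the lesson's code.
const isQuotaError = (err) =>
  err.status === 429 || Boolean(err.message?.includes('RESOURCE_EXHAUSTED'));

// Try each provider key in order; move on only for quota errors.
async function tryInOrder(keys, providers, message) {
  let lastError;
  for (const key of keys) {
    const provider = providers[key];
    if (!provider) continue; // skip providers that are not configured
    try {
      return { key, result: await provider.run(message) };
    } catch (err) {
      if (!isQuotaError(err)) throw err;
      lastError = err;
    }
  }
  throw new Error('All providers exhausted', { cause: lastError });
}

tryInOrder(['gemini', 'openai'], providers, 'ping')
  .then(({ key, result }) => console.log(key, result)); // openai openai handled: ping
```

Taking the provider order as an explicit list makes it straightforward to add a third provider or to vary the chain per task type.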

Building a Cost Dashboard

// Record the cost of each request; summary() totals spend per provider/model
class CostTracker {
  #records = [];

  record({ provider, model, inputTokens, outputTokens, taskType }) {
    // Prices in USD per 1M tokens (a snapshot; update when providers change pricing)
    const costs = {
      'gpt-4o': { input: 2.5, output: 10 },
      'gpt-4o-mini': { input: 0.15, output: 0.60 },
      'claude-3-7-sonnet-20250219': { input: 3.0, output: 15 },
      'claude-3-5-haiku-20241022': { input: 0.80, output: 4 },
      'gemini-2.0-flash': { input: 0.075, output: 0.30 },
      'gemini-2.5-pro-preview-03-25': { input: 1.25, output: 10 },
    };
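    const c = costs[model];
    // Unknown models are recorded at $0, so warn instead of silently under-counting
    if (!c) console.warn(`[cost] no pricing for ${model}, recording $0`);
    const cost = c ? (inputTokens * c.input + outputTokens * c.output) / 1_000_000 : 0;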
    this.#records.push({ provider, model, taskType, cost, ts: Date.now() });
  }

  summary() {
    return this.#records.reduce((acc, r) => {
      const key = `${r.provider}/${r.model}`;
      acc[key] = (acc[key] ?? 0) + r.cost;
      return acc;
    }, {});
  }
}
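
To break spend down by hour and task type, the stored records can be bucketed by timestamp. This is a minimal sketch, assuming records shaped like the ones CostTracker pushes ({ taskType, cost, ts }). summarizeByHour is a hypothetical helper, and CostTracker would need a method that exposes its private records before it could be used there.

```javascript
// Group cost records into UTC hour buckets, then by task type.
function summarizeByHour(records) {
  const buckets = {};
  for (const { taskType, cost, ts } of records) {
    // "2025-01-01T10:05:00.000Z" -> "2025-01-01T10:00Z"
    const hour = new Date(ts).toISOString().slice(0, 13) + ':00Z';
    buckets[hour] ??= {};
    buckets[hour][taskType] = (buckets[hour][taskType] ?? 0) + cost;
  }
  return buckets;
}

// Invented sample data: two lookups in one hour, one analysis in the next.
const t0 = Date.UTC(2025, 0, 1, 10, 5);
const sample = [
  { taskType: 'lookup', cost: 0.002, ts: t0 },
  { taskType: 'lookup', cost: 0.003, ts: t0 + 60_000 },
  { taskType: 'analysis', cost: 0.05, ts: t0 + 3_600_000 },
];

console.log(summarizeByHour(sample));
```

Bucketing on UTC hours keeps the keys stable regardless of the server's local time zone.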

Common Routing Mistakes

  • Always routing to the most capable model: GPT-4o for every query costs roughly 17x as much as GPT-4o mini ($2.50 vs. $0.15 per 1M input tokens) for tasks where both work equally well. Benchmark first, then route based on evidence.
  • Not accounting for caching: OpenAI’s automatic caching and Claude’s explicit cache_control can change the effective cost dramatically for repeated queries with the same prefix. Factor this into your cost model.
  • Routing on task type without measuring quality: A routing decision is only valid if you have measured that the cheaper model produces acceptable results for the task type. Build eval sets per task type and validate routing assumptions.
  • Ignoring latency: Cost is not the only dimension. GPT-4o mini has much lower latency than GPT-4o. Gemini 2.0 Flash is faster still. For user-facing real-time features, latency matters as much as cost.
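
One way to turn "benchmark first, then route on evidence" into code is to pick, per task type, the cheapest model whose measured quality clears a bar. The sketch below assumes a manual review log with 1 to 5 quality scores. The review rows, the threshold, and the names cheapestAcceptable and meanScore are all invented for illustration.

```javascript
// Hypothetical review log: one row per reviewed response.
const reviews = [
  { taskType: 'lookup', model: 'gpt-4o-mini', score: 5 },
  { taskType: 'lookup', model: 'gpt-4o-mini', score: 4 },
  { taskType: 'lookup', model: 'gpt-4o', score: 5 },
  { taskType: 'analysis', model: 'gpt-4o-mini', score: 2 },
  { taskType: 'analysis', model: 'gpt-4o', score: 4 },
];

// Candidate models ordered cheapest first (per the prices in this lesson).
const CHEAPEST_FIRST = ['gpt-4o-mini', 'gpt-4o'];

function meanScore(rows) {
  return rows.reduce((sum, r) => sum + r.score, 0) / rows.length;
}

// Cheapest model whose mean score meets the bar, or null if none does
// (including the case where a model has no reviews for this task type).
function cheapestAcceptable(taskType, minMean) {
  for (const model of CHEAPEST_FIRST) {
    const rows = reviews.filter((r) => r.taskType === taskType && r.model === model);
    if (rows.length > 0 && meanScore(rows) >= minMean) return model;
  }
  return null;
}

console.log(cheapestAcceptable('lookup', 4));   // gpt-4o-mini
console.log(cheapestAcceptable('analysis', 4)); // gpt-4o
```

A null result means no candidate has evidence of acceptable quality for that task type, which is a signal to gather more reviews rather than to guess.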

What to Build Next

  • Run 20 real queries from your application through the framework above. Log provider, model, task complexity, cost, and a quality score (manual review). Use this data to refine the routing rules.
  • Set up a cost alert: if hourly spend exceeds a threshold, log a warning and automatically down-route to cheaper models.
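
The cost alert can start as a small guard that sits between the router and the provider call. The sketch below tracks spend in a sliding one-hour window and swaps in a cheaper choice while the window is over budget. BudgetGuard, its method names, and the numbers in the usage lines are hypothetical; timestamps are passed in explicitly so the behaviour is easy to test.

```javascript
// Hypothetical budget guard: tracks spend in a sliding one-hour window.
class BudgetGuard {
  #events = [];

  constructor(hourlyLimitUsd) {
    this.hourlyLimitUsd = hourlyLimitUsd;
  }

  // Record the cost of one completed request.
  add(costUsd, now = Date.now()) {
    this.#events.push({ costUsd, ts: now });
  }

  // Total spend over the last hour; entries older than that are dropped.
  spentLastHour(now = Date.now()) {
    const cutoff = now - 3_600_000;
    this.#events = this.#events.filter((e) => e.ts > cutoff);
    return this.#events.reduce((sum, e) => sum + e.costUsd, 0);
  }

  // Keep the routed choice while under budget; otherwise warn and down-route.
  apply(choice, cheapChoice, now = Date.now()) {
    if (this.spentLastHour(now) <= this.hourlyLimitUsd) return choice;
    console.warn(`[router] hourly spend over $${this.hourlyLimitUsd}, down-routing`);
    return cheapChoice;
  }
}

// Usage with made-up numbers: a $1/hour limit and two $0.60 requests.
const guard = new BudgetGuard(1);
guard.add(0.6, 0);
guard.add(0.6, 1_000);
const routed = { provider: 'openai', model: 'gpt-4o' };
const cheap = { provider: 'gemini', model: 'gemini-2.0-flash' };
console.log(guard.apply(routed, cheap, 2_000).model); // gemini-2.0-flash
```

Down-routing trades quality for cost, so it is worth excluding task types where the cheaper model has not passed your evals.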

nJoy 😉
