Models plan; code executes. Every agent therefore needs a loop: the model proposes an action, something runs it, the result comes back into context, and the cycle repeats until the job is finished. That structure is not negotiable. What is negotiable is where the loop runs: in your service, hopping back and forth over HTTP between your code and the model API, or inside the inference provider’s runtime, hidden behind what looks like a single API call.
The distinction matters more than it appears in documentation. Local loops give you full visibility: every hop in your logs, your own timeouts, breakpoints at the tool boundary. Managed loops shrink your codebase and eliminate the runner boilerplate, but move tool execution out of your sight. Choose wrong and you get agents that behave perfectly in development, then stall in production with cold-start latency, silent empty tool results, or confident-sounding replies that never acknowledge a failed search.
This article walks through both patterns using concrete chat examples: what the user sees, what your application actually does, where inference runs, and where the goal gate decides the loop is finished.
Left: your application orchestrates every LLM call and tool hop. Right: one API call, and the provider runs the inner loop on your behalf.
The Goal Gate: How Any Loop Actually Terminates
Before comparing local and managed patterns, it is worth establishing what they share: both need a rule that says “we are done now.” Without it, the agent either loops forever or returns mid-thought. In code, that termination rule usually looks like one of three things:
The model returns text with no tool_calls – treat that as the final answer and exit.
A hard step counter hits max_steps – stop, apologise, and escalate rather than burn budget silently.
A lightweight judge – a rules check, a small classifier, or a second LLM call – decides the original question is satisfied, even if the model would otherwise request another tool.
This is the goal gate: the point after each tool result where something asks “are we done yet?” Local and managed loops differ in who runs that gate – your for loop versus the provider’s runtime – not in whether it exists.
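The three termination rules can be combined into one small function. What follows is a minimal sketch rather than a prescribed implementation: the name `goalGate`, the optional `judge` callback, and the returned `{ done, reason }` shape are assumptions made for illustration, and the response shape (`content`, `tool_calls`) simply mirrors the examples in this article.

```javascript
// Hypothetical goal gate combining the three termination rules.
// Assumes a model response shaped like { content, tool_calls }.
function goalGate({ response, step, maxSteps, judge }) {
  // Hard step cap: checked first so a runaway loop always stops.
  if (step >= maxSteps) {
    return { done: true, reason: "max_steps" };
  }
  // The model returned text and requested no tools: final answer.
  const wantsTools = (response.tool_calls ?? []).length > 0;
  if (!wantsTools && response.content) {
    return { done: true, reason: "final_answer" };
  }
  // Optional judge (rules check, small classifier, or second LLM call)
  // may declare the question satisfied even if the model wants more tools.
  if (judge && judge(response)) {
    return { done: true, reason: "judge_satisfied" };
  }
  return { done: false, reason: "continue" };
}
```

Returning a reason alongside the boolean makes it easy to log why a loop stopped, which helps when a step cap is hit more often than expected.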
Local Loop: Your Application Orchestrates Every Hop
In the local pattern, your code owns the loop entirely. From the user’s perspective, the conversation looks normal. Behind the scenes, your service may call the LLM two or three times and execute one or more tools before a single assistant message reaches the chat UI.
Example: “What’s the status of order 8842?”
What the user types:
User: What's the status of order 8842?
Hop 1 – your app sends the conversation to the LLM:
// Messages sent to the model API:
[
  { "role": "system", "content": "You help with orders. Use lookup_order when needed." },
  { "role": "user", "content": "What's the status of order 8842?" }
]
// The model does not answer yet. It requests a tool:
{
  "content": null,
  "tool_calls": [{
    "name": "lookup_order",
    "arguments": { "order_id": "8842" }
  }]
}
Hop 2 – your application runs the tool. Not the LLM. Your code:
const result = await db.query(
"SELECT status, carrier, eta FROM orders WHERE id = $1",
["8842"]
);
// { status: "shipped", carrier: "DHL", eta: "2026-06-26" }
Hop 3 – your app sends the tool result back to the LLM:
// The conversation now includes the tool output:
{ "role": "tool", "tool_call_id": "...", "content": "{\"status\":\"shipped\",...}" }
// The model synthesises a final answer:
{
"content": "Order 8842 shipped via DHL and should arrive on 26 June.",
"tool_calls": []
}
Goal gate check: tool_calls is empty and content is non-empty. Loop exits. Your app sends one message to the chat UI.
What the user sees:
Assistant: Order 8842 shipped via DHL and should arrive on 26 June.
One user turn. Two LLM round-trips. One database query. Your metrics should capture all three latency contributions separately, not lump them into a single “response time” figure.
Example: Multi-tool research (three hops, one visible reply)
User: Compare our refund policy with what competitor X published last month.
What the chat transcript shows the user: two messages, the question and a single reply. What the internal loop does:
LLM requests fetch_url with the URL of the competitor's refund policy page. Your scraper runs; returns markdown.
LLM requests search_knowledge_base("refund policy"). Your vector DB runs; returns a chunk.
LLM synthesises both results into prose. No further tools. Goal gate passes.
Each hop adds wall-clock time. In local mode, you log them individually: llm_hop1_ms, tool_fetch_url_ms, llm_hop2_ms, tool_search_kb_ms, llm_hop3_ms. Without that granularity, you cannot tell which hop caused a slow reply.
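One way to get per-hop numbers is to wrap every LLM call and tool call in a small timer. The sketch below assumes a plain array as the metrics sink. The helper names `timed` and `sumMs` are hypothetical, and a real service would forward the entries to its own metrics client.

```javascript
// Hypothetical per-hop timer: records one entry per LLM call or tool call.
// `metrics` is a plain array here, standing in for a real metrics client.
async function timed(label, metrics, fn) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    // Recorded even when fn throws, so failed hops still show up.
    metrics.push({ label, ms: performance.now() - start });
  }
}

// Sums recorded durations by label prefix, e.g. "llm_" versus "tool_".
function sumMs(metrics, prefix) {
  return metrics
    .filter((m) => m.label.startsWith(prefix))
    .reduce((sum, m) => sum + m.ms, 0);
}
```

Inside a loop runner, a call such as `await timed("llm_hop1_ms", metrics, () => llm.chat({ messages, tools }))` keeps the timing out of the business logic, and `sumMs(metrics, "tool_")` gives total tool time for one user-visible reply.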
The local loop in code
async function runAgent(messages, tools, maxSteps = 8) {
  for (let step = 0; step < maxSteps; step++) {
    const res = await llm.chat({ messages, tools });
    // Goal gate: no tool calls means the model returned a final answer.
    if (!res.tool_calls?.length) {
      return res.content;
    }
    // Record the assistant turn that requested the tools. Most chat APIs
    // expect each tool result to follow the request it answers.
    messages.push({ role: "assistant", content: res.content, tool_calls: res.tool_calls });
    // Execute each requested tool locally and append results.
    for (const call of res.tool_calls) {
      const out = await executeLocally(call);
      messages.push(toolResult(call.id, out));
    }
  }
  // Goal gate never opened within the step budget.
  throw new Error("Max steps reached - goal gate never triggered");
}
Notice the step cap is a hard limit, not a best-effort suggestion. An agent without it on open-ended research tasks will happily call tools until your API budget runs out.
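The loop above calls two helpers it does not define, `executeLocally` and `toolResult`. The sketch below shows one possible shape for them. The registry, the stubbed `lookup_order` handler, and the `{ ok, data, error }` result envelope are all assumptions made for illustration, not part of any particular SDK.

```javascript
// Hypothetical tool registry: maps tool names to local handler functions.
const toolRegistry = new Map([
  // Stub handler with made-up data; a real one would query your database.
  ["lookup_order", async ({ order_id }) => ({ order_id, status: "shipped" })],
]);

// Runs one tool call locally. Failures become explicit results rather than
// exceptions, so the model is told about the failure instead of guessing.
async function executeLocally(call) {
  const handler = toolRegistry.get(call.name);
  if (!handler) {
    return { ok: false, error: `Unknown tool: ${call.name}` };
  }
  try {
    const data = await handler(call.arguments);
    return { ok: true, data };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { ok: false, error: detail };
  }
}

// Builds a tool-result message in the shape used earlier in this article.
function toolResult(toolCallId, out) {
  return { role: "tool", tool_call_id: toolCallId, content: JSON.stringify(out) };
}
```

Returning errors as data keeps the loop running after a failed tool call. This is the point where a local loop can add an explicit `if` check that a managed loop cannot.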
Managed Loop: The Provider Runs the Inner Cycle
The managed pattern is the same logical loop, but the provider executes it. Your application sends one request; the provider calls the model, dispatches tools, feeds results back to the model, and returns only when the goal gate inside its runtime says the task is complete. From your code's point of view, a multi-hop agent interaction looks like a single blocking API call.
Example: "What's the status of order 8842?" (managed)
Your application sends one request:
const res = await llm.chat({
  messages: [
    { role: "system", content: "You help with orders." },
    { role: "user", content: "What's the status of order 8842?" }
  ],
  tools: [{
    type: "mcp",
    server_url: "https://tools.example.com/mcp",
    allowed_tools: ["lookup_order"]
  }]
});
// Your code blocks here while the provider runs the loop internally.
Inside the provider runtime (visible only if you enable tracing):
// Conceptual provider trace:
[1] Model requests: tool_call lookup_order("8842")
[2] Provider connects to your MCP server, runs the tool, gets JSON
[3] Model synthesises: "Order 8842 shipped via DHL..."
[4] Goal gate: no further tool calls -> return response to client
What the user sees: the same reply as the local version. The difference is that your MCP server must be publicly reachable from the provider's network, and empty tool results never pass through your code - so you need system prompt guardrails to handle them rather than an if statement in your runner.
Example: Web search agent (managed, serial tools)
User: Summarise Kubernetes networking changes in the last six months.
// Your app sends:
tools: [{ type: "web_search", max_uses: 4 }]
// Provider internally runs: search -> search -> fetch -> synthesise.
// User sees one assistant message after 3-8 seconds total.
In a chat UI this feels like a slightly slow reply. Without tracing enabled on the provider side, you cannot distinguish whether latency came from inference, a second search, or a cold MCP connector start-up.
Serial tool calls inside a managed loop add directly on top of model time. Each hop is invisible in your application logs unless you enable provider tracing.
Side-by-Side: Same Chat, Different Plumbing
| Question | Local loop | Managed loop |
| --- | --- | --- |
| Who calls the LLM? | Your application, once per hop | Provider runtime, inside one request |
| Who runs tools? | Your process / your MCP client | Provider runtime / remote MCP server |
| Who implements the goal gate? | Your loop + max_steps | Provider runtime limits |
| What the user sees | One assistant reply (usually) | One assistant reply (usually) |
| Primary debugging surface | Your logs and breakpoints | Provider trace API |
Capability Buckets (Vendor Names Differ, Patterns Do Not)
Managed loop providers expose different tool types under different names, but they all map to the same four buckets:
Managed search/fetch - model-triggered web retrieval; the provider owns the search stack.
Private corpus lookup - RAG inside the same inference request, against your indexed documents.
Remote MCP - your HTTPS tool server; the provider connects to it and runs your allowlisted tools.
Deferred tool schemas - tool definitions loaded on demand, not upfront. Past roughly 25 tools, this reduces input token cost meaningfully.
"MCP (Model Context Protocol) is an open-source standard for connecting AI applications to external systems." — Model Context Protocol, Introduction
Latency: Count Every Hidden Hop
Case 1: Local tools can be nearly free
Not all tools are slow. Knowing the rough cost of each type helps you decide what to keep local and what can move to managed:
In-process validation or lookup: sub-millisecond
Local MCP server over stdio: ~1 ms when the process is warm
Remote HTTP API: 50-300+ ms depending on geography and cold start
Cold MCP server boot: potentially several seconds on first request
Case 2: Managed loops hide serial stack time
When the provider runs two or three tools serially inside your request, the total latency is the sum of all inference calls plus all tool calls. This adds up quietly:
// Measure before committing to managed mode:
const budgetMs = 800; // acceptable TTFT for chat UX
const modelMs = 350; // per inner LLM call
const innerLlmCalls = 2; // typical for one tool use
const toolMs = 280; // per tool round-trip
const totalMs = modelMs * innerLlmCalls + toolMs;
// 980ms - over budget.
// Options: keep tools local, parallelise, or reduce max_uses.
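The same arithmetic generalises to any number of serial tools. The sketch below assumes every tool runs serially and that each tool call is followed by one more model call, so N tools imply N + 1 inner LLM calls. The function name `estimateSerialLatencyMs` is hypothetical, and the figures are the illustrative ones used above, not measurements.

```javascript
// Rough serial-latency estimate for a managed loop.
// Assumption: tools run one after another, and each tool call is followed
// by one more model call, so N tools imply N + 1 inner LLM calls.
function estimateSerialLatencyMs(modelMs, toolMsList) {
  const innerLlmCalls = toolMsList.length + 1;
  const toolTotal = toolMsList.reduce((sum, ms) => sum + ms, 0);
  return modelMs * innerLlmCalls + toolTotal;
}

const oneTool = estimateSerialLatencyMs(350, [280]);       // 350 * 2 + 280 = 980
const twoTools = estimateSerialLatencyMs(350, [280, 280]); // 350 * 3 + 560 = 1610
```

Adding a second serial tool costs one extra tool round-trip plus one extra model call, which is why hop count matters more than any single tool's latency.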
Case 3: Silent partial failures in managed mode
When a local tool throws an error, your code catches it and can respond explicitly: "I could not reach the order database - please try again." In managed mode, the model receives an empty or error result and may synthesise a confident-sounding answer anyway. Guard against this with a system-prompt rule:
// Add to system prompt for managed tool agents:
"If any tool returns no data or an error, say so explicitly in your reply.
Do not infer or invent the answer. Ask the user to rephrase or try again."
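A prompt rule works better when the tool server gives the model something unambiguous to react to. Since the remote tool server is still your code, one option is to wrap each handler so that empty results and thrown errors come back as labelled payloads. This is a sketch under assumptions: `withExplicitStatus` and the `status` values are hypothetical names, not part of MCP or any provider API.

```javascript
// Hypothetical wrapper for a tool handler running on your own tool server.
// Converts empty results and thrown errors into explicit, labelled payloads
// so the model never receives a bare empty value.
async function withExplicitStatus(handler, args) {
  try {
    const data = await handler(args);
    const isEmpty =
      data == null ||
      data === "" ||
      (Array.isArray(data) && data.length === 0) ||
      (typeof data === "object" && Object.keys(data).length === 0);
    if (isEmpty) {
      return { status: "no_data", message: "The tool found no matching records." };
    }
    return { status: "ok", data };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { status: "error", message: `The tool failed: ${detail}` };
  }
}
```

The system-prompt rule can then refer to these status values directly, which gives the model a concrete signal instead of asking it to notice an absence.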
Failures You Still Own in Managed Mode
Write idempotency. Read tools retry cleanly on failure. Write tools - creating a record, sending an email, placing an order - need deduplication (idempotency) keys to avoid double execution when the provider retries internally.
Network reachability. In managed mode, your MCP server must be publicly reachable from the provider's egress IPs. Tools on private networks, or tools that require local session state, cannot move to managed mode without a proxy layer.
Alert granularity. One "agent slow" alert is not useful. Separate your alerting: provider latency versus your MCP server cold-start versus model output quality degradation. Each has a different fix.
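For the write-idempotency point, a deduplication key can be enforced at the tool boundary. The sketch below is illustrative only: `makeIdempotent` and the `idempotency_key` argument are hypothetical names, and the in-memory Map stands in for whatever shared store a real service would use.

```javascript
// Hypothetical idempotency wrapper for a write tool.
// Assumes the caller supplies a stable `idempotency_key` per logical write.
// The in-memory Map is for illustration; a real service needs a shared store.
function makeIdempotent(writeFn) {
  const seen = new Map(); // key -> Promise of the first execution's result
  return function (args) {
    const key = args.idempotency_key;
    if (!key) {
      return Promise.reject(new Error("idempotency_key is required"));
    }
    if (!seen.has(key)) {
      const pending = Promise.resolve()
        .then(() => writeFn(args))
        .catch((err) => {
          seen.delete(key); // allow a retry after a genuine failure
          throw err;
        });
      seen.set(key, pending);
    }
    // Retries with the same key get the first execution's result.
    return seen.get(key);
  };
}
```

Storing the promise rather than the resolved value means a retry that arrives while the first write is still in flight shares that execution instead of starting a second one.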
When to Stay Local
During development - you need request-level payload logs and the ability to set breakpoints at the tool boundary.
Sub-second chat or voice UX where every millisecond of managed overhead is visible to users.
Tools on private networks or requiring local session state that the provider cannot reach.
Custom goal gates: compliance judges, human-in-the-loop approval steps, or application-specific termination logic that the provider's runtime cannot implement.
When Managed Makes Sense
Standard search and retrieval where you do not want to run your own search infrastructure.
Many separate agents sharing one MCP tool surface - the provider handles routing, not you.
Tool credentials that should not live in application repositories.
Teams that would rather pay for a provider's tracing API than maintain their own loop runner.
Practical Migration Path
Build local first. Log tool name, argument size, latency, and result size per hop from day one.
Classify your tools. Tag each one: read vs write, local vs remote, latency-sensitive vs batch-tolerant.
Move read-heavy, batch-tolerant tools to managed mode first. Keep write tools local until you have idempotency keys and retry semantics confirmed.
Enable provider tracing before cutover and compare inner-hop counts against your local logs. If the counts diverge, investigate before going live.
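The classification and cutover checks in the steps above can be kept as data plus two small functions. This is a sketch under assumptions: the inventory entries, the tag values, and the helper names `firstWave` and `hopCountMismatches` are hypothetical, and `send_email` is an invented example of a write tool.

```javascript
// Hypothetical tool inventory, tagged as described in the classification step.
const toolInventory = [
  { name: "search_knowledge_base", access: "read", location: "remote", latency: "batch-tolerant" },
  { name: "fetch_url", access: "read", location: "remote", latency: "batch-tolerant" },
  { name: "lookup_order", access: "read", location: "local", latency: "latency-sensitive" },
  { name: "send_email", access: "write", location: "remote", latency: "batch-tolerant" },
];

// First migration wave: read-only tools that tolerate extra latency.
function firstWave(tools) {
  return tools
    .filter((t) => t.access === "read" && t.latency === "batch-tolerant")
    .map((t) => t.name);
}

// Pre-cutover check: request IDs whose hop count differs between your
// local logs and the provider trace (both given as { requestId: count }).
function hopCountMismatches(localCounts, providerCounts) {
  return Object.keys(localCounts).filter(
    (id) => localCounts[id] !== providerCounts[id]
  );
}
```

With this inventory, `firstWave(toolInventory)` selects `search_knowledge_base` and `fetch_url`, while the write tool and the latency-sensitive lookup stay local.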
Swarms to Keep an Eye On
This article covers the single-agent loop: one context window, local or managed tools, serial hops. That is the right foundation for most production chat agents. When tasks decompose into parallel research lanes, or when you need orchestrator-worker topologies and peer handoffs between specialists, you are entering multi-agent territory with different cost structures and failure modes.
For the full evolution map (single-shot LLM through ReAct loop, agent stack, and orchestrator swarms), topology comparisons, and a 2026 watchlist covering Claude Research, the OpenAI Agents SDK, LangGraph, and CrewAI, see the companion piece.
Draw your loop on paper with numbered hops and the goal gate labelled. If you cannot draw it, your production agent is opaque.
Split your latency metrics: model time versus tool time per user-visible reply. An aggregate response-time figure hides which hop is slow.
Measure MCP cold-start from a fresh process, not from a warm re-use. That is what a first-morning user experiences.
Audit write tools for idempotency keys before moving any of them to managed mode.
Tool count check: past roughly 25 tools, evaluate deferred schema loading to reduce input token cost.
Managed execution shrinks code; it does not shrink accountability. The chat looks the same to the user either way. What changes is whether you can see the loops between their question and the assistant's answer - and whether you find out about failures before or after they do.