Why LLMs Hallucinate: The Probabilistic Root Cause

LLMs hallucinate because they’re not “looking up” facts: they’re predicting the next token. The training objective is to assign high probability to plausible continuations given context. Plausible doesn’t mean true: the model has learned patterns like “the capital of X is Y” and “according to study Z,” so it can generate confident, grammatical, and completely false statements. There’s no separate “truth check” in the forward pass; the only signal is statistical.

Why it happens: the model has seen many texts that look authoritative (wrong Wikipedia edits, forum posts, confabulations in training data). It has also learned that sounding confident is rewarded in dialogue. So when it doesn’t know, it often still produces something that “fits” the context and the prompt. Low-probability tokens can still be sampled (especially at higher temperature), so rare or wrong answers can appear. And the model has no persistent memory of “I already said X”; it can contradict itself in the same conversation.
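The temperature effect is easy to see with a toy next-token distribution. This sketch (not a real model, just a softmax over made-up logits) shows how raising temperature flattens the distribution, so a plausible-but-wrong token gains probability mass:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Softmax over temperature-scaled logits, then sample one index."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i, probs
    return len(probs) - 1, probs

# Toy distribution: "Paris" is likely, "Lyon" is a plausible-but-wrong tail.
tokens = ["Paris", "Lyon", "banana"]
logits = [5.0, 2.0, -3.0]

_, probs_low = sample_with_temperature(logits, temperature=0.2)
_, probs_high = sample_with_temperature(logits, temperature=2.0)
# Higher temperature flattens the distribution: the wrong token becomes
# more likely to be sampled, even though the logits haven't changed.
```

Lowering temperature makes the model more conservative, but as the post says, it doesn’t remove the root cause: the top token is still only the most *plausible* one.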

Mitigations are external: RAG (retrieve real docs and put them in context), tool use (call an API or DB instead of inventing), structured output (force a schema so the model has to fill slots), and post-hoc checks (fact-check, cite sources). You can also reduce temperature and use decoding constraints to make the model more conservative, but that doesn’t remove the underlying cause.

Understanding the probabilistic root cause helps you design systems that don’t over-trust the model. Never treat raw model output as ground truth for facts, names, or numbers. Always have a path to verify or ground.

Expect continued work on “truthfulness” and citation in models, but the fundamental issue (next-token prediction is not truth-tracking) will stay. Design around it.

nJoy πŸ˜‰

Composing AI Systems: MCP and A2A Orchestration Patterns

Once you have MCP for tools and something like A2A for agent-to-agent handoffs, the next step is composing them: one system that uses both. For example, your main agent might call MCP tools (search, DB, API) and also delegate sub-tasks to other agents via A2A. The orchestrator (often an LLM or a small controller) decides when to use a tool and when to call another agent, and it merges results back into the main flow.

Orchestration patterns vary. You can have a single orchestrator that never delegates and only uses MCP tools. You can have a tree: the top agent delegates to specialists, and each specialist might use tools or delegate again. You can have a pipeline: agent A does step 1, hands to B for step 2, then to C for step 3. The choice depends on the task. Composing MCP and A2A means the orchestrator has two kinds of “actions”: call a tool (MCP) or delegate to an agent (A2A). Both return something the orchestrator can reason over.
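The two-action-kinds idea can be sketched as a small dispatch function. Everything here is illustrative: the action dict shape, the registries, and the lambdas standing in for real MCP servers and A2A peers are all assumptions, not any framework’s API:

```python
# Hypothetical orchestrator step: the model proposes an action, and the
# controller routes it either to an MCP tool or to an A2A delegate.

def run_action(action, tools, agents):
    """Route one model-proposed action to a tool call or an agent handoff."""
    if action["kind"] == "tool":          # MCP: call a tool the host exposes
        fn = tools[action["name"]]
        return {"source": "tool", "result": fn(**action["args"])}
    elif action["kind"] == "delegate":    # A2A: hand a sub-task to another agent
        agent = agents[action["name"]]
        return {"source": "agent", "result": agent(action["task"])}
    raise ValueError(f"unknown action kind: {action['kind']}")

# Toy registries standing in for real MCP servers and A2A peers.
tools = {"search_docs": lambda query: f"3 hits for {query!r}"}
agents = {"researcher": lambda task: f"report on {task!r}"}

r1 = run_action({"kind": "tool", "name": "search_docs",
                 "args": {"query": "MCP"}}, tools, agents)
r2 = run_action({"kind": "delegate", "name": "researcher",
                 "task": "compare A2A specs"}, tools, agents)
```

Both branches return the same shape, which is the point: the orchestrator can merge tool results and agent results into one reasoning stream.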

Design questions: who owns the overall goal and context? Usually the top-level agent or the host. How do you avoid infinite delegation? Step limits, timeouts, and clear contracts (e.g. “return by time T”). How do you handle partial failure? One agent might fail; the orchestrator needs a strategy (retry, fallback, or abort and report).

In production you’ll also care about observability (tracing which agent did what), cost (each agent call may hit an API), and security (which agents can call which tools or other agents). Composing MCP and A2A gives flexibility; making that composition reliable and debuggable is the next layer of work.

Expect more frameworks that support both MCP and A2A in one stack, and clearer patterns for when to delegate versus when to call a tool.

nJoy πŸ˜‰

A2A Protocol: How AI Agents Negotiate and Delegate

A2A stands for Agent-to-Agent protocol. It’s a way for one AI agent (or assistant) to delegate work to another: for example, a “planner” agent that hands off a research task to a “researcher” agent and then gets a result back. The idea is to standardise how agents discover each other, describe capabilities, and exchange requests and responses so that multi-agent systems can be composed without every team building custom glue.

In practice A2A defines how an agent advertises what it can do (e.g. “I can answer questions about X” or “I can run tool Y”) and how another agent (or orchestrator) sends a request and gets a response. That might be over HTTP with a standard envelope: request ID, from/to, payload, and maybe a deadline or priority. The response might be a direct answer or a reference to a longer-running task that the caller can poll for.

Negotiation and delegation mean the caller can choose among several agents (e.g. by capability or load) and the callee can accept, reject, or redirect. That’s useful when you have multiple specialised agents and want the system to route work automatically. Trust and auth are part of the design: the caller needs to know the callee is who it claims to be, and the callee may enforce quotas or policies.

A2A is still emerging; different frameworks and vendors are proposing similar ideas. The core value is the same as MCP for tools: one protocol so that agents from different builders can work together. Expect to see A2A (or something like it) in products that position themselves as “agent platforms” where you plug in multiple agents and an orchestrator routes and composes them.

Expect more formal specs and implementations. The real test is whether multiple vendors and open-source projects adopt the same protocol.

nJoy πŸ˜‰

What Makes a Good MCP Server: Tools, Resources, and Prompts

A good MCP server does three things well: it exposes tools that are easy for the model to use, it offers resources that add real context, and (optionally) it provides prompts that steer the model toward correct usage. Tools should have clear names and descriptions and arguments that match what the model can reason about. Vague tools (“do stuff”) or huge argument schemas lead to misuse or confusion. Small, focused tools (e.g. “search_docs”, “run_query”) work better.
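A focused tool definition in the JSON-Schema style most tool protocols use might look like this. The exact envelope an MCP server emits differs by SDK; treat the dict shape here as an assumption for illustration:

```python
# Hypothetical tool definition: small scope, clear description, few arguments.
search_docs_tool = {
    "name": "search_docs",
    "description": "Full-text search over the product docs. Returns top matches.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms"},
            "limit": {"type": "integer", "description": "Max results", "default": 5},
        },
        "required": ["query"],
    },
}

def validate_args(tool, args):
    """Minimal required-field check before dispatching a call."""
    missing = [k for k in tool["inputSchema"]["required"] if k not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return True

ok = validate_args(search_docs_tool, {"query": "rate limits"})
```

Note what’s absent: no grab-bag “options” object, no dozen optional flags. A model can hold this whole schema in its head, which is what makes the tool usable.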

Resources are read-only inputs: the model (or host) can request a URI and get content. That’s useful for injecting docs, config, or live data into context without the model “calling” something. Design resources so that the URI is predictable and the content is concise; giant dumps hurt. Prompts are optional templates (e.g. “summarise this resource”) that the host can offer to the model. They help when the model needs a nudge toward a specific use of the server.

Error handling matters: when a tool fails, return a clear message so the model can retry or explain. Timeouts and rate limits should be visible to the model when possible (“Tool X is rate limited; try again in N seconds”) so it can adapt. Auth is the host’s job in most setups (the server trusts the host), but if your server has its own auth, document it so host authors can configure it.

Keep the server stateless where you can: each request should be self-contained so you can scale and restart without losing context. If you need state (e.g. a session), make it explicit (e.g. a session_id argument) rather than implicit.

A good MCP server feels like a sharp, predictable API that the model can actually use. Clarity and scope beat feature count.

nJoy πŸ˜‰

MCP Explained: The USB Standard for AI Tools

MCP stands for Model Context Protocol. In practice it’s a standard way for an AI application (a host) to talk to tools and data sources (servers). A host (Cursor, Claude Desktop, or a custom app) discovers and calls MCP servers over a defined transport (e.g. stdio or HTTP). Each server exposes tools (actions the model can request), resources (read-only data the model can pull in), and optionally prompts (templates). So instead of every app inventing its own plugin system, MCP gives you one protocol: like USB for AI tools.

From the host’s point of view you add a server by config (name, transport, args). The host then fetches the server’s capability list: what tools exist, what arguments they take, what resources are available. When the model wants to call a tool, the host sends the request to the right server and passes the result back to the model. The model never talks to the server directly; the host is the broker. That keeps auth, rate limits, and safety in one place.
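The host-side flow can be sketched like this. The config keys mirror common MCP host configs, but treat the exact shape as an assumption, and the FakeServer class stands in for a real spawned stdio or HTTP server process:

```python
# Hypothetical host config: one server named "docs", run over stdio.
server_config = {
    "docs": {"transport": "stdio", "command": "docs-mcp-server",
             "args": ["--root", "./docs"]}
}

class FakeServer:
    """Stands in for a spawned MCP server process."""
    def list_tools(self):
        return [{"name": "search_docs", "inputSchema": {"type": "object"}}]
    def call_tool(self, name, args):
        return {"content": f"called {name} with {args}"}

def connect(config):
    # A real host would spawn the process or open the HTTP connection here.
    return {name: FakeServer() for name in config}

servers = connect(server_config)
capabilities = {name: s.list_tools() for name, s in servers.items()}   # discovery
result = servers["docs"].call_tool("search_docs", {"query": "auth"})   # brokered call
```

The model only ever sees `capabilities` and `result`; the host owns the connection, which is where auth, rate limits, and safety policy live.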

Servers can be local (a process on your machine) or remote (HTTP). They can wrap existing APIs, file systems, or custom logic. The protocol is transport-agnostic so you can run the same server over stdio in dev and HTTP in production. Tool schemas are JSON-based so they’re easy to generate and validate.

MCP doesn’t solve everything: you still have to build and secure the servers, and the model still has to choose the right tool and arguments. But it solves the “how do we plug tools in?” problem in a way that’s reusable across hosts and models. That’s why it’s spreading: one protocol, many apps and servers.

Expect more MCP servers in the wild and tighter integration in major AI products. The value is in the ecosystem, not any single implementation.

nJoy πŸ˜‰

Agent Memory: Short-Term Context, Long-Term State, and the Gap Between

Agents have two kinds of “memory”: the context window (short-term) and everything else (long-term). Short-term is what you send in each request: the conversation so far, maybe a summary of older turns, plus any retrieved docs or tool results. That’s limited (e.g. 128K tokens) and expensive. Long-term would be a persistent store: facts about the user, past decisions, or project state that survives across sessions. Today most agents don’t have a real long-term memory; they get a fresh context each time or a hand-built “summary” that you inject.

The gap shows up when you want an agent that remembers your preferences, what it did last week, or the current state of a long project. Without long-term memory, you have to tell it again or rely on RAG over past transcripts. That works up to a point, but retrieval isn’t the same as “knowing”: the model might not get the right chunk or might contradict what it “remembered” before. True long-term memory would mean the agent updates a store (e.g. a knowledge graph or structured DB) and reads from it at the start of each run; that’s still an open design problem.

Short-term is also a design choice: do you keep every message, or do you summarise old turns to save space? Summarisation loses detail; keeping everything hits context limits. Many systems use a sliding window plus a running summary. Tool results can be truncated or summarised too, so the model sees “the answer was X” instead of a 10K-character dump.
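The sliding-window-plus-summary scheme can be sketched in a few lines. The summarise() function here is a stub; in a real system an LLM call would produce the summary:

```python
def summarise(messages):
    """Stub: a real implementation would call the model to compress old turns."""
    return f"[summary of {len(messages)} older messages]"

def build_context(history, window=4):
    """Keep the last `window` messages verbatim; fold the rest into a summary."""
    if len(history) <= window:
        return list(history)
    older, recent = history[:-window], history[-window:]
    return [{"role": "system", "content": summarise(older)}] + list(recent)

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
ctx = build_context(history, window=4)
# ctx is one summary message followed by the 4 most recent turns, so the
# request stays bounded no matter how long the conversation runs.
```

The trade-off the text describes lives in `window`: larger keeps more detail but costs more tokens; smaller leans harder on the lossy summary.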

Until we have standard, reliable long-term memory, agents will stay best at single-session or well-scoped tasks. Progress will come from better retrieval, better summarisation, and eventually learned or hybrid memory that the agent can read and update safely.

Expect more work on agent memory architectures and on grounding agents in external state (databases, docs) as a stand-in for true long-term memory.

nJoy πŸ˜‰

Multi-Agent Systems: Coordination, Trust, and Failure Modes

Multi-agent systems use more than one agent (or role) to accomplish a task. One might do research, another write code, another review; they hand off work or debate until they converge. The appeal is specialisation and checks-and-balances: different “brains” for different steps, and one can catch another’s mistake. The cost is coordination: who does what, when do you stop, and how do you avoid loops or contradictions.

Coordination patterns vary. You can have a supervisor that assigns subtasks to worker agents and merges results. You can have a debate: two agents argue for different answers and a third decides. You can have a pipeline: agent A’s output is agent B’s input. The common thread is that each agent has a clear role and a defined interface (input/output or shared state), and the system has a protocol for handoffs and termination.
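The supervisor pattern from the list above can be sketched as follows. The workers are plain functions standing in for LLM agents, and the role names are illustrative:

```python
def supervisor(goal, workers, plan):
    """Run each (role, subtask) in order and collect outputs."""
    results = {}
    for role, subtask in plan:
        # Later steps can read earlier output, which is the handoff.
        results[role] = workers[role](subtask, results)
    return results

workers = {
    "research": lambda task, prior: f"notes on {task!r}",
    "write":    lambda task, prior: f"draft using {prior['research']}",
    "review":   lambda task, prior: f"approved: {prior['write']}",
}
plan = [("research", "agent protocols"),
        ("write", "blog post"),
        ("review", "final check")]
out = supervisor("publish a post", workers, plan)
```

Even in this toy form, the protocol requirements show up: each role has a defined input/output, the handoff is explicit (`prior`), and termination is guaranteed because the plan is finite.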

Trust is tricky. You’re still relying on LLMs to follow the protocol, stay in role, and not hallucinate or contradict each other. In practice you need guardrails: max steps, validation of handoffs, and sometimes human approval for high-stakes steps. Failure modes include infinite loops (agents keep delegating), conflicting answers (no clear winner), and one agent undoing another’s work.

Multi-agent setups are most useful when the task naturally splits (e.g. research + synthesis + writing) or when you want redundancy (e.g. two agents propose, one adjudicates). For many use cases a single capable agent with good tools is simpler and easier to debug.

Expect more frameworks and patterns for multi-agent coordination, and clearer guidance on when the extra complexity pays off.

nJoy πŸ˜‰

Tool Use and Function Calling: From Simple APIs to Full Autonomy

Tool use is how an LLM goes from “text in, text out” to “text in, call an API, get data, then text out.” The model is given a list of tools (name, description, parameters) and can request a call; your code runs the call and returns the result to the model. That’s the basis of search assistants, code runners, and custom integrations. Function calling is the same idea with a more formal name: the model outputs a structured call (function name + arguments) instead of free text, which makes parsing and execution reliable.

Most APIs (OpenAI, Anthropic, Google) support a tools or functions parameter: you pass a schema (e.g. JSON Schema for the arguments), and the model responds with something like tool_calls: [{"name": "get_weather", "arguments": {"city": "London"}}]. You execute that, then send the result back in a follow-up message. The model can then call another tool or give a final answer. The loop is: user message → model (maybe with tool_calls) → you run tools → you send tool results → model again.
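That loop can be sketched end-to-end. FakeModel stands in for the provider API; real SDKs differ in field names (this is the common pattern, not any vendor’s exact response shape), and get_weather is a stub tool:

```python
import json

def get_weather(city):
    return {"city": city, "temp_c": 14}          # stub tool

TOOLS = {"get_weather": get_weather}

class FakeModel:
    """First turn: emit a tool call. Second turn: answer using the result."""
    def complete(self, messages):
        if messages[-1]["role"] == "tool":
            data = json.loads(messages[-1]["content"])
            return {"content": f"It's {data['temp_c']}C in {data['city']}."}
        return {"tool_calls": [{"name": "get_weather",
                                "arguments": {"city": "London"}}]}

def run(model, user_msg):
    messages = [{"role": "user", "content": user_msg}]
    while True:
        reply = model.complete(messages)
        calls = reply.get("tool_calls")
        if not calls:
            return reply["content"]               # final answer, loop ends
        for call in calls:                        # your code runs the tools
            result = TOOLS[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})

answer = run(FakeModel(), "What's the weather in London?")
```

The key structural point: the model never executes anything. It emits a structured request, your code runs it, and the result goes back as another message.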

Good tool design matters: clear names, concise descriptions, and well-scoped parameters. Too many tools or vague descriptions and the model gets confused; too few and it can’t do the job. You also need to handle errors (tool failed, timeout) and sometimes rate limits or auth. Passing back “Tool failed: …” as the observation lets the model retry or explain the failure to the user.

From simple APIs (one or two tools) to full autonomy (dozens of tools, multi-step plans), the same pattern holds: the model decides when and how to call; your code enforces safety and runs the call. That separation keeps the LLM in the “reasoning” role and keeps dangerous or privileged actions under your control.

Expect more standardisation (e.g. OpenAPI-based tool discovery) and better models that follow tool schemas more reliably.

nJoy πŸ˜‰

The ReAct Pattern: Think, Act, Observe – How Agents Reason

ReAct stands for Reasoning and Acting. It’s a pattern for getting LLMs to use tools in a structured way: the model outputs a thought (reasoning), then an action (e.g. call this tool with these args), then you run the action and feed the result back as the observation. The model then reasons again and either takes another action or gives a final answer. That alternation of reasoning and acting reduces the chance the model “hallucinates” a tool result or skips a step.

In practice you prompt the model with a format like “Thought: … Action: tool_name(args). Observation: …” and parse the output to extract the action. You execute the action (API call, code run, search), append “Observation: <result>” to the conversation, and send it back. The model’s next turn is another Thought and possibly another Action. You keep going until the model outputs “Final Answer: …” or you hit a step limit.
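The parse-and-execute side of that loop might look like this. The model turns are scripted here; a real loop would call an LLM each iteration, and the regex would match whatever output format your prompt enforces:

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\((.*)\)")

def parse_action(text):
    """Extract (tool name, raw args), or (None, answer) for a final answer."""
    if "Final Answer:" in text:
        return None, text.split("Final Answer:", 1)[1].strip()
    m = ACTION_RE.search(text)
    return m.group(1), m.group(2)

# Scripted stand-in for two model turns.
scripted_turns = [
    "Thought: I need the population. Action: search(Tokyo population)",
    "Thought: I have it now. Final Answer: about 14 million",
]
tools = {"search": lambda q: f"top result for {q!r}"}

transcript, answer = [], None
for turn in scripted_turns:                 # a real loop would re-prompt the model
    name, payload = parse_action(turn)
    if name is None:
        answer = payload
        break
    transcript.append(f"Observation: {tools[name](payload)}")
```

This also shows why malformed actions are a real failure mode: if the regex doesn’t match, you need a fallback (re-prompt, retry, or abort) rather than crashing the loop.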

ReAct doesn’t require special training; it’s a prompting and parsing discipline. Many agent frameworks (LangChain, AutoGPT-style setups, and custom loops) implement something like it. The key is that the model “sees” the result of each action before deciding the next one, so it can recover from mistakes and chain multiple tools.

The downside is token cost: every thought and action stays in the context, so long tasks can blow up context length. You often summarise or truncate old steps. Also, the model can still output malformed actions or get stuck in loops, so you need timeouts and retries.

ReAct is one of the main patterns that make agents practical today. Expect it to evolve into more structured formats (e.g. JSON actions) and tighter integration with function-calling APIs.

nJoy πŸ˜‰

What Is an AI Agent? Beyond the Chatbot

An AI agent is a system that doesn’t just answer one question and stop: it takes a goal, chooses actions (often by calling tools or APIs), observes what happened, and loops until the goal is met or it gives up. That’s a different beast from a chatbot that runs one prompt and returns one reply. Agents are why you can say “summarise my inbox and draft three replies” and the system actually opens the inbox, reads, calls the LLM, writes drafts, and shows them to you.

The core loop is simple: state (goal, context) → decide (what to do next) → act (run a tool or ask the LLM) → observe (get tool output or model output) → update state → repeat. The hard part is making “decide” and “act” robust: the model has to pick the right tool, pass the right arguments, handle errors, and not loop forever. That’s where frameworks and patterns like ReAct, tool-calling APIs, and guardrails come in.
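That loop, with a step limit, fits in a short sketch. The decide() function is a scripted stand-in for the LLM, and a real agent would also handle tool errors and timeouts:

```python
def run_agent(goal, decide, tools, max_steps=10):
    state = {"goal": goal, "observations": [], "done": False, "answer": None}
    for _ in range(max_steps):                      # step limit: never loop forever
        action = decide(state)                      # the model picks the next action
        if action["type"] == "finish":
            state["answer"], state["done"] = action["answer"], True
            break
        obs = tools[action["tool"]](action["args"]) # act, then observe
        state["observations"].append(obs)           # update state, repeat
    return state

def decide(state):
    """Scripted stand-in for a model: look once, then finish."""
    if not state["observations"]:
        return {"type": "tool", "tool": "lookup", "args": "inbox"}
    return {"type": "finish",
            "answer": f"done after {len(state['observations'])} step(s)"}

final = run_agent("summarise my inbox", decide,
                  {"lookup": lambda a: f"read {a}"})
```

The `max_steps` bound is doing real work: without it, a confused model that never emits "finish" would spin forever, which is exactly the failure mode production agents guard against.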

Agents can be single-LLM (one model that reasons and picks tools) or multi-agent (several specialised models or roles that hand off to each other). They can have access to search, code execution, file systems, or custom APIs. The unifying idea is autonomy within a defined scope: you set the goal and the rules, the agent does the steps.

Failure modes are real: the model might call the wrong tool, misparse output, or spin on a task without making progress. Production agents need timeouts, step limits, and human-in-the-loop checkpoints for anything irreversible. The best agents are also observable: you can see the reasoning and tool calls so you can debug and improve them.

Agents are where LLMs move from “helpful assistant” to “autonomous worker.” Expect more tooling, better tool schemas, and clearer patterns for when to use agents versus a single call.

nJoy πŸ˜‰