16 levels of building an MCP
Quick Map
For
- MCP builders
- agent platform teams
- API product owners
You'll see
- MCP maturity levels
- Why tool design matters
- How MCP improves agent behavior

How do you rein in a hallucinating genius?
Building an MCP server has levels.
At the beginning, it looks almost too easy - you have a REST API. You wrap each endpoint with an MCP tool. You point an agent at it. The agent calls the tool. Something useful happens.
Beautiful.
And then you watch in horror as the agent confidently selects the wrong tool, passes the wrong argument, forgets the thing you told it five minutes ago, and you realize this is not just an integration problem.
Ah.
It is an interface design problem.
MCP is not only a transport for calling APIs. It is a bridge between deterministic software systems and non-deterministic agents that do not think like humans, do not read docs like humans, and do not always remember the parts of the context you hoped they would remember.
So here is my current ladder for building an MCP server. A ladder is not the best analogy. It is more like an RPG character levelling up. You unlock an ability, discover your earlier choices were wrong, respec a few tools, and keep getting better.
Sorry this list ended up so long. It may be better to treat it as a checklist and compare it to your own experience. Feel free to use an agent to summarize this and compare it against your project.
Level 0: You have an API or CLI
❌ Don’t: treat API, CLI, and MCP as the same interface with different transport.
✅ Do: keep API for deterministic software, CLI for humans and agents, and MCP for agents.
Before MCP, there is usually already something there. Maybe it is a REST API. Maybe it is a CLI. Maybe it is a pile of scripts one senior engineer knows how to run safely. That layer still matters. Direct API is great when another system already knows exactly what it wants.
A CLI is different again. You are writing for humans and increasingly for agents. Humans want readable commands, help, examples, pipes, flags, and good terminal output. Agents can use that too.
But when you write an MCP, you are writing only for agents.
The human benefits from it. The human may approve actions through it. The human may see an MCP App. But the interface itself is for the agent.
MCP does not replace API or CLI. MCP sits above them when the caller is an agent that needs discovery, structure, safety, and guidance.
You are 100% focused on building the most excellent, dedicated interface for agents to wow the user. Every time.
If you build MCP like just another API client, you will stop too early. An MCP is never built once, deployed, and forgotten.
Level 1: You wrap your REST API with MCP
❌ Don’t: expose every endpoint and call the job done.
✅ Do: start small, then treat schemas, parameters, examples, and return shapes as the agent UX.
This is where most of us start. And it is a good start.
You take each REST endpoint and expose it as a tool:
- list_flows
- get_flow
- create_flow
- update_flow
- delete_flow
You learn the scaffold. You see agents discover tools, call them, and handle structured input/output.
Then you watch the agent read those tools and somehow pick the wrong one. It edits when it should inspect.
This is also where you learn a painful truth: schema quality matters more than model quality.
“Prompt engineering” becomes “tool schema engineering”. Descriptions, parameters, examples, defaults, and output shape are how the agent sees the world. Level 1 proves the bridge works. But the Level 1 bridge is not that good.
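To make "tool schema engineering" concrete, here is a minimal sketch of one tool definition as a plain Python dict, plus a tiny lint that flags the gaps an agent would trip over. The `name`, `description`, and `inputSchema` keys follow the shape MCP uses when listing tools. Everything else is an assumption for illustration: the `state` and `limit` parameters, the enum values, and the `missing_schema_hints` helper are hypothetical.

```python
# Hypothetical tool definition for list_flows. The parameters and enum
# values are illustrative assumptions, not a real product's API.
LIST_FLOWS_TOOL = {
    "name": "list_flows",
    "description": (
        "List flows as short summaries. Read-only. "
        "Use get_flow to fetch one full definition."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "state": {
                "type": "string",
                "enum": ["active", "disabled", "draft"],
                "description": "Only return flows in this state.",
            },
            "limit": {
                "type": "integer",
                "minimum": 1,
                "maximum": 100,
                "default": 20,
                "description": "Maximum number of summaries to return.",
            },
        },
        "required": [],
    },
}


def missing_schema_hints(tool: dict) -> list[str]:
    """Return the gaps that make a tool harder for an agent to use."""
    problems = []
    if len(tool.get("description", "")) < 20:
        problems.append("description is missing or too short")
    properties = tool.get("inputSchema", {}).get("properties", {})
    for name, spec in properties.items():
        if "description" not in spec:
            problems.append(f"parameter '{name}' has no description")
        is_free_string = spec.get("type") == "string" and "enum" not in spec
        if is_free_string and "examples" not in spec:
            problems.append(f"parameter '{name}' is a free string with no enum or examples")
    return problems
```

Running a check like this over every tool is a cheap way to notice when a Level 1 wrapper has shipped a bare endpoint with no guidance for the agent.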
Level 2: You add skills
❌ Don’t: assume a wall of instructions will fix confusing tools.
✅ Do: add short skills that teach the agent the workflow, the danger zones, and the right call shape.
So you add skills. Surely this will fix it. You write instructions: when to use each tool, what order to call them in, what is dangerous, how to recover, and what the right call shape looks like.
This helps. For a while.
Then you discover the next problem: skills are text. Text has gravity. It fills context and competes with the user request, tool descriptions, history, and plan. If the skill is too short, it misses platform behavior. If it is too long, the agent skims, forgets, or never loads the part that mattered.
Skills are useful. They are not magic. They teach the agent, but they do not repair a confusing tool surface.
Level 3: You add annotations
❌ Don’t: make the agent infer safety from a tool name.
✅ Do: annotate tools so read, write, destructive, expensive, and approval-required behavior is visible.
Next, you start adding metadata. You annotate tools with hints: read-only, mutating, destructive, idempotent, expensive, or approval-required.
Now your MCP starts to feel less like a box of buttons and more like an interface with warning labels. The agent should not have to infer danger from names alone. Separate read and write operations early. Agents need clear boundaries.
Annotations are not only for the model. They are also for the host: confirmation prompts, warnings, collapsed results, and tool preference. This is when you love your users and let them safely say “approve all read tools”.
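Here is a rough sketch of how a host could turn annotations into an approval policy. The hint names `readOnlyHint`, `destructiveHint`, and `idempotentHint` come from the MCP tool annotations. The policy labels (`auto_approve`, `confirm_once`, `confirm_every_time`), the `legacy_sync` tool, and the choice to treat a missing hint as dangerous are my assumptions, not something the protocol prescribes.

```python
# Example tool list. Names reuse the flow tools from this article;
# legacy_sync is a hypothetical tool that shipped without annotations.
TOOLS = [
    {"name": "list_flows_summary", "annotations": {"readOnlyHint": True}},
    {"name": "get_flow", "annotations": {"readOnlyHint": True}},
    {
        "name": "save_flow_definition",
        "annotations": {
            "readOnlyHint": False,
            "destructiveHint": False,
            "idempotentHint": True,
        },
    },
    {
        "name": "delete_flow",
        "annotations": {"readOnlyHint": False, "destructiveHint": True},
    },
    {"name": "legacy_sync"},
]


def approval_policy(tool: dict) -> str:
    """Map a tool's hints to a host-side approval policy (hypothetical labels)."""
    hints = tool.get("annotations", {})
    if hints.get("readOnlyHint", False):
        return "auto_approve"
    # A missing destructiveHint is treated as destructive, to stay on the safe side.
    if hints.get("destructiveHint", True):
        return "confirm_every_time"
    return "confirm_once"


def auto_approved_tools(tools: list[dict]) -> list[str]:
    """The set a user unlocks when they say 'approve all read tools'."""
    return [t["name"] for t in tools if approval_policy(t) == "auto_approve"]
```

Annotations are hints, so a host should not treat them as a security boundary on their own. They are still enough to make the safe path the easy path.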
Level 4: You rename your tools
❌ Don’t: use names that match your internal API vocabulary.
✅ Do: use names that describe the agent’s intent and make overlap obvious.
Tool names matter more than we want to admit. Humans read docs. Agents often take names at face value. So you rename tools. Not for beauty. For selection.
You change vague names into names that encode intent:
- list becomes list_flows_summary
- save becomes save_flow_definition
- run becomes test_run_flow
Tool names are part of the prompt and one of the strongest selection signals.
If two tools sound similar, the agent will confuse them. If a name hides danger or scope, the agent may choose it for the wrong job.
Tool overlap is dangerous. Humans can inspect two similar commands and decide. Agents often hesitate, pick inconsistently, waste tokens, or loop.
Good names reduce reasoning load for the model and the human reading the transcript later.
Descriptions matter too. Not giant essays. Good metadata. A clear one-line purpose. A note about when not to use it. One example if the shape is not obvious.
The agent should understand the tool from the schema.
Level 5: You combine and drop tools
❌ Don’t: create either a thousand tiny CRUD tools or one giant manage_everything tool.
✅ Do: keep narrow, distinct tools, and add workflow-shaped tools only where the workflow is real.
At some point, you realize the problem is not more tools. You need fewer, better tools.
Wrapping every REST endpoint creates a noisy tool belt. The agent wants a smaller set of useful actions.
So you combine tools: search across resources, fetch useful detail, apply a safe patch, validate before saving.
And you drop internal endpoints, tiny duplicates, dangerous tools, and tools that return huge payloads without a strong use case.
This hurts API-minded developers. We like complete surfaces. Agents like fewer good choices.
But fewer tools does not mean one mega tool. A giant argument object is just a confusing API wearing an MCP badge.
The better pattern is narrow tools that compose well, plus a few workflow-shaped tools where the workflow is real.
Each tool needs a distinct job. If two tools could answer the same request, separate their purpose or remove one.
Also, reduce required parameters. Every required input is another place to guess. If the MCP can infer a default for the agent, do it, and tell the agent you did it.
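A small sketch of "infer the default, then tell the agent you did it". The `DEFAULTS` values and the `apply_defaults` helper are hypothetical. The point is only that the response carries a note for every value the server filled in.

```python
# Hypothetical server-side defaults for optional parameters.
DEFAULTS = {"environment": "dev", "limit": 20}


def apply_defaults(arguments: dict, defaults: dict = DEFAULTS) -> tuple[dict, list[str]]:
    """Fill in missing arguments and report each default that was applied."""
    resolved = dict(arguments)  # copy so the caller's dict is not mutated
    notes = []
    for key, value in defaults.items():
        if key not in resolved:
            resolved[key] = value
            notes.append(f"{key} was not provided; defaulted to {value!r}")
    return resolved, notes
```

Returning the notes alongside the result lets the agent notice a wrong assumption, for example that it queried dev when the user meant prod, and correct it on the next call.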
Level 6: You log what actually happens
❌ Don’t: debug agent behavior from vibes.
✅ Do: log tool selection, parameters, errors, retries, latency, token use, and outcomes.
This level probably belongs earlier. Start logging from day one.
You need to know which tools agents select incorrectly, which arguments they misunderstand, which errors cause loops, which responses are too large, which descriptions are ignored, and which workflows require too many calls.
Without logs, everyone argues from vibes.
With logs, you can see the confusion.
This is the difference between “the model is dumb” and “the tool name made three operations sound identical”.
The model will still be silly. But a lot of silliness is environmental.
Good logs let you fix the environment. Eventually, expose the useful parts back through MCP.
Give agents a way to inspect recent failures, traces, retry history, diagnostics, and correlation ids. If the agent is debugging a workflow, the MCP should be able to show it what actually happened.
Observability is not an add-on once agents operate the system. It becomes one of the tools.
This is also where MCPs become genuinely valuable: runtime visibility, failed run reasoning, historical snapshots, governance context, and environment intelligence.
That is the context agents need when the task is “figure out what is happening here”.
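As a minimal sketch of logging from day one, the decorator below records every tool call, and a second function hands the failures back to the agent. All names here (`logged_tool`, `CALL_LOG`, `recent_failures`, the `flow_` id prefix) are hypothetical, and the log lives in memory only to keep the example self-contained.

```python
import functools
import time
import uuid

# In-memory log for the sketch; a real server would persist this somewhere.
CALL_LOG: list[dict] = []


def logged_tool(func):
    """Record tool name, arguments, outcome, and latency for every call."""
    @functools.wraps(func)
    def wrapper(**arguments):
        entry = {
            "correlation_id": str(uuid.uuid4()),
            "tool": func.__name__,
            "arguments": arguments,
            "outcome": "ok",
            "error": None,
        }
        started = time.perf_counter()
        try:
            return func(**arguments)
        except Exception as exc:
            entry["outcome"] = "error"
            entry["error"] = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            elapsed = time.perf_counter() - started
            entry["latency_ms"] = round(elapsed * 1000, 2)
            CALL_LOG.append(entry)
    return wrapper


@logged_tool
def get_flow(flow_id: str) -> dict:
    # Stand-in for a real lookup; ids are assumed to start with "flow_".
    if not flow_id.startswith("flow_"):
        raise ValueError(f"unknown flow id {flow_id!r}")
    return {"flow_id": flow_id, "state": "active"}


def recent_failures(limit: int = 10) -> list[dict]:
    """A tool the agent itself can call to see what actually went wrong."""
    failures = [entry for entry in CALL_LOG if entry["outcome"] == "error"]
    return failures[-limit:]
```

Token use and wrong-tool selection need data from the host or the model, so they are not captured here. Even this much is enough to stop arguing from vibes.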
Level 7: You implement progressive discovery
❌ Don’t: load your whole product into the first context window.
✅ Do: put search_tools and list_skills at the top, then use search first and fetch details second.
The next problem is context. If your MCP has 12 tools, the agent can probably see them all. If it has 120, the model will not reason over them cleanly forever.
You need to love and appreciate your user’s agent’s context window. Don’t fill it with junk.
So you add progressive discovery, and put the discovery tools right at the top.
Instead of giving the agent everything up front, you provide search_tools, list_skills, get_tool_help, get_workflow_guide, or find_resource_actions.
This changes the interaction. The agent no longer needs to remember every tool. It needs to know the first move, then discover the smaller map.
Search and fetch become the scalable pattern.
Search for the relevant thing. Return stable references. Fetch detail only when needed.
This is how large MCP servers stay usable.
Do not make the first context window carry your whole product. Teach the agent how to ask for the next useful map.
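A toy version of search first, fetch second. The catalog entries and the keyword scoring are assumptions for illustration. A real server might use embeddings or tags, but the shape is the same: `search_tools` returns short references, and `get_tool_help` returns detail for one name.

```python
# Hypothetical catalog: tool name -> one-line summary.
TOOL_CATALOG = {
    "list_flows_summary": "List flows as short summaries. Read-only.",
    "get_flow": "Fetch one flow definition by id. Read-only.",
    "save_flow_definition": "Save an edited flow definition. Mutating.",
    "test_run_flow": "Run a flow once in test mode and return the result.",
    "list_connections_summary": "List connections as short summaries. Read-only.",
}


def search_tools(query: str, limit: int = 3) -> list[dict]:
    """Return the best-matching tools by naive keyword overlap."""
    words = query.lower().split()
    scored = []
    for name, summary in TOOL_CATALOG.items():
        haystack = f"{name} {summary}".lower()
        score = sum(1 for word in words if word in haystack)
        if score:
            scored.append((score, name, summary))
    # Highest score first, then by name so results are stable between calls.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [{"tool": name, "summary": summary} for _, name, summary in scored[:limit]]


def get_tool_help(name: str) -> dict:
    """Fetch detail for one tool, or point the agent back to search."""
    if name not in TOOL_CATALOG:
        return {
            "error": f"No tool named {name!r}.",
            "next": "Call search_tools to find valid names.",
        }
    return {"tool": name, "summary": TOOL_CATALOG[name]}
```

The stable sort matters more than it looks. If the same query returns tools in a different order each time, the agent's behavior drifts for no visible reason.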
Level 8: You add hints and summaries to list responses
❌ Don’t: return raw JSON with 50 fields by default.
✅ Do: return summaries, stable references, temporal context, and let the agent ask for format: "raw" when needed.
Then your logs show another failure mode. The agent did discovery correctly. It found the list. It selected a resource.
But it forgets a tool description. Or it grabs the wrong item. Or the response has too much raw shape and not enough meaning.
So your list responses get smarter: summaries, next-action hints, likely follow-up tools, stable identifiers, ambiguity warnings, counts, previews, pagination, and filters.
Agents do not always want raw JSON with 50 fields. Often, they want the useful summary: name, id, state, owner, last modified, version, purpose, and what to do next.
Then give the tool a format: "raw" argument if the agent really wants the full payload. Let it choose.
Return IDs and references consistently.
Agents chain work across steps. If one tool returns flowId, another returns id, and a third buries the id in a URL, the agent has a puzzle it did not ask for.
Do not make the agent solve your naming history.
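The summary-by-default and consistent-id ideas above can be sketched together. The record shapes (`flowId`, `id`, or an id buried in a `url`) mirror the puzzle described above. The `flow_id` key, the summary fields, and the hint text are hypothetical choices.

```python
def normalize_flow(record: dict) -> dict:
    """Map legacy id shapes onto one consistent key, flow_id."""
    flow_id = record.get("flowId") or record.get("id")
    if flow_id is None and "url" in record:
        # Assumes the id is the last path segment of the URL.
        flow_id = record["url"].rstrip("/").rsplit("/", 1)[-1]
    return {**record, "flow_id": flow_id}


def list_flows(records: list[dict], format: str = "summary") -> dict:
    """Return summaries by default and full payloads only on request.

    The parameter is named format to match the tool argument, even though
    it shadows the Python builtin inside this function.
    """
    flows = [normalize_flow(record) for record in records]
    if format == "raw":
        return {"count": len(flows), "flows": flows}
    summary_keys = ("flow_id", "name", "state", "last_modified")
    return {
        "count": len(flows),
        "flows": [{key: flow.get(key) for key in summary_keys} for flow in flows],
        "hint": "Call get_flow with flow_id for detail, or pass format='raw'.",
    }
```

Normalizing at the MCP layer means the agent sees one id shape even if three backend endpoints disagree.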
Stateful systems need temporal context. What changed? When? By whom? Which version is live?
Time, history, versions, and snapshots stop agents treating the system like a static database.
This is one of my favorite levels because it sounds so ordinary. “Add better summaries to list responses” is not glamorous. But it works.
Agents are compressing the world. If your response gives the wrong compression, they make the wrong next move.
Good summaries preserve the decision-making parts.
Level 9: You add validation and recovery paths
❌ Don’t: return opaque errors or build tools that are dangerous to retry.
✅ Do: validate early, explain recovery, support dry runs, and make retries safe where possible.
At this point, stop assuming the agent will get every call right. It will not. Tools need to validate before mutating, return actionable errors, explain retry safety, suggest the next tool, detect malformed inputs, support dry runs, and prefer small patches.
Bad errors create loops. Good errors create recovery.
The goal is not to hide every platform complexity. It is to expose enough structure so the agent can recover without guessing.
This matters when your MCP fronts a platform with deep internal rules. The raw API error may be correct, but not useful. The MCP can translate it.
Design for retry safety. If a tool fails halfway through, the agent or host may try again.
Make tools idempotent where possible. Return operation ids. Avoid hidden side effects. Make expensive and destructive operations deliberate.
The tool should do what it says on the label. Not three extra things because the old API endpoint did that.
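Here is one possible shape for a tool that validates before mutating, supports a dry run, and is safe to retry. The store, the allowed states, the error codes, and the caller-supplied `operation_id` are all assumptions for the sketch.

```python
# Hypothetical in-memory state for the sketch.
ALLOWED_STATES = {"active", "disabled"}
FLOW_STORE = {"flow_1": {"state": "active"}}
COMPLETED_OPERATIONS: dict[str, dict] = {}


def set_flow_state(flow_id: str, state: str, operation_id: str, dry_run: bool = False) -> dict:
    """Change a flow's state, validating first and replaying safe retries."""
    # A retry with the same operation_id returns the earlier result unchanged.
    if operation_id in COMPLETED_OPERATIONS:
        return {**COMPLETED_OPERATIONS[operation_id], "replayed": True}
    if flow_id not in FLOW_STORE:
        return {
            "ok": False,
            "error": "flow_not_found",
            "retry_safe": True,
            "recovery": "Call list_flows_summary to find a valid flow_id.",
        }
    if state not in ALLOWED_STATES:
        return {
            "ok": False,
            "error": "invalid_state",
            "retry_safe": True,
            "recovery": f"Use one of: {sorted(ALLOWED_STATES)}.",
        }
    change = {"flow_id": flow_id, "from": FLOW_STORE[flow_id]["state"], "to": state}
    if dry_run:
        return {"ok": True, "dry_run": True, "would_change": change}
    FLOW_STORE[flow_id]["state"] = state
    result = {"ok": True, "operation_id": operation_id, "changed": change}
    COMPLETED_OPERATIONS[operation_id] = result
    return result
```

Every failure path tells the agent whether a retry is safe and what to call next, which is the difference between a loop and a recovery.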
Level 10: You add SSO and OBO
❌ Don’t: hide every user behind one service account.
✅ Do: carry the user’s delegated identity, permissions, tenant, and environment through the tool layer.
Eventually, you have to stop pretending authentication is just an API key.
You need single sign-on, delegated identity, on-behalf-of flows, tenant and environment awareness, permission checks, and audit trails that map back to real users.
Auth still needs to feel simple to the agent and the user.
The tool call should not make the agent understand three token types and five consent states.
If an agent is acting for John, the system needs to know it is John. Not “some server token”. Not “the MCP service account”. John.
The agent should only see and do what John is allowed to see and do.
That changes the product experience because the MCP carries the user’s real context.
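This sketch does not implement SSO or a token exchange. It only shows the part after that: once the server has a validated, delegated identity for the user, every tool call is filtered by it. The `UserContext` fields, the `flows:read` permission string, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    """Identity assumed to come from an already-validated delegated token."""
    user_id: str
    tenant: str
    environment: str
    permissions: frozenset


class PermissionDenied(Exception):
    """Raised when the acting user lacks the permission a tool requires."""


# Hypothetical data spanning two tenants and two environments.
ALL_FLOWS = [
    {"flow_id": "f1", "tenant": "acme", "environment": "prod"},
    {"flow_id": "f2", "tenant": "acme", "environment": "dev"},
    {"flow_id": "f3", "tenant": "globex", "environment": "prod"},
]


def list_flows_for(user: UserContext) -> list[str]:
    """Return only the flows this specific user is allowed to see."""
    if "flows:read" not in user.permissions:
        raise PermissionDenied(f"{user.user_id} lacks flows:read")
    return [
        flow["flow_id"]
        for flow in ALL_FLOWS
        if flow["tenant"] == user.tenant and flow["environment"] == user.environment
    ]
```

Notice that the tool takes no token, tenant, or environment argument from the agent. The agent asks for flows, and the server already knows it is John.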
Level 11: You add MCP Apps
❌ Don’t: force humans to approve complex changes from a blob of JSON in chat.
✅ Do: give them visual review, diff, approve, reject, edit, and rollback surfaces.
Some tasks should not be solved with text alone.
That is where MCP Apps become interesting. Instead of asking the agent to describe everything in chat, the MCP can present an app surface: pick a record, review a diff, confirm a deployment, inspect a diagram, edit a structured object, or compare two versions.
This gives the human a better control point.
The user should not have to approve an important operation from a wall of JSON.
Agents are good at doing work. Humans still need good places to look, judge, and decide.
Preview. Simulate. Approve. Reject. Edit. Roll back. Autonomy is useful only if the human still has somewhere to review, interrupt, and steer.
Level 12: You add enterprise guards and role assignments
❌ Don’t: unlock every tool for every user and hope policy catches up.
✅ Do: expose switches, roles, audit logs, prompt-injection defenses, and traceability.
Now the MCP is no longer a developer experiment. It is an enterprise surface.
That means role assignments, tenant policies, environment policies, tool allow lists, dangerous action approvals, data boundaries, prompt-injection defenses, retention, audit logs, admin visibility, and emergency shutoff.
This is where MCP becomes more than “agent can call my API”. It becomes part of the operating model.
The organization needs to know which agents can call which tools, as which users, against which systems.
Different organizations have different risk appetite. Some allow broad inspection. Some allow read-only. Some let a pilot group edit, but not deploy.
You cannot unlock the entire tool set to everyone on day one and hope governance catches up later.
Enterprise needs the switch, role assignment, audit log, and traceability when someone asks, “why did the agent do that?”
This is explainability, not just compliance. The human needs to understand why the agent acted, what evidence it used, and which policy allowed it.
If nobody can explain the path, nobody will trust the outcome.
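The switch, the role assignment, and the audit log can be sketched in a few lines. The role names, the allow lists, and the `KILL_SWITCH` flag are hypothetical. What matters is that every decision, allowed or not, leaves a record with a reason.

```python
# Hypothetical role-to-tool allow lists.
ROLE_ALLOW_LIST = {
    "viewer": {"list_flows_summary", "get_flow"},
    "pilot_editor": {"list_flows_summary", "get_flow", "save_flow_definition"},
}
KILL_SWITCH = {"enabled": False}  # emergency shutoff for all tool calls
AUDIT_LOG: list[dict] = []


def authorize(user_id: str, role: str, tool: str) -> bool:
    """Decide whether a call is allowed and record why."""
    if KILL_SWITCH["enabled"]:
        allowed, reason = False, "emergency shutoff is enabled"
    elif tool in ROLE_ALLOW_LIST.get(role, set()):
        allowed, reason = True, f"role '{role}' allows '{tool}'"
    else:
        allowed, reason = False, f"role '{role}' does not allow '{tool}'"
    AUDIT_LOG.append(
        {"user_id": user_id, "role": role, "tool": tool, "allowed": allowed, "reason": reason}
    )
    return allowed
```

When someone asks "why did the agent do that?", the `reason` field is the start of the answer: which policy allowed the call, for which user.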
Prompt injection belongs here too. An MCP is often connected to documents, tickets, emails, code, logs, and other text the user did not personally write. Some of that text may contain instructions aimed at the agent.
The MCP has to treat untrusted content as data, not authority.
Do not let a string from a document quietly become a command to the tool layer.
Level 13: You design workflows, not just tools
❌ Don’t: make the agent rediscover your product workflow every time.
✅ Do: expose meaningful moves like investigate, diagnose, act, verify, and explain.
At this level, you accept that users do not care about tools.
They care about outcomes.
So the MCP starts to expose workflow-shaped capabilities: create a draft, validate a change, prepare a deployment, compare environments, explain what changed, roll back safely.
This is where operational pain belongs. Not as a slogan in the architecture diagram, but as the source material: repetitive lookup, debugging, monitoring, coordination, safe edits, and explaining what changed.
If there is no painful workflow underneath, MCP does not make the tool useful. It only makes the wrong abstraction easier to call.
The best workflow tools usually follow the same rhythm: investigate, diagnose, act, verify, explain.
Under the hood, these may call many APIs. To the agent, they are meaningful moves.
This is the difference between giving the agent parts and giving it moves it can actually use.
Keep lower-level tools where they are useful. Stop forcing the agent to rediscover your workflow every time.
Level 14: You test agent behavior
❌ Don’t: test only whether the endpoint works.
✅ Do: test whether different agents choose the right tools, recover from failure, and preserve the intended boundaries.
Once your MCP is important, you need regression tests for agent behavior. Not just unit tests for the server.
These are usually called evaluations, because you are checking how agents use the MCP, not only whether the server returns 200.
And not just one agent. All agents.
You need test prompts like: find the right flow but do not edit it, summarize without fetching every full definition, update only this field, recover from a permission error, choose read-only first, refuse dangerous action without approval, and explain why it chose this tool.
Run those evaluations whenever you change tool names, descriptions, annotations, summaries, or skills.
Prompt-facing API changes are real API changes. A rename can break agents. A shorter description can make selection worse.
Different LLMs fail differently. One needs more recovery hints. Another forgets context. Another fails to preserve a small edit boundary.
You will not know unless you measure.
The logs from Level 6 now become analytics.
You can compare LLMs, context shapes, MCP release versions, success rates, recovery rates, wrong-tool selection, retries, token use, latency, tools per task, and where agents still get stuck.
This is also why deterministic outputs matter.
Tool results are not a place for creative writing. Stable formatting makes tests easier, agents more reliable, and regressions visible.
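A minimal sketch of scoring one evaluation. It assumes you already have a recorded trace, meaning the ordered list of tool names an agent called for a prompt. Running the agent itself is out of scope here. `EvalCase`, `score_trace`, and `pass_rate` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One test prompt with the tools it must and must not trigger."""
    prompt: str
    required_tools: set
    forbidden_tools: set = field(default_factory=set)


def score_trace(case: EvalCase, called_tools: list[str]) -> dict:
    """Compare a recorded list of tool calls against the case's boundaries."""
    called = set(called_tools)
    missing = sorted(case.required_tools - called)
    violations = sorted(case.forbidden_tools & called)
    return {
        "passed": not missing and not violations,
        "missing": missing,
        "violations": violations,
        "tool_calls": len(called_tools),
    }


def pass_rate(results: list[dict]) -> float:
    """Fraction of scored traces that passed; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(1 for result in results if result["passed"]) / len(results)
```

Scoring the same cases across several models and several MCP releases gives you the comparison table described above: which rename helped, which shorter description hurt.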
Level 15: You treat MCP like an API product
❌ Don’t: treat tool names, schemas, and summaries as casual prompt text.
✅ Do: version them, document them, evaluate changes, and keep compatibility in mind.
Eventually, the MCP itself needs product discipline. Not just code. A contract.
That means documentation, versioning, backward compatibility, deprecation paths, changelogs, stable schemas, predictable outputs, examples, and easy setup.
This is easy to underestimate because MCP tools feel like prompts and schemas glued together. But prompt-facing API changes are still API changes.
If you rename a tool, remove a field, change a summary shape, or alter an error response, agents may behave differently.
Treat that seriously. Developer experience matters too.
If your MCP is hard to connect, test, or understand, people will not adopt it. Docs, quick starts, example prompts, and sample calls are part of the product too.
And do not force the LLM to guess. Use explicit fields, enums, defaults, and examples. If the agent has to infer your hidden business rule from a vague string, it will eventually infer wrong.
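One way to give a rename a deprecation path is to keep the old name resolvable for a while and say so in the response. This is a sketch under that assumption. The alias table reuses the renames from Level 4, and `resolve_tool` is a hypothetical helper.

```python
# Old name -> new name, taken from the Level 4 renames.
TOOL_ALIASES = {
    "list": "list_flows_summary",
    "save": "save_flow_definition",
    "run": "test_run_flow",
}
CURRENT_TOOLS = set(TOOL_ALIASES.values())


def resolve_tool(name: str) -> dict:
    """Resolve a requested tool name, flagging deprecated aliases."""
    if name in CURRENT_TOOLS:
        return {"tool": name, "deprecated": False}
    if name in TOOL_ALIASES:
        new_name = TOOL_ALIASES[name]
        return {
            "tool": new_name,
            "deprecated": True,
            "notice": f"'{name}' was renamed to '{new_name}'. Use the new name.",
        }
    return {"tool": None, "deprecated": False, "notice": f"Unknown tool '{name}'."}
```

Logging how often the deprecated branch fires tells you when it is safe to remove the alias.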
Level 16: You optimize latency
❌ Don’t: assume slow tools only make the user wait.
✅ Do: make tool chains fast enough that the agent can keep reasoning cleanly.
Latency matters more than you think. Not because agents are impatient. Because long chains are fragile.
If every tool call takes five seconds, the run starts to sag. The host may time out. The agent may make broader guesses to avoid another slow call.
Fast tools make agents feel smarter. Slow tools make agents feel confused.
So cache what can be cached. Return summaries first. Let the agent fetch raw detail only when it needs it. Avoid giant scans unless the user really asked for one.
This is not just performance work. It is agent quality work.
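"Cache what can be cached" can start as small as this. The `TTLCache` class is a hypothetical helper, not part of any MCP SDK, and the injectable clock exists only so the behavior can be tested without sleeping.

```python
import time


class TTLCache:
    """Cache computed values for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        """Return a fresh cached value, or call compute() and store the result."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute()
        self._entries[key] = (now, value)
        return value
```

A short TTL on list summaries is usually a safe first target. Anything the agent is about to mutate should be fetched fresh, or the cache will hand it a stale view of the system.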
The order is not strict
If I were building a new MCP today, I would not follow these levels as a waterfall.
I would start with:
- A small Level 1 wrapper
- Basic logging from Level 6
- Clear names from Level 4
- Read-only annotations from Level 3
- A tiny skill from Level 2
Then I would watch real agent sessions. The logs would tell me whether to rename tools, combine tools, add summaries, or build progressive discovery.
So the levels are useful, but the feedback loop is more important. Do the smallest version. Watch the agent fail. Fix the environment.
Repeat.
The bridge
Direct API and CLI still matter. But MCP is the bridge between your APIs and non-deterministic agents.
A CLI can be written for humans and agents. An MCP should be written for agents first. Really, only agents.
And weirdly, it is the one place where we can convert an agent’s task into behavior that becomes more deterministic over time.
When an agent calls the wrong tool, gets the wrong outcome, forgets the skill, or wanders off on a tangent, we can fix that in the MCP. Rename the tool. Add the annotation. Change the summary. Add the recovery hint. Tighten the guard. Improve discovery.
The model may still be non-deterministic. But the bridge can keep improving.
The best MCPs compress complexity.
They take a complicated backend system and give the agent a simpler mental model: investigate, diagnose, act, verify, explain.
Seeing an agent recover because of something you fixed in the MCP is one of the best feelings in the world.
If all you do is expose every endpoint and hope the model figures it out, you have built Level 1.
Level 1 is a good start.
But we won’t stop there. We will level up.