76% of MCP Tool Schemas Score an F. Loading the Other 24% Still Eats Your Context Budget.
ToolBench graded 218,422 MCP tool definitions in March 2026: 76.6% scored F. Meanwhile three production MCP servers consumed 143K of 200K available context tokens before a single query ran.
Roughly one in every 200 MCP servers earns a grade of A from ToolBench, a quality benchmark that evaluated all 218,422 tool definitions indexed across 41,902 public MCP servers in March 2026. The rest did not simply fall short on style: 76.6% of all analyzed tool schemas received an outright F, failing on naming clarity, description completeness, or parameter schema rigor. That is the quality side of the problem.
The cost side arrived in parallel. A production audit by API integration company Apideck loaded three mainstream MCP servers concurrently — repositories, team messaging, and error tracking — totaling approximately 40 tool schemas. Those schemas consumed 143,000 of 200,000 available context tokens before a single user query was processed. Quality and context cost are the same crisis viewed from opposite ends.
Method
Quality data comes from the ToolBench benchmark published by Arcade.dev on March 18, 2026. It scores servers across four dimensions: definition quality, protocol compliance, security, and supportability. Per-tool token measurements were drawn from GitHub issue #2808 in the MCP specification repository, where contributors measured token consumption by serializing each tool definition as its full JSON Schema (including field descriptions, type definitions, and nested object structures) and passing it through a model provider's token-counting API. The Apideck context audit was published on the company's engineering blog in spring 2026.
Quality: 76.6% of Tool Schemas Score F
Of 218,422 tool definitions analyzed, 167,333 received an F. The grade-A threshold requires clean and unambiguous tool naming, substantive descriptions on every parameter (not just the tool itself), well-typed enums or properly constrained schemas, and fully valid protocol behavior across initialization and transport. Only 0.5% of servers — roughly 210 of 41,902 indexed — cleared all of it.
The most common failure modes are mundane: tool names that convey no intent ("execute", "call", "run", "handle"), description fields that restate the parameter name rather than specifying what values are accepted and under what conditions, and missing or incorrect required fields in JSON Schema objects. For an AI assistant deciding which tool to invoke, a schema like this is noise. Selecting the right tool when descriptions are interchangeable or absent becomes a coin flip.
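The failure modes above are mechanical enough to check automatically. The following is a minimal sketch of such a check, not ToolBench's actual rubric: the function name `lint_tool`, the `VAGUE_NAMES` set, and the matching rule for "restates the parameter name" are illustrative assumptions. It assumes each tool definition is a dict with `name` and `inputSchema` keys.

```python
import re

# Names the article cites as conveying no intent (illustrative, not exhaustive).
VAGUE_NAMES = {"execute", "call", "run", "handle"}


def _normalize(text: str) -> str:
    """Lowercase and drop non-alphanumerics so 'repo_name' matches 'Repo name.'"""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def lint_tool(tool: dict) -> list[str]:
    """Return human-readable problems found in one tool definition."""
    problems = []

    name = tool.get("name", "")
    if name.lower() in VAGUE_NAMES:
        problems.append(f"tool name '{name}' conveys no intent")

    schema = tool.get("inputSchema", {})
    properties = schema.get("properties", {})
    for param, spec in properties.items():
        description = spec.get("description", "")
        if not description:
            problems.append(f"parameter '{param}' has no description")
        elif _normalize(description) == _normalize(param):
            problems.append(f"parameter '{param}' description restates its name")

    # Every name listed under 'required' must be a declared property.
    for param in schema.get("required", []):
        if param not in properties:
            problems.append(f"required parameter '{param}' is not declared")

    return problems


# Hypothetical weak schema exhibiting all three failure modes.
weak_tool = {
    "name": "run",
    "inputSchema": {
        "type": "object",
        "properties": {"repo_name": {"type": "string", "description": "Repo name"}},
        "required": ["repo_name", "branch"],
    },
}

for problem in lint_tool(weak_tool):
    print(problem)
```

A check like this only catches the shallow cases. Whether a description actually tells an agent which values are accepted and under what conditions still needs human review.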
Definition quality carries 50% of the total score for stdio-transport servers and somewhat less for remote HTTP servers where protocol compliance also matters. In either case, a description-quality failure alone is sufficient to earn a failing grade.
The Token Cost of What Passes
Clearing the quality bar does not eliminate token cost — it only makes the cost worth paying. GitHub issue #2808 provides the most granular per-tool token measurement available from the public specification discussion: an 11-tool production server, measured with full JSON Schema serialization.
The variation spans an order of magnitude within a single server: 103 tokens for a simple status-check tool with no parameters, 1,024 tokens for a batch execution tool with nested object schemas and enumerated options. Total across all 11 tools: 8,267 tokens — consumed before the first user message is sent.
In the Apideck audit, three MCP servers loaded together, with roughly 40 tools in total, consumed 143,000 of 200,000 available context tokens. What remains, 57,000 tokens or 28.5% of the window, is the model's effective budget for the actual task. A separate study of 75 comparisons between MCP and equivalent CLI operations found a 4x to 32x token overhead for MCP across task types. The widest gap came from a routine read operation, checking a repository's primary language, at 1,365 tokens via CLI and 44,026 via MCP: a 32x difference attributable almost entirely to schema serialization overhead.
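The arithmetic behind these figures is simple enough to reproduce. The sketch below uses only the numbers quoted above. The variable names are illustrative, and the per-tool figure is approximate because the audit reports "approximately 40" tools.

```python
# Figures quoted from the Apideck audit.
CONTEXT_WINDOW = 200_000
SCHEMA_TOKENS = 143_000
TOOL_COUNT = 40  # approximate, per the audit

remaining = CONTEXT_WINDOW - SCHEMA_TOKENS      # tokens left for the actual task
share_used = SCHEMA_TOKENS / CONTEXT_WINDOW     # fraction of the window spent on schemas
per_tool = SCHEMA_TOKENS / TOOL_COUNT           # rough cost per tool in this audit

# Figures quoted from the MCP-versus-CLI comparison (repository language lookup).
CLI_TOKENS = 1_365
MCP_TOKENS = 44_026
overhead = MCP_TOKENS / CLI_TOKENS

print(f"remaining budget: {remaining:,} tokens")
print(f"window used by schemas: {share_used:.1%}")
print(f"approx. tokens per tool: {per_tool:,.0f}")
print(f"MCP vs CLI overhead: {overhead:.0f}x")
```

The per-tool figure here, about 3,575 tokens, is several times higher than the per-tool measurements in issue #2808, which suggests schema size varies widely between servers.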
What This Means for Site Owners
If you are building or maintaining an MCP server for your site, schema quality is not a secondary concern. Research published in early 2026 on tool description augmentation found that improving descriptions from minimal to substantive yielded a statistically significant 5.85 percentage-point increase in task success rate across a range of AI models and domains. An AI assistant is measurably more likely to call your tool correctly when the schema explains what parameters mean, what values are valid, and under what circumstances the tool applies — rather than just naming the parameter.
The token cost is a fixed per-session overhead that every consumer of your server pays regardless of what work they actually do. At a rough average of 750 tokens per tool, based on the issue #2808 measurements (8,267 tokens across 11 tools), a 20-tool server contributes approximately 15,000 tokens of startup overhead. For high-volume deployments, this accumulates quickly even when prompt caching is active.
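A back-of-the-envelope estimator makes the scaling explicit. This is a sketch under the article's own assumption of a flat per-tool average. The function names and the `sessions_per_day` figure are hypothetical, and real per-tool costs range from about 100 to over 1,000 tokens.

```python
# Measurements quoted from issue #2808.
MEASURED_TOTAL_TOKENS = 8_267
MEASURED_TOOL_COUNT = 11


def average_tokens_per_tool(total_tokens: int, tool_count: int) -> float:
    """Mean schema cost per tool for a measured server."""
    return total_tokens / tool_count


def startup_overhead(tool_count: int, tokens_per_tool: float = 750) -> float:
    """Estimated tokens consumed at session start, assuming a flat average."""
    return tool_count * tokens_per_tool


measured_average = average_tokens_per_tool(MEASURED_TOTAL_TOKENS, MEASURED_TOOL_COUNT)
per_session = startup_overhead(20)

# Hypothetical traffic figure, used only to show how the overhead accumulates.
sessions_per_day = 10_000
per_day = per_session * sessions_per_day

print(f"measured average: {measured_average:.1f} tokens per tool")
print(f"20-tool server: {per_session:,.0f} tokens per session")
print(f"at {sessions_per_day:,} sessions/day: {per_day:,.0f} tokens per day")
```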
Practical adjustments: keep tool counts per server below 20 where the use case permits, and split broad-purpose servers into narrower ones that consumers can selectively load. Write tool descriptions that answer "in which situation should an agent call this?" rather than "what does this function do?"; that framing corresponds to the distinction between A-graded and F-graded schemas in ToolBench's definition-quality rubric. Lazy loading of tool schemas at invocation time, rather than at session initialization, has shown a roughly 85% reduction in startup token cost in early production data.
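One way to picture lazy loading is a registry that exposes only names and short descriptions at session start and hands out the full schema when a tool is actually selected. The sketch below is a hypothetical illustration of that idea, not a mechanism defined by the MCP specification or used by any particular client. `LazyToolRegistry`, `serialized_size`, and the sample tool are all made up for the example, and character count stands in as a crude proxy for tokens.

```python
import json


class LazyToolRegistry:
    """Hypothetical registry: short summaries up front, full schemas on demand."""

    def __init__(self, tools: list[dict]):
        self._tools = {tool["name"]: tool for tool in tools}

    def summaries(self) -> list[dict]:
        """What a client would load at session initialization."""
        return [
            {"name": tool["name"], "description": tool["description"]}
            for tool in self._tools.values()
        ]

    def full_schema(self, name: str) -> dict:
        """What a client would fetch only when it decides to invoke a tool."""
        return self._tools[name]


def serialized_size(payload) -> int:
    """Character count of the compact JSON form; a rough proxy, not a token count."""
    return len(json.dumps(payload, separators=(",", ":")))


# Hypothetical tool whose description says when to call it, not just what it does.
sample_tool = {
    "name": "get_repo_language",
    "description": "Use when the user asks which programming language a repository is mainly written in.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "repo": {
                "type": "string",
                "description": "Repository in owner/name form, for example octo/hello.",
            },
            "include_breakdown": {
                "type": "boolean",
                "description": "Set to true to also return the byte share of every detected language.",
            },
        },
        "required": ["repo"],
    },
}

registry = LazyToolRegistry([sample_tool])
eager_size = serialized_size([registry.full_schema("get_repo_language")])
lazy_size = serialized_size(registry.summaries())
print(f"eager: {eager_size} chars, lazy: {lazy_size} chars")
```

The savings in any real deployment depend on how much of each definition lives in the parameter schema rather than the top-level description, so the 85% figure above should not be expected from this toy example.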
The 0.5% of servers earning an A grade in ToolBench's March 2026 snapshot already do these things. The other 99.5% are shipping schemas that fail the models consuming them and burden every session with token overhead that produces no value.
Sources
- ToolBench: A Quality Benchmark for MCP Servers
- MCP spec should address tool schema token overhead (~1000 tokens/tool consumed per session)
- Your MCP Server Is Eating Your Context Window. There's a Simpler Way
- MCP Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions
- MCP Token Trap: Why Your AI Agent Burns 35x More Tokens Than a CLI