MCP · June 23, 2026

76% of MCP Tool Schemas Score an F. Loading the Other 24% Still Eats Your Context Budget.

ToolBench graded 218,422 MCP tool definitions in March 2026, and 76.6% scored F. Meanwhile, three production MCP servers consumed 143K of 200K available context tokens before a single query ran.

Roughly one in every 200 MCP servers earns a grade of A from ToolBench, a quality benchmark that evaluated all 218,422 tool definitions indexed across 41,902 public MCP servers in March 2026. The other 199 did not simply fall short on style: 76.6% of all analyzed tool schemas received an outright F, failing on naming clarity, description completeness, or parameter schema rigor. That is the quality side of the problem.

The cost side arrived in parallel. A production audit by API integration company Apideck loaded three mainstream MCP servers concurrently — repositories, team messaging, and error tracking — totaling approximately 40 tool schemas. Those schemas consumed 143,000 of 200,000 available context tokens before a single user query was processed. Quality and context cost are the same crisis viewed from opposite ends.

Method

Quality data comes from the ToolBench benchmark published by Arcade.dev on March 18, 2026. It scores servers across four dimensions: definition quality, protocol compliance, security, and supportability. Per-tool token measurements were drawn from GitHub issue #2808 in the MCP specification repository, where contributors measured token consumption by serializing each tool definition as its full JSON Schema — including field descriptions, type definitions, and nested object structures — via a model provider's token-counting API. The Apideck context audit was published to their engineering blog in spring 2026.

Quality: 76.6% of Tool Schemas Score F

Of 218,422 tool definitions analyzed, 167,333 received an F. The grade-A threshold requires clean and unambiguous tool naming, substantive descriptions on every parameter (not just the tool itself), well-typed enums or properly constrained schemas, and fully valid protocol behavior across initialization and transport. Only 0.5% of servers — roughly 210 of 41,902 indexed — cleared all of it.

ToolBench Quality Grade Distribution: 218,422 MCP Tool Definitions (March 2026)

76.6% of analyzed tool schemas received an F; only 0.5% of servers graded A or above.

Source: Arcade.dev ToolBench

The most common failure modes are mundane: tool names that convey no intent ("execute", "call", "run", "handle"), description fields that restate the parameter name rather than specifying what values are accepted and under what conditions, and missing or incorrect required fields in JSON Schema objects. For an AI assistant deciding which tool to invoke, a schema like this is noise. Selecting the right tool when descriptions are interchangeable or absent becomes a coin flip.
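The three failure modes above can be checked mechanically. The following is a minimal sketch, not ToolBench's actual rubric: the rule set, the `lint_tool` helper, and both example tools are hypothetical, and the only assumption about MCP itself is that a tool definition carries a `name` and an `inputSchema` in JSON Schema form.

```python
# Hypothetical lint rules modelled on the failure modes described above.
# They are illustrative only and are not ToolBench's scoring logic.
VAGUE_NAMES = {"execute", "call", "run", "handle"}


def lint_tool(tool: dict) -> list[str]:
    """Return a list of human-readable problems found in one tool definition."""
    problems = []
    name = tool.get("name", "")
    if name.lower() in VAGUE_NAMES:
        problems.append(f"tool name '{name}' conveys no intent")

    schema = tool.get("inputSchema", {})
    properties = schema.get("properties", {})
    for param, spec in properties.items():
        description = spec.get("description", "").strip()
        plain_param = param.lower().replace("_", " ")
        normalized = description.lower().replace("_", " ").rstrip(".")
        if not description:
            problems.append(f"parameter '{param}' has no description")
        # A description that only repeats the parameter name tells the model nothing.
        elif normalized in {plain_param, f"the {plain_param}"}:
            problems.append(f"parameter '{param}' description restates its name")

    # Every name listed under "required" must also be defined under "properties".
    for param in schema.get("required", []):
        if param not in properties:
            problems.append(f"required parameter '{param}' is not defined")
    return problems


# Hypothetical tool exhibiting all three failure modes.
weak = {
    "name": "run",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "The query"}},
        "required": ["query", "limit"],
    },
}

# Hypothetical tool whose name and description state intent and usage conditions.
strong = {
    "name": "search_open_issues",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": (
                    "Free-text search over issue titles and bodies; use when "
                    "the user names a symptom rather than an issue number."
                ),
            }
        },
        "required": ["query"],
    },
}

for problem in lint_tool(weak):
    print(problem)
print(lint_tool(strong))
```

The weak definition trips one rule per failure mode, while the stronger one passes cleanly. A real grader would need far more nuance than exact string matching, so treat this as a starting point for a pre-publish check rather than a quality score.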

Definition quality carries 50% of the total score for stdio-transport servers and somewhat less for remote HTTP servers where protocol compliance also matters. In either case, a description-quality failure alone is sufficient to earn a failing grade.

The Token Cost of What Passes

Clearing the quality bar does not eliminate token cost — it only makes the cost worth paying. GitHub issue #2808 provides the most granular per-tool token measurement available from the public specification discussion: an 11-tool production server, measured with full JSON Schema serialization.

Token Cost per Tool Definition in an 11-Tool MCP Server (Full JSON Schema Serialization)

Range: 103-1,024 tokens. Simple status tools cost 10x less than complex nested schemas. Total: 8,267 tokens.

Source: GitHub modelcontextprotocol/modelcontextprotocol issue #2808

The variation spans an order of magnitude within a single server: 103 tokens for a simple status-check tool with no parameters, 1,024 tokens for a batch execution tool with nested object schemas and enumerated options. Total across all 11 tools: 8,267 tokens — consumed before the first user message is sent.
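To get a feel for why nested schemas cost so much more than parameterless ones, the sketch below serializes two tool definitions and estimates their size. The issue #2808 measurements used a model provider's token-counting API; this example substitutes an assumed rule of thumb of about four characters per token, so the absolute numbers are not comparable to the published ones. Both tool definitions and the `estimate_tokens` helper are hypothetical.

```python
import json


def estimate_tokens(tool: dict, chars_per_token: float = 4.0) -> int:
    """Approximate token count from the length of the compact JSON form.

    The 4-characters-per-token ratio is an assumed heuristic, not a tokenizer.
    """
    serialized = json.dumps(tool, separators=(",", ":"))
    return max(1, round(len(serialized) / chars_per_token))


# Hypothetical parameterless status tool.
status_tool = {
    "name": "get_service_status",
    "description": "Report whether the service is reachable.",
    "inputSchema": {"type": "object", "properties": {}},
}

# Hypothetical batch tool with a nested object schema and enumerated options.
batch_tool = {
    "name": "run_batch_jobs",
    "description": "Submit several jobs at once; use when the user lists more than one task.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "jobs": {
                "type": "array",
                "description": "Jobs to submit, in execution order.",
                "items": {
                    "type": "object",
                    "properties": {
                        "command": {"type": "string", "description": "Shell command to run."},
                        "priority": {
                            "type": "string",
                            "enum": ["low", "normal", "high"],
                            "description": "Scheduling priority; defaults to normal when omitted.",
                        },
                    },
                    "required": ["command"],
                },
            }
        },
        "required": ["jobs"],
    },
}

for tool in (status_tool, batch_tool):
    print(tool["name"], estimate_tokens(tool))
```

For real budgeting, swap the heuristic for the token counter of the model you actually deploy against, since tokenizers differ in how they split JSON punctuation and key names.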

In the Apideck audit, three MCP servers loaded together, with about 40 tools in total, consumed 143,000 of 200,000 available context tokens, or 71.5% of the window. The model's effective reasoning budget for the actual task is the 57,000 tokens that remain. A 75-comparison study of MCP against equivalent CLI operations found a 4x to 32x token overhead for MCP across task types. The widest gap came from a routine read operation, checking a repository's primary language, which cost 1,365 tokens via CLI and 44,026 via MCP: a 32x difference attributable almost entirely to schema serialization overhead.
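The figures in this paragraph reduce to a few lines of arithmetic. The sketch below uses only numbers quoted in the article; the variable names are illustrative, and the tool count of 40 is the article's approximation.

```python
# Figures quoted from the Apideck audit and the MCP-versus-CLI comparison.
CONTEXT_WINDOW = 200_000
SCHEMA_TOKENS = 143_000
TOOL_COUNT = 40  # approximate, per the audit description

used_share = SCHEMA_TOKENS / CONTEXT_WINDOW   # fraction of the window spent on schemas
remaining = CONTEXT_WINDOW - SCHEMA_TOKENS    # tokens left for the actual task
per_tool = SCHEMA_TOKENS / TOOL_COUNT         # average schema cost per tool

CLI_TOKENS = 1_365
MCP_TOKENS = 44_026
overhead_ratio = MCP_TOKENS / CLI_TOKENS      # widest gap in the comparison study

print(f"schemas use {used_share:.1%} of the window")
print(f"{remaining:,} tokens remain for the task")
print(f"about {per_tool:,.0f} tokens per tool")
print(f"MCP costs {overhead_ratio:.2f}x the CLI call")
```

The per-tool average here works out to about 3,575 tokens, well above the 103 to 1,024 range measured in issue #2808, which suggests the audited servers ship considerably heavier schemas than that 11-tool server.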

What This Means for Site Owners

If you are building or maintaining an MCP server for your site, schema quality is not a secondary concern. Research published in early 2026 on tool description augmentation found that improving descriptions from minimal to substantive yielded a statistically significant 5.85 percentage-point increase in task success rate across a range of AI models and domains. An AI assistant is measurably more likely to call your tool correctly when the schema explains what parameters mean, what values are valid, and under what circumstances the tool applies — rather than just naming the parameter.

The token cost is a fixed per-session overhead that every consumer of your server pays regardless of what work they actually do. At a rough average of 750 tokens per tool, based on the issue #2808 measurements (8,267 tokens across 11 tools), a 20-tool server contributes approximately 15,000 tokens of startup overhead. For high-volume deployments, this accumulates quickly even when prompt caching is active.
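As a back-of-the-envelope check on that estimate, the sketch below derives the per-tool average from the issue #2808 totals and scales it to a 20-tool server. The `startup_overhead` helper and the daily session count are hypothetical, chosen only to show how the fixed cost multiplies.

```python
# Totals reported in issue #2808 for the 11-tool server.
ISSUE_TOTAL_TOKENS = 8_267
ISSUE_TOOL_COUNT = 11


def startup_overhead(tool_count: int, tokens_per_tool: float) -> int:
    """Estimated tokens spent on tool schemas before the first user message."""
    return round(tool_count * tokens_per_tool)


measured_average = ISSUE_TOTAL_TOKENS / ISSUE_TOOL_COUNT
print(f"measured average: {measured_average:.1f} tokens per tool")

per_session = startup_overhead(20, 750)
print(f"20-tool server: {per_session:,} tokens per session")

# Hypothetical traffic figure, used only to illustrate accumulation.
sessions_per_day = 10_000
print(f"daily schema tokens: {per_session * sessions_per_day:,}")
```

Because the overhead is linear in tool count, halving the number of tools a consumer must load halves this figure, which is the reasoning behind splitting broad servers into narrower ones.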

Practical adjustments: keep tool counts per server below 20 where the use case permits, and split broad-purpose servers into narrower ones that consumers can selectively load. Write tool descriptions that answer "in which situation should an agent call this?" rather than "what does this function do?" That framing corresponds to the distinction between A-graded and F-graded schemas in ToolBench's definition-quality rubric. Lazy loading of tool schemas at invocation time, rather than at session initialization, has shown a roughly 85% reduction in startup token cost in early production data.
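One way to picture lazy loading is a registry that exposes only a name and a one-line summary for each tool at startup, then fetches the full schema the first time a tool is actually needed. The sketch below is a hypothetical client-side pattern, not an API defined by the MCP specification or any particular SDK; `LazyToolRegistry`, its methods, and the example tools are all invented for illustration.

```python
import json
from typing import Callable


class LazyToolRegistry:
    """Hypothetical registry: cheap summaries up front, full schemas on demand."""

    def __init__(self) -> None:
        self._summaries: dict[str, str] = {}
        self._loaders: dict[str, Callable[[], dict]] = {}
        self._loaded: dict[str, dict] = {}

    def register(self, name: str, summary: str, loader: Callable[[], dict]) -> None:
        """Record a tool's summary and a callable that produces its full schema."""
        self._summaries[name] = summary
        self._loaders[name] = loader

    def startup_payload(self) -> list[dict]:
        """What would be placed in context at session initialization."""
        return [{"name": n, "summary": s} for n, s in self._summaries.items()]

    def schema_for(self, name: str) -> dict:
        """Load and cache the full schema the first time a tool is needed."""
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded_names(self) -> list[str]:
        """Names of tools whose full schemas have been loaded so far."""
        return sorted(self._loaded)


def issue_search_schema() -> dict:
    # Stand-in for fetching the full definition from the server.
    return {
        "name": "search_open_issues",
        "description": "Search open issues; use when the user describes a symptom.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text search terms."},
                "state": {"type": "string", "enum": ["open", "closed"],
                          "description": "Issue state filter; defaults to open."},
            },
            "required": ["query"],
        },
    }


def status_schema() -> dict:
    return {
        "name": "get_service_status",
        "description": "Report whether the service is reachable.",
        "inputSchema": {"type": "object", "properties": {}},
    }


registry = LazyToolRegistry()
registry.register("search_open_issues", "Find open issues by symptom.", issue_search_schema)
registry.register("get_service_status", "Check service reachability.", status_schema)

startup_chars = len(json.dumps(registry.startup_payload()))
eager_chars = len(json.dumps([issue_search_schema(), status_schema()]))
print("startup payload chars:", startup_chars, "versus eager:", eager_chars)

registry.schema_for("search_open_issues")
print("loaded so far:", registry.loaded_names())
```

The trade-off is an extra round trip on first use of each tool, and the summaries must still be good enough for the model to pick the right tool, so description quality matters just as much under this design.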

The 0.5% of servers earning an A grade in ToolBench's March 2026 snapshot already do these things. The other 99.5% are shipping schemas that fail the models consuming them and burden every session with token overhead that produces no value.
