arXiv: 2604.21816
Authors: Anuj Sadani, Deepak Kumar
Primary category: cs.AI · all: cs.AI
Matched keywords: large language model, llm, agent, agentic, reasoning, attention, latency
TL;DR
Tool Attention is a middleware layer that replaces MCP’s eager schema injection with intent-gated, lazy schema loading — cutting per-turn tool tokens by 95% in simulation and arguing that protocol efficiency, not context length, is the real bottleneck for scalable agentic systems.
Key Ideas
- The “MCP Tax” (10k–60k tokens/turn) inflates the KV cache and pushes context utilization past known reasoning-degradation thresholds (~70%); see the rough arithmetic after this list.
- Generalize self-attention into attention over tools: score, gate, then selectively expose schemas.
- Protocol-level efficiency is a tighter constraint than raw context window size.
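Back-of-the-envelope arithmetic makes the tax concrete; the per-schema figure below is an assumed ballpark, not a number from the paper:

```python
# Rough illustration of the "MCP Tax" under eager schema injection.
tools = 120                  # tool count from the paper's benchmark
tokens_per_schema = 400      # assumption: a typical full JSON tool schema
eager_tax = tools * tokens_per_schema
print(f"eager per-turn tool tokens ≈ {eager_tax:,}")  # 48,000, near the reported 47.3k
```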
Approach
A middleware layer sits between the agent and MCP servers and has three components (a combined code sketch follows the list):
- Intent Schema Overlap (ISO) — sentence-embedding similarity between user intent and tool descriptions.
- State-aware gating — enforces preconditions and access scopes before a tool is exposed.
- Two-phase lazy schema loader — keeps a compact summary pool in context; promotes full JSON schemas only for top-k gated tools.
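A minimal sketch of the three components, assuming sentence-transformers for embeddings; the class and field names (`ToolAttention`, `Tool`, `preconditions`, `scopes`) are illustrative, not the paper's API:

```python
"""Hedged sketch of Tool Attention middleware: ISO scoring, state-aware
gating, and two-phase lazy schema loading. Illustrative only."""
from dataclasses import dataclass, field

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend


@dataclass
class Tool:
    name: str
    summary: str            # compact one-line description kept in context
    full_schema: str        # full JSON schema, loaded lazily
    preconditions: set[str] = field(default_factory=set)  # required agent-state flags
    scopes: set[str] = field(default_factory=set)         # required access scopes


class ToolAttention:
    def __init__(self, tools: list[Tool], top_k: int = 5):
        self.tools = tools
        self.top_k = top_k
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        # Pre-embed tool summaries once; normalize so dot product = cosine.
        self.tool_vecs = self.model.encode(
            [t.summary for t in tools], normalize_embeddings=True
        )

    def iso_scores(self, intent: str) -> np.ndarray:
        """Intent Schema Overlap: cosine similarity of intent vs. summaries."""
        q = self.model.encode([intent], normalize_embeddings=True)[0]
        return self.tool_vecs @ q

    def gate(self, tool: Tool, state: set[str], granted: set[str]) -> bool:
        """State-aware gate: all preconditions met and all scopes granted."""
        return tool.preconditions <= state and tool.scopes <= granted

    def select(self, intent: str, state: set[str], granted: set[str]) -> dict:
        """Two-phase loading: summaries always stay in context; full
        schemas are promoted only for the top-k tools that pass the gate."""
        scores = self.iso_scores(intent)
        gated = [
            (score, tool)
            for score, tool in zip(scores, self.tools)
            if self.gate(tool, state, granted)
        ]
        gated.sort(key=lambda pair: pair[0], reverse=True)
        promoted = [tool for _, tool in gated[: self.top_k]]
        return {
            "summary_pool": [t.summary for t in self.tools],    # phase 1: cheap
            "full_schemas": [t.full_schema for t in promoted],  # phase 2: top-k only
        }
```

In this sketch only the promoted schemas would be serialized into the model context each turn; the remaining tools cost one summary line apiece.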
Experiments
Simulated benchmark: 120 tools across 6 MCP servers, with per-server token counts calibrated to public audits of real MCP deployments. Metric focus: per-turn tool tokens and effective context utilization. Downstream metrics (success, latency, cost, reasoning quality) are projected, not measured.
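For intuition, a hypothetical layout consistent with the reported totals; the server names and per-server splits below are invented placeholders, and only the aggregates (120 tools, 47.3k tokens) come from the paper:

```python
# Hypothetical benchmark layout: six MCP servers, 120 tools. Every
# per-server number is an invented placeholder; only the totals are
# chosen to match the paper's reported aggregates.
SERVERS = {
    "filesystem": {"tools": 12, "schema_tokens": 4_200},
    "github":     {"tools": 30, "schema_tokens": 14_100},
    "slack":      {"tools": 18, "schema_tokens": 6_900},
    "database":   {"tools": 24, "schema_tokens": 10_400},
    "browser":    {"tools": 20, "schema_tokens": 7_300},
    "search":     {"tools": 16, "schema_tokens": 4_400},
}
assert sum(s["tools"] for s in SERVERS.values()) == 120
eager = sum(s["schema_tokens"] for s in SERVERS.values())
print(f"eager per-turn tool tokens: {eager:,}")  # 47,300
```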
Results
- Per-turn tool tokens: 47.3k → 2.4k (−95.0%); sanity-checked after this list.
- Effective context utilization (token-ratio): 24% → 91%.
- Task success / latency / cost / reasoning numbers are projections from token counts plus published telemetry — explicitly flagged, not run against live LLMs.
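The headline reduction is easy to verify from the reported rounded figures:

```python
before, after = 47_300, 2_400  # per-turn tool tokens, eager vs. Tool Attention
print(f"reduction ≈ {1 - after / before:.1%}")  # 94.9%; the paper's 95.0% reflects unrounded counts
```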
Why It Matters
If the measured token savings translate, agent-infrastructure teams can cut MCP overhead by an order of magnitude, keeping context utilization below fracture points without shrinking toolsets. The paper reframes agent scaling as a protocol/middleware problem rather than a longer-context problem.
Connections to Prior Work
- MCP (Anthropic) as the substrate being optimized.
- “Attention Is All You Need” — explicit paradigm lift from token attention to tool attention.
- Retrieval-augmented tool selection (Toolformer, ToolLLM, Gorilla) — similar spirit, different layer.
- Lost-in-the-middle / context fracture literature motivating the ~70% utilization threshold.
Open Questions
- No live-agent end-to-end evaluation — do projected success/latency gains hold empirically?
- Simulation is calibrated but synthetic; behavior on real heterogeneous MCP servers is unknown.
- ISO gating could miss tools needing multi-hop composition; recall/precision of the gate isn’t reported.
- How does Tool Attention interact with caching, streaming, or adversarial/ambiguous intents?
- Sensitivity to embedding model choice and top-k is not characterized.
Figures
- Figure 1: page 2 (rendered)
- Figure 2: page 3 (rendered)
- Figure 3: page 4 (rendered)

Original abstract
The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the “MCP Tax” or “Tools Tax”) that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the “Attention Is All You Need” paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k → 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is available at https://github.com/asadani/tool-attention.