arXiv: 2604.21816 · PDF

Authors: Anuj Sadani, Deepak Kumar

Primary category: cs.AI · All categories: cs.AI

Matched keywords: large language model, llm, agent, agentic, reasoning, attention, latency


TL;DR

Proposes Tool Attention, a middleware layer that combines intent-schema embedding similarity, state-aware gating, and two-phase lazy loading to cut MCP's per-turn tool-token overhead by 95%, mitigating the "MCP Tax".

Key Ideas

  • MCP's eager schema injection consumes 10k–60k tokens per turn in multi-server deployments, inflating the KV cache and triggering reasoning degradation at around 70% context utilization (a back-of-the-envelope sketch of this overhead follows the list).
  • The "Attention Is All You Need" paradigm can be generalized from token-level self-attention to tool-level gated attention.
  • The bottleneck for scalable agents is protocol-layer efficiency, not raw context length.
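
To make the scale of the eager-injection overhead concrete, here is a rough Python sketch. The server names and per-server token counts are illustrative assumptions (chosen to sum to the benchmark's 47.3k-token baseline), not per-server figures reported in the paper.

```python
# Rough model of the per-turn "MCP Tax" under eager schema injection:
# every connected server's full tool schemas are re-serialized into the
# prompt on every turn, whether or not any of its tools are relevant.
# Server names and token counts below are illustrative assumptions.
SCHEMA_TOKENS_PER_SERVER = {
    "filesystem": 6_200,
    "github": 11_500,
    "slack": 7_800,
    "postgres": 9_100,
    "browser": 8_400,
    "search": 4_300,
}

def eager_overhead(servers: dict[str, int]) -> int:
    # Eager injection pays the full schema cost of every server, every turn.
    return sum(servers.values())

per_turn = eager_overhead(SCHEMA_TOKENS_PER_SERVER)
print(f"per-turn tool tokens: {per_turn:,}")              # 47,300 -- within 10k-60k
print(f"20-turn session cost: {20 * per_turn:,} tokens")  # 946,000
```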

Method

Tool Attention consists of three components (sketched in code below):

  • ISO (Intent Schema Overlap) score: measures the semantic overlap between the user's intent and each tool's schema via sentence embeddings.
  • State-aware gating: enforces preconditions and access scopes, filtering out tools that are unusable in the current state.
  • Two-phase lazy schema loader: keeps only a compact pool of schema summaries in context and promotes the full JSON schema only for the top-k tools that pass the gate.

Tool Attention is deployed as middleware; the LLM itself is left unmodified.
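
A minimal sketch of the pipeline, assuming sentence-transformers for the embeddings; the names (`Tool`, `iso_score`, `gate`, `select_tools`), the embedding model, and the interfaces are all illustrative, not the paper's actual API:

```python
# Illustrative sketch of the three Tool Attention stages; names and
# interfaces are assumptions, not the paper's implementation.
from dataclasses import dataclass, field

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

@dataclass
class Tool:
    name: str
    summary: str                 # compact summary kept in context (phase 1)
    full_schema: dict            # full JSON schema, promoted lazily (phase 2)
    preconditions: set[str] = field(default_factory=set)
    scopes: set[str] = field(default_factory=set)

def iso_score(intent: str, tool: Tool) -> float:
    """Intent Schema Overlap: embedding similarity of intent vs. tool summary."""
    e_i = model.encode(intent, convert_to_tensor=True)
    e_t = model.encode(tool.summary, convert_to_tensor=True)
    return util.cos_sim(e_i, e_t).item()

def gate(tool: Tool, state: set[str], granted_scopes: set[str]) -> bool:
    """State-aware gating: all preconditions hold and all scopes are granted."""
    return tool.preconditions <= state and tool.scopes <= granted_scopes

def select_tools(intent: str, tools: list[Tool], state: set[str],
                 granted_scopes: set[str], k: int = 3) -> list[dict]:
    """Two-phase lazy loading: rank gated tools by ISO, promote only top-k schemas."""
    usable = [t for t in tools if gate(t, state, granted_scopes)]
    ranked = sorted(usable, key=lambda t: iso_score(intent, t), reverse=True)
    return [t.full_schema for t in ranked[:k]]   # only these enter the prompt
```

Each turn, only the compact summaries plus the k promoted schemas occupy context, which is where the per-turn savings over eager injection come from.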

Experiments

  • Benchmark: a simulated environment with 120 tools across six MCP servers; per-server token counts are calibrated against public audits of real MCP deployments.
  • Metrics: per-turn tool token count and effective context utilization are measured directly; task success, latency, cost, and reasoning quality are reported only as projections.
  • Baseline: default MCP behavior with eager schema injection.

Results

  • Per-turn tool tokens: 47.3k → 2.4k, a 95.0% reduction (see the arithmetic check after this list).
  • Effective context utilization (a token-ratio quantity): 24% → 91%.
  • End-to-end success rate, latency, cost, and reasoning quality are all projections derived from the measured token counts plus published telemetry; the paper marks them as such explicitly, and they are not measured on live LLM agents.
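
As a quick sanity check, the headline reduction follows from the rounded token counts above:

$$\frac{47.3\text{k} - 2.4\text{k}}{47.3\text{k}} = \frac{44.9}{47.3} \approx 0.949$$

i.e. about a 94.9% reduction from the rounded figures, consistent with the reported 95.0% once unrounded token counts are used.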

Why It Matters

For practitioners building agent / MCP infrastructure, the paper offers a middleware that can be integrated directly, mitigating the per-turn token tax and KV-cache bloat of multi-server deployments, postponing the onset of context fracture, and lowering per-turn inference cost.

Relation to Prior Work

  • Carries the attention idea of the Transformer's "Attention Is All You Need" over from tokens to tools, in gated form.
  • Targets the known pain points of MCP's eager schema injection, as documented in multiple practitioner audits.
  • Shares its lineage with RAG-style tool retrieval and embedding-based tool selection (e.g., ToolLLM, Gorilla), but operates at the protocol/middleware layer and adds precondition gating and two-phase lazy loading.

Open Questions

  • The end-to-end metrics are only projections; what do success rate and reasoning quality look like on live LLM agents?
  • How do the choices of top-k and summary-pool size affect recall of long-tail tools and the rate of false gating?
  • What is the cost of maintaining preconditions for the state gate, and how robust is it to schema drift across servers?
  • On adversarial or multi-hop tasks, do ISO embeddings suffice to capture deeper intent, or is learnable gating needed?

Figures

Figure 1: page 2 (rendered)

Figure 2: page 3 (rendered)

Figure 3: page 4 (rendered)


Original Abstract

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the "MCP Tax" or "Tools Tax") that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k → 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention