arXiv: 2604.21816 · PDF

Authors: Anuj Sadani, Deepak Kumar

Primary category: cs.AI · All categories: cs.AI

Matched keywords: large language model, llm, agent, agentic, reasoning, attention, latency


TL;DR

Proposes Tool Attention, a middleware layer that combines intent-schema embedding similarity, state-aware gating, and two-phase lazy loading to cut MCP's per-turn tool-token overhead by 95%, mitigating the "MCP Tax".

Key Ideas

  • MCP's eager schema injection consumes 10k–60k tokens per turn in multi-server deployments, inflating the KV cache and triggering reasoning degradation at around 70% context utilization (a back-of-the-envelope sketch of this overhead follows the list).
  • The "Attention Is All You Need" paradigm can be generalized from token-level self-attention to tool-level gated attention.
  • The bottleneck for scalable agents is protocol-layer efficiency, not raw context length.
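
To make the scale of the eager-injection overhead concrete, here is a rough Python sketch. The server names and per-server token counts are illustrative assumptions (chosen to sum to the benchmark's 47.3k-token baseline), not per-server figures reported in the paper.

```python
# Rough model of the per-turn "MCP Tax" under eager schema injection:
# every connected server's full tool schemas are re-serialized into the
# prompt on every turn, whether or not any of its tools are relevant.
# Server names and token counts below are illustrative assumptions.
SCHEMA_TOKENS_PER_SERVER = {
    "filesystem": 6_200,
    "github": 11_500,
    "slack": 7_800,
    "postgres": 9_100,
    "browser": 8_400,
    "search": 4_300,
}

def eager_overhead(servers: dict[str, int]) -> int:
    # Eager injection pays the full schema cost of every server, every turn.
    return sum(servers.values())

per_turn = eager_overhead(SCHEMA_TOKENS_PER_SERVER)
print(f"per-turn tool tokens: {per_turn:,}")              # 47,300 -- within 10k-60k
print(f"20-turn session cost: {20 * per_turn:,} tokens")  # 946,000
```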

Method

Tool Attention consists of three components (sketched in code below):

  • ISO (Intent Schema Overlap) score: measures the semantic overlap between the user's intent and each tool's schema via sentence embeddings.
  • State-aware gating: enforces preconditions and access scopes, filtering out tools that are unusable in the current state.
  • Two-phase lazy schema loader: keeps only a compact pool of schema summaries in context and promotes the full JSON schema only for the top-k tools that pass the gate.

Tool Attention is deployed as middleware; the LLM itself is left unmodified.
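
A minimal sketch of the pipeline, assuming sentence-transformers for the embeddings; the names (`Tool`, `iso_score`, `gate`, `select_tools`), the embedding model, and the interfaces are all illustrative, not the paper's actual API:

```python
# Illustrative sketch of the three Tool Attention stages; names and
# interfaces are assumptions, not the paper's implementation.
from dataclasses import dataclass, field

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

@dataclass
class Tool:
    name: str
    summary: str                 # compact summary kept in context (phase 1)
    full_schema: dict            # full JSON schema, promoted lazily (phase 2)
    preconditions: set[str] = field(default_factory=set)
    scopes: set[str] = field(default_factory=set)

def iso_score(intent: str, tool: Tool) -> float:
    """Intent Schema Overlap: embedding similarity of intent vs. tool summary."""
    e_i = model.encode(intent, convert_to_tensor=True)
    e_t = model.encode(tool.summary, convert_to_tensor=True)
    return util.cos_sim(e_i, e_t).item()

def gate(tool: Tool, state: set[str], granted_scopes: set[str]) -> bool:
    """State-aware gating: all preconditions hold and all scopes are granted."""
    return tool.preconditions <= state and tool.scopes <= granted_scopes

def select_tools(intent: str, tools: list[Tool], state: set[str],
                 granted_scopes: set[str], k: int = 3) -> list[dict]:
    """Two-phase lazy loading: rank gated tools by ISO, promote only top-k schemas."""
    usable = [t for t in tools if gate(t, state, granted_scopes)]
    ranked = sorted(usable, key=lambda t: iso_score(intent, t), reverse=True)
    return [t.full_schema for t in ranked[:k]]   # only these enter the prompt
```

Each turn, only the compact summaries plus the k promoted schemas occupy context, which is where the per-turn savings over eager injection come from.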

Experiments

  • Benchmark: a simulated environment with 120 tools across six MCP servers; per-server token counts are calibrated against public audits of real MCP deployments.
  • Metrics: per-turn tool token count and effective context utilization are measured directly; task success, latency, cost, and reasoning quality are reported only as projections.
  • Baseline: default MCP behavior with eager schema injection.

Results

  • Per-turn tool tokens: 47.3k → 2.4k, a 95.0% reduction (see the arithmetic check after this list).
  • Effective context utilization (a token-ratio quantity): 24% → 91%.
  • End-to-end success rate, latency, cost, and reasoning quality are all projections derived from the measured token counts plus published telemetry; the paper marks them as such explicitly, and they are not measured on live LLM agents.
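
As a quick sanity check, the headline reduction follows from the rounded token counts above:

$$\frac{47.3\text{k} - 2.4\text{k}}{47.3\text{k}} = \frac{44.9}{47.3} \approx 0.949$$

i.e. about a 94.9% reduction from the rounded figures, consistent with the reported 95.0% once unrounded token counts are used.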

Why It Matters

For practitioners building agent / MCP infrastructure, the paper offers a middleware that can be integrated directly, mitigating the per-turn token tax and KV-cache bloat of multi-server deployments, postponing the onset of context fracture, and lowering per-turn inference cost.

Relation to Prior Work

  • Carries the attention idea of the Transformer's "Attention Is All You Need" over from tokens to tools, in gated form.
  • Targets the known pain points of MCP's eager schema injection, as documented in multiple practitioner audits.
  • Shares its lineage with RAG-style tool retrieval and embedding-based tool selection (e.g., ToolLLM, Gorilla), but operates at the protocol/middleware layer and adds precondition gating and two-phase lazy loading.

Open Questions

  • The end-to-end metrics are only projections; what do success rate and reasoning quality look like on live LLM agents?
  • How do the choices of top-k and summary-pool size affect recall of long-tail tools and the rate of false gating?
  • What is the cost of maintaining preconditions for the state gate, and how robust is it to schema drift across servers?
  • On adversarial or multi-hop tasks, do ISO embeddings suffice to capture deeper intent, or is learnable gating needed?

Figures

Figure 1: page 2 (rendered)

Figure 2: page 3 (rendered)

Figure 3: page 4 (rendered)


Original Abstract

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the "MCP Tax" or "Tools Tax") that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k → 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention