arXiv: 2604.22577

Authors: Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia

Affiliations: Huawei Technologies, National University of Singapore, University of Science and Technology of China

Primary category: cs.AI · all: cs.AI, cs.CL

Matched keywords: agent, reasoning, inference, serving, quantization, latency


TL;DR

QuantClaw is a plug-and-play precision routing plugin for OpenClaw agent systems that dynamically assigns quantization precision per task, cutting cost by up to 21.4% and latency by 15.7% on GLM-5 versus an FP8 baseline while preserving task quality.

Key Ideas

  • Quantization sensitivity in agent workflows is highly task-dependent, not uniform.
  • Precision should be treated as a dynamic resource, routed per task.
  • A lightweight routing plugin can deliver cost/latency wins without user-facing complexity.
  • NVFP4 degradation follows a power law with model size (diminishing with scale).

Approach

QuantClaw sits in front of OpenClaw as a precision router: it profiles task characteristics, detects task type, and routes lightweight tasks to lower-precision (cheaper) configurations while keeping demanding workloads at higher precision. It consolidates multiple task detectors into an automatic adaptation layer and makes on-the-fly routing decisions, preserving a single user interface.
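The paper does not spell out the router's internals here, but the idea can be sketched in a few lines. Everything below — the task features, the toy detector thresholds, and the precision tiers — is an illustrative assumption, not QuantClaw's actual detectors or configuration:

```python
# Minimal sketch of per-task precision routing, in the spirit of QuantClaw.
# Task fields, thresholds, and the precision map are hypothetical.
from dataclasses import dataclass


@dataclass
class Task:
    prompt_tokens: int   # length of the task's context
    turns: int           # conversation depth so far
    uses_tools: bool     # whether the task invokes tools


def detect_task_type(task: Task) -> str:
    """Toy task detector: classify by interaction depth and context length."""
    if task.turns > 5 or task.uses_tools:
        return "demanding"
    if task.prompt_tokens > 8000:
        return "long_context"
    return "lightweight"


# Hypothetical mapping from task type to quantization precision.
PRECISION_ROUTE = {
    "lightweight": "NVFP4",  # cheapest tier for simple tasks
    "long_context": "FP8",   # mid tier for long but shallow tasks
    "demanding": "FP8",      # keep baseline precision for hard workloads
}


def route(task: Task) -> str:
    """On-the-fly routing decision: task in, precision config out."""
    return PRECISION_ROUTE[detect_task_type(task)]
```

The point of the sketch is the shape of the interface: routing happens per request, behind a single user-facing entry point, so callers never see the precision choice.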

Figure 2

Experiments

  • Platform: OpenClaw autonomous agent system.
  • Model: GLM-5 with an FP8 baseline (plus NVFP4 scaling study across model sizes).
  • Workloads: a range of complex multi-turn agent tasks spanning long-context inputs.
  • Metrics: task performance, monetary cost, inference latency.
  • Baselines: static FP8 quantization; implicit comparison to uniform-precision strategies.

Results

  • Up to 21.4% cost savings and 15.7% latency reduction on GLM-5 over the FP8 baseline.
  • Task performance maintained or improved vs static quantization.
  • NVFP4 quantization degradation shrinks as model size grows, following a power law in log-log space.

Figure 1

Why It Matters

For agent/infra practitioners, this reframes quantization from a global deployment knob to a per-request routing decision. It suggests real savings on long-context, multi-turn agent stacks without retraining, and provides empirical evidence that bigger models tolerate aggressive low-precision formats like NVFP4 better.

Connections to Prior Work

  • Low-precision LLM serving: FP8, NVFP4, and mixed-precision inference.
  • Quantization scaling laws (extending power-law analyses of model behavior to quantization error).
  • Adaptive/routing inference: model cascades, MoE routing, speculative decoding — here applied to precision rather than model choice.
  • Agent frameworks with long-context, multi-turn tool use (OpenClaw-style systems).

Open Questions

  • How is the router trained/calibrated, and does it generalize beyond GLM-5 and OpenClaw?
  • Robustness under distribution shift or adversarial task mixes.
  • Interaction with KV-cache quantization, speculative decoding, and batching.
  • Whether gains hold at smaller model sizes where NVFP4 degradation is larger.
  • Overhead of the routing/detection layer itself at high QPS.

Original abstract

Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long-context inputs and multi-turn reasoning. This results in prohibitively high computational and monetary costs in real-world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task-dependent. Based on this observation, we propose QuantClaw, a plug-and-play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower-cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM-5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.