ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

arXiv: 2604.19856 · PDF

作者: Cagri Eryilmaz

主分类: cs.AR · 全部: cs.AI, cs.AR, cs.LG

命中关键词: large language model, llm, agent, agentic, multi-agent, retrieval, rag, reasoning

TL;DR

ChipCraftBrain 用多 agent 编排加符号-神经混合推理做 RTL 生成，在 VerilogEval-Human 达到 97.2% pass@1，在 CVDP 子集达 94.7%，并成功跑通 RISC-V SoC 分层生成。

核心观点

单次生成 RTL 正确率仅 60-65%，现有多 agent（MAGE）在更难的工业基准 CVDP 上未验证且成本高。
提出 validation-first 的多 agent 框架，结合 PPO 自适应编排、符号-神经混合、知识检索与层次化分解。
在 VerilogEval-Human 和 CVDP 非 agentic 子集上均取得 SOTA 级结果，且 per-problem 调用次数比 ACE-RTL 少约 30 倍。

方法

四项创新：

自适应编排：168 维状态上用 PPO 策略调度 6 个专用 agent，另外评估了 world-model MPC planner 作为替代。
符号-神经混合：K-map、真值表等用算法精确求解，波形时序和通用 RTL 交给专用 agent。
知识增强生成：321 个模式 + 971 个开源参考实现，使用 focus-aware retrieval。
层次化规格分解：依赖排序的子模块 + 接口同步，支持大型 SoC。

实验

基准：VerilogEval-Human（156 题，7 次运行）；NVIDIA CVDP 的 302 题非 agentic 子集（5 类任务，3 次运行）；RISC-V SoC 案例。
基线：MAGE、ChipAgents（自报）、CVDP 单次基线、NVIDIA ACE-RTL。
指标：pass@1、lint、FPGA 验证、per-problem 尝试次数。

结果

VerilogEval-Human：mean pass@1 = 97.2%（96.15-98.72%，最佳 154/156），与 ChipAgents 97.4% 持平，高于 MAGE 95.9%。
CVDP 子集：94.7%（286/302），每类比单次基线提升 36-60 百分点；在与 ACE-RTL 共享的 4 类中赢 3 类，调用次数约 1/30。
RISC-V SoC：分层分解生成 8/8 模块（689 LOC）lint 通过，FPGA 验证成功；单体生成完全失败。

为什么重要

为硬件 EDA 提供了首个在 CVDP 这类工业级基准上站得住脚、成本更低的 agentic RTL 生成方案，显示符号求解 + LLM 混合和分层规格分解能把 LLM 推向真实芯片设计流程。

与已有工作的关系

多 agent RTL：MAGE、ChipAgents。
工业基准：NVIDIA CVDP、ACE-RTL。
LLM for HDL：VerilogEval 系列、单次生成基线。
方法学：PPO 编排、world-model MPC、RAG、程序分解。

尚未回答的问题

在 CVDP 的 agentic（需交互/testbench）子集与完整集上的表现如何。
PPO 编排相对固定流水线的消融收益、以及 MPC planner 的优劣对比。
321 模式 + 971 参考的覆盖偏差与污染风险，是否能推广到非公开 IP。
FPGA 验证仅限小型 RISC-V SoC，扩展到多核、缓存、异步域等是否仍可行。
综合、时序收敛、功耗感知仍未纳入闭环。

论文图表

图 1: Page 2 (rendered)

图 1

图 2: Page 3 (rendered)

图 2

图 3: Page 4 (rendered)

图 3

原始摘要

Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA’s CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA’s ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.