arXiv: 2604.19856 · PDF
Authors: Cagri Eryilmaz
Primary category: cs.AR · all: cs.AI, cs.AR, cs.LG
Matched keywords: large language model, llm, agent, agentic, multi-agent, retrieval, rag, reasoning
TL;DR
ChipCraftBrain is a multi-agent RTL generation framework combining PPO-driven orchestration, symbolic-neural reasoning, and knowledge retrieval. It hits 97.2% pass@1 on VerilogEval-Human and 94.7% on a 302-problem CVDP subset, outperforming MAGE and matching ChipAgents while using far fewer attempts than NVIDIA’s ACE-RTL.
Key Ideas
- Adaptive orchestration of six specialized agents via a PPO policy over a 168-dim state (with an MPC world-model alternative); see the routing sketch after this list.
- Hybrid symbolic-neural architecture: algorithmic solvers for K-maps/truth tables, neural agents for waveforms and general RTL.
- Knowledge-augmented retrieval from 321 patterns + 971 open-source reference implementations with focus-aware lookup.
- Hierarchical spec decomposition into dependency-ordered sub-modules with interface synchronization.
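A minimal sketch of what the PPO routing in the first bullet might look like: a small actor-critic network maps the 168-dim problem state to a distribution over the six agents. The agent names, feature encoding, and layer sizes below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of PPO-style agent routing (illustrative; not the paper's code).
# Assumed from the paper: six agent roles and a 168-dim state; the agent
# names, feature layout, and network sizes here are invented.
import torch
import torch.nn as nn
from torch.distributions import Categorical

AGENTS = ["spec_parser", "symbolic_solver", "waveform_agent",
          "rtl_generator", "retriever", "verifier"]  # hypothetical names
STATE_DIM = 168  # reported state dimension

class RoutingPolicy(nn.Module):
    """Actor-critic head used by PPO to pick the next agent."""
    def __init__(self, state_dim=STATE_DIM, n_agents=len(AGENTS)):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, 256), nn.Tanh(),
                                      nn.Linear(256, 128), nn.Tanh())
        self.actor = nn.Linear(128, n_agents)   # logits over agents
        self.critic = nn.Linear(128, 1)         # state value for advantage estimation

    def forward(self, state):
        h = self.backbone(state)
        return Categorical(logits=self.actor(h)), self.critic(h)

def route(policy, state):
    """Sample the next agent; keep log-prob and value for the PPO update."""
    dist, value = policy(state)
    action = dist.sample()
    return AGENTS[action.item()], dist.log_prob(action), value

policy = RoutingPolicy()
state = torch.randn(STATE_DIM)   # stand-in for encoded problem features
agent, logp, value = route(policy, state)
print(f"dispatching to: {agent}")
```

In a full PPO loop the collected log-probabilities and values would feed the clipped surrogate objective; the paper also evaluates an MPC planner over a learned world model as an alternative to this learned policy.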
Approach
A PPO-trained controller learns to route tasks among the six specialized agents based on the current problem state. Symbolic solvers handle combinational logic exactly; neural agents handle timing and waveform behavior. A retrieval module injects reference patterns from the knowledge base. Complex specs are decomposed hierarchically, with cross-module interface synchronization, before code generation and validation.
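To make the symbolic path concrete, the sketch below turns a truth-table specification into a sum-of-products Verilog assign purely algorithmically. This is a plain sum-of-minterms expansion for illustration; whether the paper's solver additionally minimizes via K-maps/Quine-McCluskey is not spelled out here.

```python
# Illustrative symbolic path: truth table -> sum-of-products Verilog assign.
# A real K-map/Quine-McCluskey solver would minimize the expression first;
# this sketch just expands the minterms exactly.

def truth_table_to_verilog(inputs, output, rows):
    """rows: dict mapping input bit-tuples to the output bit."""
    terms = []
    for bits, out in rows.items():
        if out != 1:
            continue
        literals = [name if b else f"~{name}" for name, b in zip(inputs, bits)]
        terms.append("(" + " & ".join(literals) + ")")
    expr = " | ".join(terms) if terms else "1'b0"
    return f"assign {output} = {expr};"

# Example: 2-to-1 mux (y = sel ? b : a) given as a truth table over (sel, a, b).
mux_rows = {(0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 1, (0, 1, 1): 1,
            (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 0, (1, 1, 1): 1}
print(truth_table_to_verilog(["sel", "a", "b"], "y", mux_rows))
# -> assign y = (~sel & a & ~b) | (~sel & a & b) | (sel & ~a & b) | (sel & a & b);
```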
Experiments
- VerilogEval-Human: pass@1, 7 runs.
- CVDP non-agentic subset: 302 problems, 5 task categories, 3 runs; compared with single-shot baseline and NVIDIA’s ACE-RTL.
- RISC-V SoC case study: 8-module hierarchical decomposition (dependency ordering sketched after this list), lint + FPGA validation.
- Baselines: MAGE, ChipAgents (self-reported), ACE-RTL, published single-shot.
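The SoC case study depends on generating sub-modules in dependency order. Here is a minimal sketch of that ordering step over a hypothetical module graph; it happens to use eight modules, but the names and edges are invented, not the paper's decomposition.

```python
# Illustrative dependency-ordered generation for a hierarchical decomposition.
# Module names and edges are hypothetical; only the 8-module count echoes the paper.
from graphlib import TopologicalSorter

# module -> submodules it instantiates (must be generated/verified first)
deps = {
    "soc_top":    {"cpu_core", "bus_fabric", "uart", "sram_ctrl"},
    "cpu_core":   {"alu", "regfile", "decoder"},
    "bus_fabric": set(),
    "uart": set(), "sram_ctrl": set(),
    "alu": set(), "regfile": set(), "decoder": set(),
}

def generate_module(name, generated):
    # Placeholder for the per-module flow: retrieve patterns, generate RTL,
    # lint, and synchronize interfaces with already-generated submodules.
    print(f"generate {name} (interfaces locked: {sorted(generated & deps[name])})")

generated = set()
for module in TopologicalSorter(deps).static_order():  # leaves first
    generate_module(module, generated)
    generated.add(module)
```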
Results
- VerilogEval-Human: 97.2% mean pass@1 (96.15-98.72% across 7 runs), best run 154/156; on par with ChipAgents' self-reported 97.4%, ahead of MAGE's 95.9%.
- CVDP subset: 94.7% mean pass@1 (286/302), a 36-60 pp lift per category over the single-shot baseline; leads ACE-RTL in 3 of 4 shared categories while using ~30× fewer per-problem attempts.
- RISC-V SoC: 8/8 lint-passing modules (689 LOC), FPGA-validated; monolithic baseline fails outright.
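For reference, the headline fractions above check out arithmetically; this is a trivial sanity check of the reported numbers, not a reproduction of the runs.

```python
# Quick check of the reported fractions (values taken from the paper's summary).
best_verilogeval = 154 / 156      # best run on VerilogEval-Human
cvdp_mean        = 286 / 302      # mean over 3 runs on the CVDP subset
print(f"{best_verilogeval:.2%}")  # 98.72%
print(f"{cvdp_mean:.2%}")         # 94.70%
```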
Why It Matters
Pushes LLM-driven RTL design from toy benchmarks toward industrial viability (CVDP, a real SoC), showing that symbolic offloading plus RL-based orchestration can raise accuracy while cutting API cost. That trade-off matters both to agent designers budgeting compute and to infra teams exploring chip-design automation.
Connections to Prior Work
- Multi-agent RTL: MAGE, ChipAgents, NVIDIA ACE-RTL.
- Benchmarks: VerilogEval, CVDP.
- Symbolic-neural hybrids for program synthesis.
- PPO-based agent orchestration and MPC/world-model planning.
- Retrieval-augmented code generation.
Open Questions
- How robust is the PPO policy under distribution shift to unseen IP styles?
- Performance on agentic CVDP tasks (explicitly excluded here) and larger SoCs beyond RISC-V.
- Cost/latency breakdown vs. MAGE and ChipAgents at equal budgets.
- Sensitivity to the 321+971 knowledge base—does it generalize without curated patterns?
- Independent reproduction of ChipAgents’ self-reported 97.4%.
Figures
Figures 1-3: rendered page images from pages 2-4 of the PDF.
Original abstract
Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA’s CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA’s ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.