arXiv: 2604.19856 · PDF

Authors: Cagri Eryilmaz

Primary category: cs.AR · all: cs.AI, cs.AR, cs.LG

Matched keywords: large language model, llm, agent, agentic, multi-agent, retrieval, rag, reasoning


TL;DR

ChipCraftBrain is a multi-agent RTL generation framework combining PPO-driven orchestration, symbolic-neural reasoning, and knowledge retrieval. It hits 97.2% pass@1 on VerilogEval-Human and 94.7% on a 302-problem CVDP subset, outperforming MAGE and matching ChipAgents while using far fewer attempts than NVIDIA’s ACE-RTL.

Key Ideas

  • Adaptive orchestration of six specialized agents via a PPO policy over a 168-dim state (with an MPC world-model alternative).
  • Hybrid symbolic-neural architecture: algorithmic solvers for K-maps/truth tables, neural agents for waveforms and general RTL.
  • Knowledge-augmented retrieval from 321 patterns + 971 open-source reference implementations with focus-aware lookup.
  • Hierarchical spec decomposition into dependency-ordered sub-modules with interface synchronization.

Approach

A controller learns (PPO) to route tasks among six agents depending on problem state. Symbolic solvers handle combinational logic exactly; neural agents handle timing/waveforms. A retrieval module injects reference patterns. Complex specs are decomposed hierarchically with cross-module interface synchronization before code generation and validation.

Experiments

  • VerilogEval-Human: pass@1, 7 runs.
  • CVDP non-agentic subset: 302 problems, 5 task categories, 3 runs; compared with single-shot baseline and NVIDIA’s ACE-RTL.
  • RISC-V SoC case study: 8-module hierarchical decomposition, lint + FPGA validation.
  • Baselines: MAGE, ChipAgents (self-reported), ACE-RTL, published single-shot.

Results

  • VerilogEval-Human: 97.2% mean pass@1 (96.15-98.72%), best 154/156; ties ChipAgents, beats MAGE 95.9%.
  • CVDP subset: 94.7% (286/302), +36-60 pp per category over single-shot; leads 3/4 shared categories with ACE-RTL at ~30× fewer attempts.
  • RISC-V SoC: 8/8 lint-passing modules (689 LOC), FPGA-validated; monolithic baseline fails outright.

Why It Matters

Pushes LLM-driven RTL design from toy benchmarks toward industrial viability (CVDP, real SoC), showing that symbolic offloading plus RL-based orchestration can both raise accuracy and cut API cost—relevant for agent designers budgeting compute and infra teams exploring chip design automation.

Connections to Prior Work

  • Multi-agent RTL: MAGE, ChipAgents, NVIDIA ACE-RTL.
  • Benchmarks: VerilogEval, CVDP.
  • Symbolic-neural hybrids for program synthesis.
  • PPO-based agent orchestration and MPC/world-model planning.
  • Retrieval-augmented code generation.

Open Questions

  • How robust is the PPO policy under distribution shift to unseen IP styles?
  • Performance on agentic CVDP tasks (explicitly excluded here) and larger SoCs beyond RISC-V.
  • Cost/latency breakdown vs. MAGE and ChipAgents at equal budgets.
  • Sensitivity to the 321+971 knowledge base—does it generalize without curated patterns?
  • Independent reproduction of ChipAgents’ self-reported 97.4%.

Figures

Figure 1: Page 2 (rendered)

Figure 1

Figure 2: Page 3 (rendered)

Figure 2

Figure 3: Page 4 (rendered)

Figure 3


Original abstract

Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA’s CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA’s ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.