2026-04-21 Paper Digest
Automated digest of 10 arXiv papers on agent / LLM / AI infra submitted in the last 24h, analysed with Claude Code.
1. ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration
arXiv: 2604.19856 · cs.AR · relevance score 27
ChipCraftBrain is a multi-agent RTL generation framework combining PPO-driven orchestration, symbolic-neural reasoning, and knowledge retrieval. It hits 97.2% pass@1 on VerilogEval-Human and 94.7% on a 302-problem CVDP subset, outperforming MAGE and matching ChipAgents while using far fewer attempts than NVIDIA’s ACE-RTL.
2. GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
arXiv: 2604.19398 · cs.AI · relevance score 26
GRASPrune is a post-pretraining structured pruning framework that jointly prunes FFN channels and KV head groups under a single global budget using projected straight-through gate learning, producing a smaller dense checkpoint without fine-tuning the backbone.
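The digest mentions projected straight-through gate learning under a global budget. As a rough illustration only (the function name, the plain SGD update, and the top-k projection below are my assumptions, not the paper's method), one projected straight-through step on a vector of channel gates might look like:

```python
import numpy as np

def ste_gate_step(scores, grads, lr=0.1, budget=4):
    """One projected straight-through update on channel-gate scores.

    Forward: binarized gates (top-`budget` scores kept).
    Backward (straight-through): the gradient flows to the underlying
    real-valued scores as if the binarization were the identity.
    Projection: after the update, exactly `budget` gates stay open,
    enforcing the global budget.
    """
    # Straight-through update on the real-valued scores.
    scores = scores - lr * grads
    # Project onto the budget: keep the `budget` largest scores.
    keep = np.argsort(scores)[-budget:]
    gates = np.zeros_like(scores)
    gates[keep] = 1.0
    return scores, gates
```

The appeal of a single global budget (versus per-layer ratios) is that the projection can trade FFN channels against KV head groups freely, which is what the paper's joint formulation targets.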
3. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
arXiv: 2604.19299 · cs.CL · relevance score 22
This paper presents the first large-scale empirical study of sub-10B open-source SLMs across three deployment paradigms—base, single-agent with tools, and multi-agent collaboration—finding that single-agent systems offer the best cost/performance balance while multi-agent setups add overhead with limited gains.
4. A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
arXiv: 2604.19689 · cs.AI · relevance score 21
A-MAR is an agent-based multimodal retrieval framework that decomposes artwork queries into structured reasoning plans, then conditions retrieval on each step to produce grounded, interpretable explanations. It outperforms static retrieval and MLLM baselines on SemArt, Artpedia, and a new ArtCoT-QA benchmark.
5. Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine
arXiv: 2604.20022 · cs.LG · relevance score 20
BMBE splits medical dialogue handling into an LLM “sensor” that parses utterances and a deterministic Bayesian engine that performs all diagnostic inference, yielding calibrated, private, and robust diagnoses that beat frontier standalone LLMs at a fraction of the cost.
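The core of a deterministic Bayesian engine is just repeated application of Bayes' rule over the findings the LLM extracts. A minimal sketch, assuming a naive-Bayes factorization and toy priors/likelihoods (none of these numbers or names come from the paper):

```python
def update_belief(prior, likelihood, finding):
    """One Bayesian update: P(d | finding) ∝ P(finding | d) · P(d).

    prior:       dict disease -> current probability.
    likelihood:  dict disease -> P(symptom present | disease).
    finding:     True if the symptom was reported present, False if absent.
    """
    post = {}
    for d, p in prior.items():
        like = likelihood[d] if finding else 1.0 - likelihood[d]
        post[d] = like * p
    z = sum(post.values())  # normalizing constant
    return {d: v / z for d, v in post.items()}
```

Because the engine is deterministic and separate from the LLM, its posteriors stay calibrated regardless of how the "sensor" phrases its extractions, which is the modularity argument the paper makes.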
6. If you’re waiting for a sign… that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems
arXiv: 2604.19844 · cs.CV · relevance score 20
This paper identifies “trust boundary confusion” in Vision-Language Agentic Systems (VLAS), where agents fail to distinguish legitimate environmental signals (e.g., traffic lights) from adversarial visual injections. The authors propose a multi-agent defense that separates perception from decision-making, improving robustness while preserving responsiveness to genuine cues.
7. SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
arXiv: 2604.19157 · cs.LG · relevance score 20
SAW-INT4 proposes token-wise INT4 KV-cache quantization with block-diagonal Hadamard rotation, the simplest scheme compatible with paged memory and fused attention in real LLM serving. A fused rotation-quantization kernel matches plain INT4 throughput while recovering nearly all accuracy lost to naive INT4.
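A block-diagonal Hadamard rotation spreads per-channel outliers across a small block before quantization, so a single per-token scale wastes fewer bits. A numpy sketch of the idea (block size, symmetric-scale choice, and function names are my assumptions; the paper's fused kernel obviously does this in-register):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n×n Hadamard matrix
    (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_quant_int4(x, block=4):
    """Block-diagonal Hadamard rotation, then per-token symmetric
    INT4 quantization of a [tokens, dim] KV-cache slice."""
    tokens, dim = x.shape
    H = hadamard(block)
    # Rotate each contiguous group of `block` channels independently.
    xr = (x.reshape(tokens, dim // block, block) @ H.T).reshape(tokens, dim)
    # Per-token symmetric INT4: integer levels in [-8, 7].
    scale = np.abs(xr).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(xr / scale), -8, 7)
    return q.astype(np.int8), scale
```

Keeping the rotation block-diagonal (rather than a full `dim`-wide rotation) is what preserves compatibility with paged KV layouts: each page's channels can be rotated and dequantized without touching its neighbours.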
8. Detoxification for LLM: From Dataset Itself
arXiv: 2604.19124 · cs.CL · relevance score 20
The paper proposes HSPD, a pipeline that detoxifies LLM pretraining corpora at the source by rewriting toxic spans with a Soft Contrastive Decoding (SoCD) method, yielding a drop-in replacement dataset that cuts downstream model toxicity while preserving semantics.
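The digest does not spell out what makes SoCD "soft", but plain contrastive decoding, which it presumably extends, scores each candidate token by the base model's logit minus a penalty from a toxicity-conditioned model. A deliberately generic sketch (the `alpha` weighting and hard subtraction are illustrative, not the paper's rule):

```python
def contrastive_pick(base_logits, toxic_logits, alpha=0.5):
    """Return the index of the token maximizing the base model's
    score minus `alpha` times a toxicity-conditioned model's score.

    Plain contrastive decoding; SoCD's "soft" variant is not
    described in this digest, so treat this as a baseline sketch.
    """
    scores = [b - alpha * t for b, t in zip(base_logits, toxic_logits)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `alpha = 0` this degenerates to ordinary greedy decoding from the base model, which makes the penalty strength easy to ablate.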
9. TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
arXiv: 2604.19070 · cs.CL · relevance score 20
TRN-R1-Zero is a post-training framework that uses reinforcement learning alone to teach base LLMs to reason over text-rich networks, avoiding supervised fine-tuning or distillation while generalising across node, edge, and graph-level tasks.
10. Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
arXiv: 2604.19533 · cs.CR · relevance score 19
Cyber Defense Benchmark evaluates LLM agents on open-ended threat hunting over raw Windows logs via iterative SQL queries. Across five frontier models, all fail dramatically: the best (Claude Opus 4.6) flags only 3.8% of malicious events, and none meet the ≥50% per-tactic recall bar for unsupervised SOC deployment.
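The benchmark's agents hunt by iterating SQL over raw logs. A minimal sketch of one query round, using an in-memory SQLite table with a made-up schema and a toy "encoded PowerShell" heuristic (the benchmark's real log format and queries are not given in this digest):

```python
import sqlite3

# Toy Windows-style event table; columns and sample rows are invented.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE events (
    ts TEXT, host TEXT, event_id INTEGER, command_line TEXT)""")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2026-04-20T10:00:00", "ws1", 4688, "notepad.exe"),
        ("2026-04-20T10:05:00", "ws1", 4688,
         "powershell.exe -EncodedCommand SQBFAFgA"),
        ("2026-04-20T10:06:00", "ws2", 4624, "-"),
    ],
)

# One hunt iteration: flag process-creation events (ID 4688) whose
# command line looks like encoded PowerShell.
hits = con.execute(
    """SELECT ts, host, command_line FROM events
       WHERE event_id = 4688
         AND command_line LIKE '%-EncodedCommand%'"""
).fetchall()
```

The agent's job in the benchmark is to invent and refine queries like this one over many rounds; the 3.8% recall figure suggests current models rarely converge on discriminative filters.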