2026-04-21 Paper Digest

Automated digest of 10 arXiv papers on agent / LLM / AI infra submitted in the last 24h, analysed with Claude Code.

1. ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

arXiv: 2604.19856 · cs.AR · relevance score 27

ChipCraftBrain is a multi-agent RTL generation framework combining PPO-driven orchestration, symbolic-neural reasoning, and knowledge retrieval. It hits 97.2% pass@1 on VerilogEval-Human and 94.7% on a 302-problem CVDP subset, outperforming MAGE and matching ChipAgents while using far fewer attempts than NVIDIA’s ACE-RTL.

Read detailed analysis →


2. GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

arXiv: 2604.19398 · cs.AI · relevance score 26

GRASPrune is a post-pretraining structured pruning framework that jointly prunes FFN channels and KV head groups under a single global budget using projected straight-through gate learning, producing a smaller dense checkpoint without fine-tuning the backbone.
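The "single global budget" idea can be sketched with a toy projection step: score every prunable unit (FFN channels and KV head groups alike), then keep the best units until a shared cost budget runs out. This is a simplified, hypothetical stand-in for one piece of the method; the soft gates, straight-through gradients, and exact projection are defined in the paper.

```python
import numpy as np

def project_to_budget(scores, costs, budget):
    """Hard 0/1 keep-mask under a single global cost budget.

    Greedy selection by score-per-cost -- a simplified stand-in for the
    projection step in projected straight-through gate learning.
    """
    order = np.argsort(-scores / costs, kind="stable")  # best value first
    mask = np.zeros_like(scores)
    spent = 0.0
    for i in order:
        if spent + costs[i] <= budget:
            mask[i] = 1.0
            spent += costs[i]
    return mask

# Toy example: 4 FFN channels and 2 KV head groups share one budget;
# KV head groups are costlier to keep than individual channels.
scores = np.array([0.9, 0.1, 0.5, 0.4, 0.8, 0.2])
costs = np.array([1.0, 1.0, 1.0, 1.0, 2.0, 2.0])
mask = project_to_budget(scores, costs, budget=4.0)
```

Jointly ranking heterogeneous units under one budget is what lets the method trade FFN width against KV heads instead of pruning each pool separately.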

Read detailed analysis →


3. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

arXiv: 2604.19299 · cs.CL · relevance score 22

This paper presents the first large-scale empirical study of sub-10B open-source SLMs across three deployment paradigms: base, single-agent with tools, and multi-agent collaboration. It finds that single-agent systems offer the best cost/performance balance, while multi-agent setups add overhead with limited gains.

Read detailed analysis →


4. A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

arXiv: 2604.19689 · cs.AI · relevance score 21

A-MAR is an agent-based multimodal retrieval framework that decomposes artwork queries into structured reasoning plans, then conditions retrieval on each step to produce grounded, interpretable explanations. It outperforms static retrieval and MLLM baselines on SemArt, Artpedia, and a new ArtCoT-QA benchmark.

Read detailed analysis →


5. Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

arXiv: 2604.20022 · cs.LG · relevance score 20

BMBE splits medical dialogue into an LLM “sensor” that parses utterances and a deterministic Bayesian engine that handles all diagnostic inference, yielding calibrated, private, and robust diagnoses while beating frontier standalone LLMs at a fraction of the cost.
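The division of labour can be sketched in a few lines: the LLM only emits structured findings, and every belief update is a plain Bayes rule computation the engine owns. The condition names and probabilities below are hypothetical toy values, not from the paper.

```python
def bayes_update(priors, likelihoods):
    """One deterministic posterior update: P(d | e) ~ P(e | d) * P(d).

    The Bayesian engine, not the LLM, performs this step, so the
    resulting probabilities are calibrated by construction.
    """
    unnorm = {d: priors[d] * likelihoods[d] for d in priors}
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()}

# The LLM "sensor" would parse an utterance into a finding like
# {"fever": True}; the engine then updates its belief over conditions.
priors = {"flu": 0.2, "cold": 0.5, "covid": 0.3}
p_fever_given = {"flu": 0.9, "cold": 0.4, "covid": 0.8}
posterior = bayes_update(priors, p_fever_given)
```

Because inference lives outside the LLM, the same posterior follows from the same findings every time, which is where the calibration and privacy claims come from.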

Read detailed analysis →


6. If you’re waiting for a sign… that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

arXiv: 2604.19844 · cs.CV · relevance score 20

This paper identifies “trust boundary confusion” in Vision-Language Agentic Systems (VLAS), where agents fail to distinguish legitimate environmental signals (e.g., traffic lights) from adversarial visual injections. The authors propose a multi-agent defense that separates perception from decision-making, improving robustness while preserving responsiveness to genuine cues.

Read detailed analysis →


7. SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

arXiv: 2604.19157 · cs.LG · relevance score 20

SAW-INT4 proposes token-wise INT4 KV-cache quantization with block-diagonal Hadamard rotation, the simplest scheme compatible with paged memory and fused attention in real LLM serving. A fused rotation-quantization kernel matches plain INT4 throughput while recovering nearly all accuracy lost to naive INT4.
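A minimal NumPy sketch of the two ingredients, rotate then quantize: a small Hadamard block mixes each slice of the head dimension so outliers spread out, and each token row gets one FP scale for symmetric INT4. This illustrates the scheme's shape only; the fused kernel and the paper's exact parameterisation are assumptions left to the source.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester construction (n = 2^k)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_quantize_int4(kv, block=4):
    """Block-diagonal Hadamard rotation along the head dim, then
    token-wise symmetric INT4 quantization (one scale per token row)."""
    H = hadamard(block)
    tokens, d = kv.shape
    rot = (kv.reshape(tokens, d // block, block) @ H.T).reshape(tokens, d)
    scale = np.abs(rot).max(axis=1, keepdims=True) / 7.0
    q = np.round(rot / scale).clip(-8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale, block=4):
    """Invert: dequantize, then undo the orthonormal rotation."""
    H = hadamard(block)
    tokens, d = q.shape
    deq = q.astype(np.float64) * scale
    return (deq.reshape(tokens, d // block, block) @ H).reshape(tokens, d)

rng = np.random.default_rng(0)
kv = rng.standard_normal((2, 8))
q, scale = rotate_quantize_int4(kv)
recon = dequantize_int4(q, scale)
```

The block-diagonal structure is the point: rotating only small fixed-size blocks keeps the transform local to each page of KV cache, which is what makes it compatible with paged memory and fusable into the attention kernel.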

Read detailed analysis →


8. Detoxification for LLM: From Dataset Itself

arXiv: 2604.19124 · cs.CL · relevance score 20

The paper proposes HSPD, a pipeline that detoxifies LLM pretraining corpora at the source by rewriting toxic spans with a Soft Contrastive Decoding (SoCD) method, yielding a drop-in replacement dataset that cuts downstream model toxicity while preserving semantics.
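For readers unfamiliar with the contrastive-decoding family SoCD belongs to, here is the generic idea in a few lines: subtract an "anti-expert" model's log-probabilities (here, a toxicity-prone distribution) from the expert's, so tokens the anti-expert loves get demoted. This is a sketch of standard contrastive decoding, not SoCD's soft variant, whose exact formulation is in the paper.

```python
import numpy as np

def contrastive_logits(expert, anti_expert, alpha=1.0):
    """Generic contrastive decoding over one vocabulary position:
    score = log p_expert - alpha * log p_anti_expert."""
    log_p = expert - np.logaddexp.reduce(expert)        # log-softmax
    log_q = anti_expert - np.logaddexp.reduce(anti_expert)
    return log_p - alpha * log_q

# Toy vocab of 3 tokens: token 0 is favoured by both models, but far
# more strongly by the anti-expert, so the contrast demotes it.
expert = np.array([2.0, 1.0, 0.0])
anti_expert = np.array([3.0, 0.0, 0.0])
steered = contrastive_logits(expert, anti_expert)
```

Applied at rewrite time rather than inference time, this kind of steering lets HSPD fix the corpus once and ship a drop-in dataset, instead of paying a decoding-time tax on every downstream model.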

Read detailed analysis →


9. TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only

arXiv: 2604.19070 · cs.CL · relevance score 20

TRN-R1-Zero is a post-training framework that uses reinforcement learning alone to teach base LLMs to reason over text-rich networks, avoiding supervised fine-tuning or distillation while generalising across node, edge, and graph-level tasks.

Read detailed analysis →


10. Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

arXiv: 2604.19533 · cs.CR · relevance score 19

Cyber Defense Benchmark evaluates LLM agents on open-ended threat hunting over raw Windows logs via iterative SQL queries. Across five frontier models, all fail dramatically: the best (Claude Opus 4.6) flags only 3.8% of malicious events, and none meet the ≥50% per-tactic recall bar for unsupervised SOC deployment.
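The per-tactic recall bar is worth making concrete: an agent must flag at least half of the malicious events within every tactic, so one well-covered tactic cannot compensate for a missed one. The helper and event tuples below are illustrative, not the benchmark's code.

```python
from collections import defaultdict

def per_tactic_recall(events):
    """events: (tactic, was_flagged) pairs over malicious events only.
    Returns recall per tactic; the deployment bar described in the
    benchmark requires >= 0.5 for every tactic."""
    hit, total = defaultdict(int), defaultdict(int)
    for tactic, flagged in events:
        total[tactic] += 1
        hit[tactic] += int(flagged)
    return {t: hit[t] / total[t] for t in total}

events = [
    ("persistence", True),
    ("persistence", False),
    ("exfiltration", False),
]
recall = per_tactic_recall(events)
deployable = all(r >= 0.5 for r in recall.values())
```

Under this metric an agent that aces persistence hunting but misses every exfiltration event still fails, which is why none of the five models clear the bar.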

Read detailed analysis →