<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>2026-04-21 Paper Digest on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/posts/2026-04-21/</link>
    <description>Recent content in 2026-04-21 Paper Digest on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 05:22:40 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/posts/2026-04-21/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps</title>
      <link>https://ftxj.github.io/posts/2026-04-21/10-cyber-defense-benchmark-agentic-threat-hunting-evaluation-fo/</link>
      <pubDate>Mon, 27 Apr 2026 05:22:40 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/10-cyber-defense-benchmark-agentic-threat-hunting-evaluation-fo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19533v3&#34;&gt;2604.19533&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19533v3&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Alankrit Chona, Igor Kozlov, Ambuj Kumar&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.AI, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Cyber Defense Benchmark evaluates LLM agents on open-ended threat hunting over raw Windows logs via iterative SQL queries. Across five frontier models, all fail dramatically — the best (Claude Opus 4.6) flags only 3.8% of malicious events, and none meet the &amp;gt;=50% per-tactic recall bar for unsupervised SOC deployment.&lt;/p&gt;
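&lt;p&gt;A minimal sketch of the deployment criterion quoted above: recall computed per tactic over the labelled malicious events, with every tactic required to clear the &amp;gt;=50% bar. The event tuples and helper names (&lt;code&gt;per_tactic_recall&lt;/code&gt;, &lt;code&gt;meets_deployment_bar&lt;/code&gt;) are illustrative assumptions, not the benchmark&amp;rsquo;s actual harness.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;
# Per-tactic recall: of the known-malicious events under each tactic,
# what fraction did the agent flag? The deployment bar requires every
# tactic to clear 50%. Field names here are assumptions.
from collections import defaultdict

def per_tactic_recall(malicious_events, flagged_ids):
    """malicious_events: list of (event_id, tactic); flagged_ids: ids the agent flagged."""
    total, hit = defaultdict(int), defaultdict(int)
    for event_id, tactic in malicious_events:
        total[tactic] += 1
        hit[tactic] += event_id in flagged_ids
    return {t: hit[t] / total[t] for t in total}

def meets_deployment_bar(recalls, bar=0.5):
    return all(r &amp;gt;= bar for r in recalls.values())

events = [(1, "persistence"), (2, "persistence"), (3, "exfiltration")]
recalls = per_tactic_recall(events, flagged_ids={1, 3})
print(recalls, meets_deployment_bar(recalls))
&lt;/code&gt;&lt;/pre&gt;</description>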
    </item>
    <item>
      <title>TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only</title>
      <link>https://ftxj.github.io/posts/2026-04-21/09-trn-r1-zero-text-rich-network-reasoning-via-llms-with-reinfo/</link>
      <pubDate>Mon, 27 Apr 2026 05:21:53 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/09-trn-r1-zero-text-rich-network-reasoning-via-llms-with-reinfo/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19070v1&#34;&gt;2604.19070&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19070v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yilun Liu, Ruihong Qiu, Zi Huang&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, reasoning, chain-of-thought, inference, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;TRN-R1-Zero is a post-training framework that uses reinforcement learning alone to teach base LLMs to reason over text-rich networks, avoiding supervised fine-tuning or distillation while generalising across node, edge, and graph-level tasks.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;RL-only post-training for text-rich network (TRN) reasoning — no SFT, no CoT distillation from larger teachers.&lt;/li&gt;&#xA;&lt;li&gt;Neighbour-aware Group Relative Policy Optimisation (N-GRPO) that shapes rewards via a novel &amp;ldquo;margin gain&amp;rdquo; metric measuring neighbour informativeness.&lt;/li&gt;&#xA;&lt;li&gt;Node-level training transfers zero-shot to edge- and graph-level tasks, beyond typical cross-domain transfer.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The authors extend GRPO with neighbourhood awareness: for each candidate response, rewards are dynamically adjusted by a margin gain metric capturing how much neighbouring node signals contribute to the correct answer, pushing the LLM to actually use relational context rather than text alone. Training applies RL to base LLMs using node-level supervision signals only.&lt;/p&gt;
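&lt;p&gt;A hedged sketch of the reward shaping described above. The group-relative normalisation is standard GRPO; the concrete margin-gain definition (answer confidence with neighbours minus confidence without) and the weight &lt;code&gt;alpha&lt;/code&gt; are assumptions, since the abstract does not spell them out.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;
# N-GRPO-style advantages: scale each sampled response's task reward by a
# margin-gain term measuring how much neighbour context helped, then
# group-normalise as in plain GRPO.
import statistics

def margin_gain(p_with_nbrs, p_without_nbrs):
    # Positive when neighbour signals raise the probability of the correct answer.
    return p_with_nbrs - p_without_nbrs

def n_grpo_advantages(rewards, gains, alpha=1.0):
    shaped = [r * (1.0 + alpha * g) for r, g in zip(rewards, gains)]
    mu = statistics.mean(shaped)
    sd = statistics.pstdev(shaped) or 1.0  # guard against a zero-variance group
    return [(s - mu) / sd for s in shaped]

gains = [margin_gain(0.9, 0.6), margin_gain(0.4, 0.5), margin_gain(0.8, 0.8)]
print(n_grpo_advantages([1.0, 0.0, 1.0], gains))
&lt;/code&gt;&lt;/pre&gt;</description>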
    </item>
    <item>
      <title>Detoxification for LLM: From Dataset Itself</title>
      <link>https://ftxj.github.io/posts/2026-04-21/08-detoxification-for-llm-from-dataset-itself/</link>
      <pubDate>Mon, 27 Apr 2026 05:21:19 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/08-detoxification-for-llm-from-dataset-itself/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19124v1&#34;&gt;2604.19124&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19124v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu, Jiafeng Guo, Xueqi Cheng&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, inference, serving, fine-tun, post-train&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes HSPD, a pipeline that detoxifies LLM pretraining corpora at the source by rewriting toxic spans with a Soft Contrastive Decoding (SoCD) method, yielding a drop-in replacement dataset that cuts downstream model toxicity while preserving semantics.&lt;/p&gt;
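&lt;p&gt;The abstract names Soft Contrastive Decoding (SoCD) without defining it; below is a hedged sketch of the generic contrastive-decoding recipe such a rewriter plausibly builds on: steer next-token choices away from the preferences of a toxicity-prone model. The mixing rule and &lt;code&gt;alpha&lt;/code&gt; are assumptions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;
# Contrastive decoding for span rewriting: subtract a scaled copy of the
# toxic model's logits from the base model's logits before sampling.
import math

def contrastive_logits(base_logits, toxic_logits, alpha=0.5):
    return [b - alpha * t for b, t in zip(base_logits, toxic_logits)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

vocab = ["jerk", "person", "friend"]
base = [2.0, 1.5, 1.0]    # base LM next-token logits (assumed values)
toxic = [3.0, 0.5, 0.2]   # logits from a toxicity-conditioned model
probs = softmax(contrastive_logits(base, toxic))
print(max(zip(probs, vocab)))  # the non-toxic alternative now wins
&lt;/code&gt;&lt;/pre&gt;</description>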
    </item>
    <item>
      <title>SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving</title>
      <link>https://ftxj.github.io/posts/2026-04-21/07-saw-int4-system-aware-4-bit-kv-cache-quantization-for-real-w/</link>
      <pubDate>Mon, 27 Apr 2026 05:20:49 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/07-saw-int4-system-aware-4-bit-kv-cache-quantization-for-real-w/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19157v1&#34;&gt;2604.19157&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19157v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu, Tianyi Zhang, Xiaoxia Wu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, serving, kv-cache, quantization, attention, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAW-INT4 proposes token-wise INT4 KV-cache quantization with block-diagonal Hadamard rotation, which the authors argue is the simplest scheme compatible with paged memory and fused attention in real LLM serving. A fused rotation-quantization kernel matches plain INT4 throughput while recovering nearly all accuracy lost to naive INT4.&lt;/p&gt;
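&lt;p&gt;A toy sketch of the recipe named in the TL;DR: rotate each KV vector with a block-diagonal Hadamard transform to spread outliers, then quantize per token to INT4 with one scale. The 4-wide blocks and symmetric round-to-nearest are illustrative assumptions; the fused kernel is not shown.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;
# Block-diagonal Hadamard rotation followed by per-token INT4 quantization.
import numpy as np

H4 = np.array([[1, 1, 1, 1],
               [1, -1, 1, -1],
               [1, 1, -1, -1],
               [1, -1, -1, 1]], dtype=np.float32) / 2.0  # orthonormal 4x4 Hadamard

def rotate_blockwise(v, block=4):
    return np.concatenate([H4 @ v[i:i + block] for i in range(0, len(v), block)])

def int4_quant(v):
    scale = np.abs(v).max() / 7.0 or 1.0           # symmetric INT4 range [-8, 7]
    q = np.clip(np.round(v / scale), -8, 7)
    return q.astype(np.int8), scale

def int4_dequant(q, scale, block=4):
    v = q.astype(np.float32) * scale
    # Invert the orthonormal rotation block by block.
    return np.concatenate([H4.T @ v[i:i + block] for i in range(0, len(v), block)])

key = np.array([0.1, -0.2, 8.0, 0.05, 0.3, -0.1, 0.2, 0.0], dtype=np.float32)  # one outlier
q, s = int4_quant(rotate_blockwise(key))
print(np.abs(key - int4_dequant(q, s)).max())  # rotation limits the outlier's damage
&lt;/code&gt;&lt;/pre&gt;</description>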
    </item>
    <item>
      <title>If you&#39;re waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-21/06-if-you-re-waiting-for-a-sign-that-might-not-be-it-mitigating/</link>
      <pubDate>Mon, 27 Apr 2026 05:20:15 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/06-if-you-re-waiting-for-a-sign-that-might-not-be-it-mitigating/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19844v1&#34;&gt;2604.19844&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19844v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jiamin Chang, Minhui Xue, Ruoxi Sun, Shuchao Pang, Salil S. Kanhere, Hammond Pearce&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CV&lt;/code&gt; · all: cs.AI, cs.CV&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; agent, agentic, multi-agent, serving, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper identifies &amp;ldquo;trust boundary confusion&amp;rdquo; in Vision-Language Agentic Systems (VLAS), where agents fail to distinguish legitimate environmental signals (e.g., traffic lights) from adversarial visual injections. The authors propose a multi-agent defense that separates perception from decision-making, improving robustness while preserving responsiveness to genuine cues.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine</title>
      <link>https://ftxj.github.io/posts/2026-04-21/05-statistics-not-scale-modular-medical-dialogue-with-bayesian/</link>
      <pubDate>Mon, 27 Apr 2026 05:19:41 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/05-statistics-not-scale-modular-medical-dialogue-with-bayesian/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20022v1&#34;&gt;2604.20022&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20022v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.LG&lt;/code&gt; · all: cs.AI, cs.CL, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;BMBE splits medical dialogue into an LLM &amp;ldquo;sensor&amp;rdquo; that parses utterances and a deterministic Bayesian engine that handles all diagnostic inference, yielding calibrated, private, and robust diagnoses while beating frontier standalone LLMs at a fraction of the cost.&lt;/p&gt;
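&lt;p&gt;A minimal sketch of the division of labour described above: the LLM acts only as a &amp;ldquo;sensor&amp;rdquo; that turns an utterance into structured findings, while a deterministic Bayesian engine owns all diagnostic inference. The disease priors, likelihood table, and keyword parser are illustrative stand-ins.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;
# Sequential Bayes updates over candidate diagnoses, driven by findings
# extracted from free text. llm_sensor is a stand-in for the LLM parser.

PRIORS = {"flu": 0.05, "malaria": 0.01, "cold": 0.20}
LIKELIHOOD = {  # assumed P(finding | disease) table
    "fever": {"flu": 0.90, "malaria": 0.95, "cold": 0.30},
    "cough": {"flu": 0.80, "malaria": 0.20, "cold": 0.85},
}

def llm_sensor(utterance):
    # A real system would prompt an LLM; here, naive keyword spotting.
    return [s for s in LIKELIHOOD if s in utterance.lower()]

def bayes_update(priors, findings):
    post = dict(priors)
    for f in findings:
        post = {d: p * LIKELIHOOD[f][d] for d, p in post.items()}
        z = sum(post.values())
        post = {d: p / z for d, p in post.items()}  # renormalise after each finding
    return post

print(bayes_update(PRIORS, llm_sensor("I have a fever and a bad cough")))
&lt;/code&gt;&lt;/pre&gt;</description>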
    </item>
    <item>
      <title>A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding</title>
      <link>https://ftxj.github.io/posts/2026-04-21/04-a-mar-agent-based-multimodal-art-retrieval-for-fine-grained/</link>
      <pubDate>Mon, 27 Apr 2026 05:19:13 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/04-a-mar-agent-based-multimodal-art-retrieval-for-fine-grained/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19689v1&#34;&gt;2604.19689&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19689v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, retrieval, reasoning, ai system&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;A-MAR is an agent-based multimodal retrieval framework that decomposes artwork queries into structured reasoning plans, then conditions retrieval on each step to produce grounded, interpretable explanations. It outperforms static retrieval and MLLM baselines on SemArt, Artpedia, and a new ArtCoT-QA benchmark.&lt;/p&gt;
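&lt;p&gt;A hedged sketch of plan-conditioned retrieval as the TL;DR describes it: decompose the query into reasoning steps, retrieve against each step, and keep the per-step evidence as the explanation. The planner and lexical retriever below are hypothetical stand-ins for the paper&amp;rsquo;s LLM planner and multimodal retriever.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;
# Plan-then-retrieve loop: each reasoning step conditions its own retrieval,
# so the final answer carries a step-by-step evidence trace.

def plan_steps(query):
    # Stand-in for the LLM planner; a real system would prompt an LLM here.
    return [f"identify the subject of: {query}",
            f"find iconographic context for: {query}",
            f"link style and period for: {query}"]

def retrieve(step, corpus, k=1):
    # Toy lexical retriever; A-MAR conditions a multimodal retriever per step.
    scored = sorted(corpus, key=lambda doc: -sum(w in doc for w in step.split()))
    return scored[:k]

def a_mar_answer(query, corpus):
    # Grounded, per-step evidence rather than one opaque retrieval hit.
    return {step: retrieve(step, corpus) for step in plan_steps(query)}

corpus = ["saint jerome iconography lion", "baroque period style", "subject portrait man"]
print(a_mar_answer("who is the subject of this portrait", corpus))
&lt;/code&gt;&lt;/pre&gt;</description>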
    </item>
    <item>
      <title>Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms</title>
      <link>https://ftxj.github.io/posts/2026-04-21/03-rethinking-scale-deployment-trade-offs-of-small-language-mod/</link>
      <pubDate>Mon, 27 Apr 2026 05:18:43 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/03-rethinking-scale-deployment-trade-offs-of-small-language-mod/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19299v1&#34;&gt;2604.19299&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19299v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xinlin Wang, Mats Brorsson&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.AI, cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, multi-agent, tool use, reasoning, latency, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper presents the first large-scale empirical study of sub-10B open-source SLMs across three deployment paradigms—base, single-agent with tools, and multi-agent collaboration—finding that single-agent systems offer the best cost/performance balance while multi-agent setups add overhead with limited gains.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;SLMs (&amp;lt;10B params) are viable LLM alternatives when their weaknesses are compensated for by agent paradigms rather than by pure scaling or fine-tuning.&lt;/li&gt;&#xA;&lt;li&gt;Tool-augmented single agents systematically outperform base SLMs at modest extra cost.&lt;/li&gt;&#xA;&lt;li&gt;Multi-agent collaboration yields diminishing returns relative to its computational overhead.&lt;/li&gt;&#xA;&lt;li&gt;Deployment efficiency is a first-class design criterion for trustworthy SLM systems.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The authors benchmark open-source SLMs under three paradigms: (1) bare base model, (2) a single agent equipped with external tools, and (3) a multi-agent collaborative system. They compare performance and cost across these configurations, though the abstract does not specify which tools, orchestration framework, or agent protocols are used.&lt;/p&gt;
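&lt;p&gt;A skeleton of the three-way comparison the study runs, under stated assumptions: the &lt;code&gt;run_*&lt;/code&gt; callables are hypothetical stand-ins (the abstract does not specify tools or orchestration), and token count serves as a crude cost proxy.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;
# Harness skeleton: score each deployment paradigm on accuracy and token cost.
from dataclasses import dataclass

@dataclass
class Result:
    correct: bool
    tokens: int  # proxy for deployment cost

def evaluate(paradigm_fn, tasks):
    results = [paradigm_fn(t) for t in tasks]
    acc = sum(r.correct for r in results) / len(results)
    return acc, sum(r.tokens for r in results)

# Placeholder paradigms; real runs would call the SLM and its agent stack.
def run_base(task): return Result(correct=False, tokens=300)         # bare SLM
def run_single_agent(task): return Result(correct=True, tokens=900)  # SLM + tools
def run_multi_agent(task): return Result(correct=True, tokens=4000)  # agent team

tasks = range(10)
for name, fn in [("base", run_base), ("single-agent", run_single_agent),
                 ("multi-agent", run_multi_agent)]:
    acc, cost = evaluate(fn, tasks)
    print(f"{name}: acc={acc:.0%} tokens={cost}")  # cost/performance trade-off
&lt;/code&gt;&lt;/pre&gt;</description>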
    </item>
    <item>
      <title>GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models</title>
      <link>https://ftxj.github.io/posts/2026-04-21/02-grasprune-global-gating-for-budgeted-structured-pruning-of-l/</link>
      <pubDate>Mon, 27 Apr 2026 05:18:10 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/02-grasprune-global-gating-for-budgeted-structured-pruning-of-l/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19398v1&#34;&gt;2604.19398&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19398v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ziyang Wang, Jiangfeng Xiao, Chuan Xiao, Ruoxiang Li, Rui Mao, Jianbin Qin&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, rag, inference, kv cache, attention, gpu, latency, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;GRASPrune is a post-pretraining structured pruning framework that jointly prunes FFN channels and KV head groups under a single global budget using projected straight-through gate learning, producing a smaller dense checkpoint without fine-tuning the backbone.&lt;/p&gt;
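&lt;p&gt;A hedged sketch of projected straight-through gate learning as the TL;DR names it: learnable gates over prunable units (FFN channels, KV head groups), binarised with a straight-through estimator and projected so the kept units fit one global budget. The shapes, the greedy projection, and the unit costs are assumptions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;
# Straight-through gates plus a global-budget projection over prunable units.
import torch

def straight_through(gate_logits):
    hard = (gate_logits &amp;gt; 0).float()
    # Forward pass uses hard 0/1 gates; gradients flow through the soft sigmoid.
    return hard + gate_logits.sigmoid() - gate_logits.sigmoid().detach()

def project_to_budget(gate_logits, costs, budget):
    # Keep the highest-scoring units whose summed cost fits the global budget.
    order = torch.argsort(gate_logits, descending=True)
    keep, spent = torch.zeros_like(gate_logits), 0.0
    for i in order:
        if spent + costs[i] &amp;lt;= budget:
            keep[i] = 1.0
            spent += float(costs[i])
    return keep

logits = torch.randn(8, requires_grad=True)   # 8 prunable units (channels/head groups)
costs = torch.ones(8)                         # parameter cost per unit
mask = project_to_budget(logits, costs, budget=5.0)
out = (straight_through(logits) * mask).sum() # differentiable w.r.t. the gate logits
out.backward()
print(mask, logits.grad)
&lt;/code&gt;&lt;/pre&gt;</description>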
    </item>
    <item>
      <title>ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration</title>
      <link>https://ftxj.github.io/posts/2026-04-21/01-chipcraftbrain-validation-first-rtl-generation-via-multi-age/</link>
      <pubDate>Mon, 27 Apr 2026 05:17:33 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-21/01-chipcraftbrain-validation-first-rtl-generation-via-multi-age/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.19856v1&#34;&gt;2604.19856&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.19856v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Cagri Eryilmaz&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AR&lt;/code&gt; · all: cs.AI, cs.AR, cs.LG&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, multi-agent, retrieval, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;ChipCraftBrain is a multi-agent RTL generation framework combining PPO-driven orchestration, symbolic-neural reasoning, and knowledge retrieval. It hits 97.2% pass@1 on VerilogEval-Human and 94.7% on a 302-problem CVDP subset, outperforming MAGE and matching ChipAgents while using far fewer attempts than NVIDIA&amp;rsquo;s ACE-RTL.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Adaptive orchestration of six specialized agents via a PPO policy over a 168-dim state (with an MPC world-model alternative).&lt;/li&gt;&#xA;&lt;li&gt;Hybrid symbolic-neural architecture: algorithmic solvers for K-maps/truth tables, neural agents for waveforms and general RTL.&lt;/li&gt;&#xA;&lt;li&gt;Knowledge-augmented retrieval from 321 patterns + 971 open-source reference implementations with focus-aware lookup.&lt;/li&gt;&#xA;&lt;li&gt;Hierarchical spec decomposition into dependency-ordered sub-modules with interface synchronization.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;A PPO-trained controller learns to route tasks among six agents based on the problem state. Symbolic solvers handle combinational logic exactly; neural agents handle timing/waveforms. A retrieval module injects reference patterns. Complex specs are decomposed hierarchically with cross-module interface synchronization before code generation and validation.&lt;/p&gt;
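&lt;p&gt;A sketch of the orchestration layer under stated assumptions: a PPO policy maps the 168-dim problem state to one of six agents. The network shape and the clipped surrogate loss are textbook PPO; the state features and reward signal are not specified in the summary above.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;
# PPO-style router: a small policy network picks which of six specialised
# agents should handle the current RTL sub-task.
import torch
import torch.nn as nn

N_AGENTS, STATE_DIM = 6, 168

policy = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.Tanh(),
                       nn.Linear(128, N_AGENTS))  # logits over the six agents

def select_agent(state):
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    return action, dist.log_prob(action)

def ppo_loss(logp_new, logp_old, advantage, clip=0.2):
    # Standard clipped surrogate objective used to train the router.
    ratio = (logp_new - logp_old).exp()
    return -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - clip, 1 + clip) * advantage).mean()

state = torch.randn(STATE_DIM)  # placeholder 168-dim problem state
action, logp = select_agent(state)
loss = ppo_loss(logp, logp.detach(), advantage=torch.tensor(1.0))
print(f"route to agent {action.item()}, loss={loss.item():.3f}")
&lt;/code&gt;&lt;/pre&gt;</description>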
    </item>
  </channel>
</rss>
