2026-05-29 Paper Digest

845 arXiv papers on agent / LLM / AI infra submitted that day matched our topic filter. Claude hand-picked 10 of them based on title, authors, and affiliations, and each received a full Claude-generated analysis; the remaining 835 are listed at the bottom.

1. Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

arXiv: 2602.00994 · cs.AI · Claude pick

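In agentic RL, reasoning and tool use share parameters, which produces conflicting gradient directions and degrades joint optimization. The authors quantify this interference and propose DART, which routes the two kinds of gradients to two separate LoRA adapters and outperforms all joint-optimization baselines across 13 benchmarks.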

Read detailed analysis →
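
One common way to check whether two objectives pull shared parameters in opposing directions is the cosine similarity between their gradients. The sketch below illustrates only that general idea; the function name `gradient_cosine` and the flattened-list gradient format are assumptions for illustration, and the paper's actual interference metric may differ.

```python
import math

def gradient_cosine(g_a, g_b):
    """Cosine similarity between two flattened gradient vectors.

    Hypothetical helper: a value below 0 means the two objectives push
    the shared parameters in conflicting directions.
    """
    if len(g_a) != len(g_b):
        raise ValueError("gradients must have the same length")
    dot = sum(x * y for x, y in zip(g_a, g_b))
    norm_a = math.sqrt(sum(x * x for x in g_a))
    norm_b = math.sqrt(sum(y * y for y in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no direction to compare against
    return dot / (norm_a * norm_b)
```

Under this reading, a negative value for the reasoning and tool-use gradients on the same parameters would indicate the kind of conflict that separate adapters are meant to avoid.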

2. The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

arXiv: 2605.29491 · cs.AI · Claude pick

Larger LLMs are systematically less robust to instruction-like noise embedded in reference text — a “Curse of Helpfulness” — which the new DistractionIF benchmark quantifies; GRPO-based RL partially recovers up to 15.5% robustness without hurting general instruction following.

Read detailed analysis →

3. Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

arXiv: 2605.24846 · cs.LG · Claude pick

A tiny, cross-task subset of neurons (< 0.2% of all neurons) called “keystone neurons” can be identified in open-weight LLMs with just four prompts; removing them collapses all model capabilities, while fine-tuning only them matches or exceeds full-parameter fine-tuning.

Read detailed analysis →

4. RTP-LLM: High-Performance Alibaba LLM Inference Engine

arXiv: 2605.29639 · cs.OS · Claude pick

RTP-LLM is Alibaba’s production LLM inference engine, serving 100M+ users, that integrates prefill-decode disaggregation, multi-tiered KV cache, speculative decoding, and model-loading optimizations to deliver 4.7×–6.3× faster loading, 35–40% latency reduction, and substantial throughput gains over vLLM and SGLang.

Read detailed analysis →

5. GrepSeek: Training Search Agents for Direct Corpus Interaction

arXiv: 2605.29307 · cs.CL · Claude pick

GrepSeek trains a compact LLM to search large text corpora by issuing shell commands (rg, grep) directly against raw text, bypassing pre-computed indices, using a cold-start SFT + GRPO two-stage pipeline and a 7.6× sharded-parallel execution engine.

Read detailed analysis →

6. SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

arXiv: 2604.09557 · cs.DC · Claude pick

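SPEED-Bench is a comprehensive evaluation suite designed specifically for speculative decoding; through semantic-diversity-driven data curation and integration with production-grade engines, it addresses the systematic shortcomings of existing benchmarks in diversity, throughput evaluation, and representativeness of real-world environments.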

Read detailed analysis →

7. ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

arXiv: 2604.13519 · cs.CL · Claude pick

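ToolSpec is a training-free speculative decoding method that uses a finite-state machine over the predefined tool schema to generate draft tokens deterministically, combined with retrieval over historical calls, speeding up tool-call generation by up to 4.2×.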

Read detailed analysis →
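
The schema-aware part of this idea can be sketched at the string level: once the tool and its parameter names are known, the JSON text between argument values is fully determined, so it can be proposed as a draft without calling a draft model. This is a simplified illustration, not ToolSpec's implementation; the JSON call layout and the helpers `build_segments`, `draft_next`, and `assemble` are assumptions, and the retrieval component and token-level verification are omitted.

```python
import json

def build_segments(tool_name, param_names):
    """Fixed text segments of a JSON tool call (assumed layout).

    A free-form argument value sits between consecutive segments.
    """
    segments = []
    prefix = '{"name": ' + json.dumps(tool_name) + ', "arguments": {'
    for i, param in enumerate(param_names):
        separator = "" if i == 0 else ", "
        segments.append(prefix + separator + json.dumps(param) + ": ")
        prefix = ""
    segments.append(prefix + "}}")
    return segments

def draft_next(segments, values_done):
    """Deterministic draft: the fixed segment after `values_done` values."""
    return segments[values_done]

def assemble(segments, values):
    """Interleave fixed segments with model-generated argument values."""
    parts = []
    for segment, value in zip(segments, values):
        parts.append(segment)
        parts.append(json.dumps(value))
    parts.append(segments[-1])
    return "".join(parts)

segments = build_segments("get_weather", ["city", "unit"])
call_text = assemble(segments, ["Paris", "celsius"])
```

In a real speculative decoder the drafted text would be tokenized and verified by the target model; only the argument values would require ordinary decoding.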

8. RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

arXiv: 2603.18859 · cs.AI · Claude pick

RewardFlow builds a state graph from sampled agentic trajectories and propagates BFS-based rewards from success nodes to intermediate states, providing annotation-free dense process rewards that improve RL training across four agentic benchmarks without any reward model.

Read detailed analysis →
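
The propagation step described above can be sketched as a multi-source breadth-first search that runs backward from success states over a graph merged from sampled trajectories. This is a minimal sketch under stated assumptions: the `(states, success)` trajectory format, the function name `propagate_rewards`, and the `gamma ** distance` decay are illustrative choices, not details confirmed by the summary.

```python
from collections import defaultdict, deque

def propagate_rewards(trajectories, gamma=0.9):
    """Assign dense rewards to states by backward BFS from success states.

    trajectories: list of (states, success) pairs, where states is a list
    of hashable state ids (assumed format). Returns {state: reward} for
    every state that can reach a success state in the merged graph.
    """
    predecessors = defaultdict(set)  # reverse edges: state -> states leading to it
    success_nodes = set()
    for states, success in trajectories:
        for src, dst in zip(states, states[1:]):
            predecessors[dst].add(src)
        if success and states:
            success_nodes.add(states[-1])

    # Multi-source BFS: distance = fewest steps needed to reach any success state.
    distance = {node: 0 for node in success_nodes}
    queue = deque(success_nodes)
    while queue:
        node = queue.popleft()
        for prev in predecessors[node]:
            if prev not in distance:
                distance[prev] = distance[node] + 1
                queue.append(prev)

    # Assumed shaping: reward decays geometrically with distance to success.
    return {state: gamma ** d for state, d in distance.items()}

rewards = propagate_rewards(
    [(["s0", "s1", "s2"], True), (["s0", "s3"], False)],
    gamma=0.5,
)
```

States that never reach a success state, such as `s3` here, get no entry and can be treated as zero reward. Because trajectories are merged into one graph, a state visited by a failed rollout still receives credit if another rollout shows a path from it to success.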

9. SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

arXiv: 2605.29796 · cs.AI · Claude pick

SAAS is an RL framework that teaches agentic search models when not to search by dynamically tracking the agent’s evolving knowledge boundary and converting that awareness into discriminative trajectory-level penalties, reducing over-search without accuracy loss.

Read detailed analysis →

10. When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

arXiv: 2605.30102 · cs.MA · Claude pick

This position/workshop paper systematically examines the design space of hybrid multi-agent systems (MAS) that mix cloud-hosted frontier LLMs with on-device SLMs, finding that no single hybrid architecture dominates across tasks and that more cloud compute does not reliably improve performance.

Read detailed analysis →

Other matched papers

These papers matched the same topic keywords but were not among Claude’s top-N deep-analysis picks.

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling · cs.AI · arXiv 2605.29262 · score 32 — large language model, llm, agent, agentic, retrieval, rag
ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows · cs.CV · arXiv 2605.14113 · score 30 — large language model, llm, agent, agentic, retrieval, rag
SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow · cs.CL · arXiv 2605.29368 · score 29 — large language model, llm, agent, multi-agent, retrieval, reasoning
BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices · cs.AI · arXiv 2605.29705 · score 28 — large language model, llm, multi-agent, rag, reasoning, inference
Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation · cs.AI · arXiv 2605.29873 · score 27 — large language model, llm, reasoning, serving, kv cache, attention
Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection · cs.AI · arXiv 2605.30042 · score 23 — large language model, llm, agent, multi-agent, rag, serving
AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios · cs.AI · arXiv 2605.27995 · score 28 — large language model, llm, agent, tool use, tool-use, reasoning
CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems · cs.MA · arXiv 2605.29612 · score 27 — large language model, llm, agent, multi-agent, rag, latency
MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs · cs.AI · arXiv 2605.29512 · score 26 — large language model, llm, agent, multi-agent, reasoning, inference
CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective · cs.CL · arXiv 2502.03805 · score 26 — large language model, llm, rag, inference, kv cache, attention
Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems · cs.AI · arXiv 2605.29676 · score 21 — large language model, llm, agent, agentic, ai system
KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning · cs.AI · arXiv 2605.30002 · score 25 — large language model, llm, agent, agentic, rag, reasoning
Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization · cs.MA · arXiv 2605.30227 · score 25 — large language model, llm, agent, multi-agent, rag, reasoning
MediHive: A Decentralized Agent Collective for Medical Reasoning · cs.AI · arXiv 2603.27150 · score 21 — large language model, llm, agent, multi-agent, rag, reasoning
MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration · cs.AI · arXiv 2604.14889 · score 21 — llm, rag, reasoning, chain-of-thought, inference, serving
AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials · cs.AI · arXiv 2510.04704 · score 26 — large language model, llm, agent, agentic, retrieval, reasoning
DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration · cs.MA · arXiv 2605.29511 · score 30 — llm, agent, multi-agent, reasoning, inference, gpu
BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference · cs.LG · arXiv 2605.29233 · score 20 — llm, rag, inference, serving, kv-cache, parallelism
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation · cs.CL · arXiv 2605.29861 · score 24 — large language model, llm, agent, multi-agent, tool use
ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation · cs.CV · arXiv 2604.11080 · score 20 — large language model, llm, rag, inference, quantization, attention
Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction · cs.CL · arXiv 2605.25297 · score 20 — llm, agent, agentic, reasoning, chain-of-thought, gpu
DFlash: Block Diffusion for Flash Speculative Decoding · cs.CL · arXiv 2602.06036 · score 20 — large language model, llm, inference, speculative decoding, gpu, latency
Accelerating Sparse Transformer Inference on GPU · cs.LG · arXiv 2506.06095 · score 20 — large language model, llm, rag, inference, attention, transformer
VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis · cs.AI · arXiv 2605.28978 · score 19 — large language model, llm, agent, multi-agent, reasoning
Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models · cs.AI · arXiv 2605.29625 · score 19 — large language model, llm, agent, multi-agent, attention
Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent · cs.AI · arXiv 2605.29966 · score 19 — large language model, llm, agent, rag, reasoning, fine-tun
MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery · cs.CL · arXiv 2605.29475 · score 19 — large language model, llm, agent, agentic, rag
Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding · cs.CL · arXiv 2605.29707 · score 19 — llm, inference, serving, speculative decoding, transformer, throughput
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems · cs.CL · arXiv 2602.15382 · score 19 — large language model, agent, multi-agent, rag, reasoning, quantization
Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents · cs.LG · arXiv 2605.28850 · score 19 — large language model, llm, agent, reasoning, transformer, fine-tun
Robust and Efficient Guardrails with Latent Reasoning · cs.AI · arXiv 2605.29068 · score 18 — large language model, llm, reasoning, inference, throughput, latency
Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies · cs.CL · arXiv 2605.29712 · score 18 — large language model, llm, retrieval, reasoning, inference, fine-tun
Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots · cs.CR · arXiv 2605.29963 · score 18 — llm, agent, agentic, rag, serving
Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies · cs.LG · arXiv 2605.30148 · score 18 — large language model, llm, inference, serving, fine-tun
E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing · cs.LG · arXiv 2512.03109 · score 18 — llm, agent, agentic, reasoning, ai system
Molecular Lead Optimization via Agentic Tool Planning · cs.LG · arXiv 2605.28862 · score 18 — llm, agent, agentic, reasoning, serving
FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models · cs.LG · arXiv 2511.11505 · score 18 — rag, inference, serving, parallelism, mixture of experts, moe
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization · cs.AI · arXiv 2605.29396 · score 17 — large language model, llm, rag, serving, quantization
Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation · cs.AI · arXiv 2605.29560 · score 17 — large language model, llm, agent, rag, reasoning
VikingMem: A Memory Base Management System for Stateful LLM-based Applications · cs.AI · arXiv 2605.29640 · score 17 — large language model, llm, agent, retrieval, latency
OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation · cs.AI · arXiv 2605.29829 · score 17 — large language model, llm, agent, rag, reasoning
Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning · cs.AI · arXiv 2605.30039 · score 17 — large language model, llm, rag, serving, fine-tun
Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance · cs.AI · arXiv 2605.30187 · score 17 — large language model, llm, agent, agentic
GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling · cs.CL · arXiv 2605.28835 · score 17 — large language model, llm, multi-agent, rag, fine-tun
Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning · cs.CL · arXiv 2605.28842 · score 17 — large language model, llm, reasoning, chain-of-thought, serving
First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope · cs.AI · arXiv 2605.28916 · score 17 — large language model, agent, agentic, ai system
Conf-Gen: Conformal Uncertainty Quantification for Generative Models · cs.LG · arXiv 2605.28920 · score 17 — large language model, llm, agent, ai system
Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents · cs.CL · arXiv 2605.29224 · score 17 — large language model, llm, agent, retrieval, rag
Training Deliberative Monitors for Black-Box Scheming Detection · cs.CL · arXiv 2605.29601 · score 17 — agent, agentic, reasoning, chain-of-thought, inference, fine-tun
Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction · cs.CR · arXiv 2605.29960 · score 17 — large language model, llm, agent, rag, attention
InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents · cs.AI · arXiv 2511.22884 · score 17 — large language model, llm, agent, multi-agent
Small Agent Group is the Future of Digital Health · cs.AI · arXiv 2602.08013 · score 17 — large language model, llm, agent, retrieval, reasoning
FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research · cs.AI · arXiv 2605.27864 · score 17 — large language model, llm, agent, serving
Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence · cs.CR · arXiv 2509.23573 · score 17 — large language model, llm, agent, rag, reasoning
When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making · cs.RO · arXiv 2603.16673 · score 17 — large language model, llm, agent, reasoning, latency
Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies · cs.CL · arXiv 2605.29062 · score 17 — large language model, llm, agent, multi-agent
WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction · cs.CV · arXiv 2605.29341 · score 17 — large language model, agent, agentic, retrieval, rag
ValueFlow: Measuring the Propagation of Value Perturbations in Multi-Agent LLM Systems · cs.MA · arXiv 2602.08567 · score 17 — large language model, llm, agent, multi-agent
RAT+: Train Dense, Infer Sparse – Recurrence Augmented Attention for Dilated Inference · cs.LG · arXiv 2602.18196 · score 17 — rag, reasoning, inference, serving, kv cache, attention
Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching · cs.AI · arXiv 2605.29055 · score 16 — llm, agent, agentic, multi-agent
DenseSteer: Steering Small Language Models towards Dense Math Reasoning · cs.AI · arXiv 2605.29247 · score 16 — large language model, llm, reasoning, chain-of-thought, inference
Enhancing Multi-Agent Communication through Attention Steering with Context Relevance · cs.AI · arXiv 2605.30136 · score 16 — llm, agent, multi-agent, reasoning, attention
Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models · cs.CV · arXiv 2605.29299 · score 16 — large language model, rag, inference, serving, latency
Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge · cs.CV · arXiv 2605.29402 · score 16 — large language model, llm, retrieval, reasoning, inference
Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage · cs.CR · arXiv 2605.30040 · score 16 — large language model, llm, rag, reasoning, inference
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning · cs.AI · arXiv 2602.23258 · score 16 — agent, multi-agent, retrieval, rag, reasoning, fine-tun
Grammar-Aware Literate Generative Mathematical Programming with Compiler-in-the-Loop · cs.PL · arXiv 2601.17670 · score 16 — large language model, llm, retrieval, rag, compiler
CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs · cs.MA · arXiv 2605.09823 · score 16 — llm, agent, multi-agent, serving
EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI · cs.CL · arXiv 2511.08949 · score 16 — large language model, llm, rag, inference, fine-tun
Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models · cs.AI · arXiv 2605.29303 · score 15 — large language model, reasoning, serving, fine-tun, post-train
NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs · cs.AI · arXiv 2605.29716 · score 15 — large language model, llm, reasoning, latency, fine-tun
Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering · cs.AI · arXiv 2605.29742 · score 15 — large language model, llm, retrieval, rag, reasoning
Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence · cs.AI · arXiv 2605.29744 · score 15 — large language model, llm, multi-agent, reasoning
LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs · cs.AI · arXiv 2605.29756 · score 15 — large language model, llm, quantization, transformer, post-train
From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs · cs.AI · arXiv 2605.30014 · score 15 — large language model, llm, rag, quantization, fine-tun
PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers · cs.AI · arXiv 2605.30094 · score 15 — large language model, llm, agent, rag
Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models · cs.LG · arXiv 2605.28866 · score 15 — large language model, llm, reasoning, serving
Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models · cs.CL · arXiv 2605.30251 · score 15 — large language model, llm, rag, serving
LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback · cs.HC · arXiv 2605.30273 · score 15 — large language model, llm, rag, serving
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving · cs.RO · arXiv 2605.30326 · score 15 — agent, multi-agent, tool use, reasoning, fine-tun
PersonaAgent: Bridging Memory and Action for Personalized LLM Agents · cs.AI · arXiv 2506.06254 · score 15 — large language model, llm, agent, rag
A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models · cs.AI · arXiv 2511.08548 · score 15 — large language model, llm, reasoning, ai system
SCOPE: Prompt Evolution for Enhancing Agent Effectiveness · cs.AI · arXiv 2512.15374 · score 15 — large language model, llm, agent, rag
AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents · cs.AI · arXiv 2602.02849 · score 15 — large language model, llm, agent, reasoning
Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs · cs.AI · arXiv 2602.02909 · score 15 — llm, reasoning, chain-of-thought, inference, attention, latency
MemCollab: Cross-Model Memory Collaboration via Contrastive Trajectory Distillation · cs.AI · arXiv 2603.23234 · score 15 — llm, agent, retrieval, reasoning, inference
Are LLMs Socially Adaptive? Contrasting Belief Evolution in Large Language Models and Humans · cs.CE · arXiv 2410.10398 · score 15 — large language model, llm, agent, reasoning
Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems · cs.CY · arXiv 2501.10332 · score 15 — large language model, llm, agent, rag
GroundAct: Can LLM Agents Ground Actions in Environmental States? · cs.CL · arXiv 2508.05614 · score 15 — llm, agent, tool use, reasoning, fine-tun
Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting · cs.CR · arXiv 2509.23571 · score 15 — large language model, llm, agent, reasoning
Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers · cs.CL · arXiv 2601.22139 · score 15 — large language model, llm, reasoning, chain-of-thought, fine-tun
Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training · cs.LG · arXiv 2603.00454 · score 15 — large language model, llm, serving, fine-tun
Combating Data Laundering in LLM Training · cs.CR · arXiv 2604.01904 · score 15 — large language model, llm, rag, serving
The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown · cs.AI · arXiv 2604.04956 · score 15 — large language model, llm, agent, reasoning
BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving · cs.AI · arXiv 2605.27480 · score 15 — large language model, llm, serving, gpu
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning · cs.CV · arXiv 2605.27959 · score 15 — large language model, llm, rag, reasoning, attention
RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment · cs.CL · arXiv 2605.28827 · score 15 — large language model, llm, quantization, attention, fine-tun
Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction · cs.CL · arXiv 2605.29000 · score 15 — large language model, llm, serving, fine-tun
Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework · cs.CL · arXiv 2605.29397 · score 15 — llm, agent, rag, inference, latency
From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals · cs.CL · arXiv 2605.29555 · score 15 — large language model, llm, retrieval, reasoning, throughput
SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge? · cs.CL · arXiv 2605.30104 · score 15 — llm, agent, tool-use, reasoning, latency
Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge · cs.CL · arXiv 2505.16178 · score 15 — large language model, llm, retrieval, rag, fine-tun
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference · cs.CL · arXiv 2510.24606 · score 15 — llm, inference, serving, attention, gpu
How Far Ahead Do LLMs Plan? Uncovering the Latent Horizon in Chain-of-Thought Reasoning · cs.LG · arXiv 2602.02103 · score 15 — large language model, llm, rag, reasoning, chain-of-thought
K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance · cs.LG · arXiv 2605.29523 · score 15 — large language model, llm, retrieval, rag, fine-tun
OOD-GraphLLM: Graph Large Language Model for Out-of-Distribution Generalized Drug Synergy Prediction · cs.LG · arXiv 2605.30247 · score 15 — large language model, llm, retrieval, rag, reasoning
Echoes within the Reasoning: Stealthy and Effective Watermarking via Chain of Thought · cs.CR · arXiv 2605.28890 · score 15 — large language model, rag, reasoning, chain-of-thought, quantization, fine-tun
The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane · cs.AI · arXiv 2605.29082 · score 14 — agent, agentic, multi-agent, throughput
Beyond Consensus: Trace-Level Synthesis in Mixture of Agents · cs.AI · arXiv 2605.29116 · score 14 — llm, agent, reasoning, serving
Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction · cs.AI · arXiv 2605.29168 · score 14 — llm, retrieval, rag, reasoning, serving
Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces · cs.AI · arXiv 2605.29288 · score 14 — llm, reasoning, chain-of-thought, serving, fine-tun
PassNet: Scaling Large Language Models for Graph Compiler Pass Generation · cs.AI · arXiv 2605.29357 · score 14 — large language model, llm, compiler, fine-tun
Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation · cs.AI · arXiv 2605.29430 · score 14 — llm, agent, agentic, reasoning
ParaTool: Shifting Tool Representations from Context to Parameters · cs.AI · arXiv 2605.29561 · score 14 — large language model, llm, inference, fine-tun
AgentSchool: An LLM-Powered Multi-Agent Simulation for Education · cs.AI · arXiv 2605.30144 · score 14 — llm, agent, multi-agent, reasoning
Hallucination Detection-Guided Preference Optimization for Clinical Summarization · cs.CL · arXiv 2605.28910 · score 14 — large language model, llm, rag, inference
CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models · cs.LG · arXiv 2605.28919 · score 14 — large language model, reasoning, inference, attention, transformer
SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers · cs.SE · arXiv 2605.29059 · score 14 — large language model, llm, reasoning, compiler
Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension · cs.MA · arXiv 2605.29874 · score 14 — llm, agent, multi-agent, rag
Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents · cs.SE · arXiv 2605.29910 · score 14 — llm, agent, multi-agent, reasoning
Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency · cs.SE · arXiv 2605.30208 · score 14 — llm, agent, agentic, latency
Enhancing LLM Medical Coding with Structured External Knowledge · cs.CL · arXiv 2605.27377 · score 14 — llm, agent, agentic, rag
EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation · cs.CL · arXiv 2605.27390 · score 14 — large language model, retrieval, inference, speculative decoding
Draft-OPD: On-Policy Distillation for Speculative Draft Models · cs.CL · arXiv 2605.29343 · score 14 — large language model, inference, speculative decoding, fine-tun
Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting · cs.CL · arXiv 2605.29498 · score 14 — large language model, llm, inference, fine-tun
ActTraitBench: Quantifying the Knowledge-Decision Gap in Large Language Models via Human-Grounded Behavioral Validation · cs.CL · arXiv 2605.29791 · score 14 — large language model, llm, reasoning, inference
CCS: Clinical Consensus Selection for Radiology Report Generation · cs.CL · arXiv 2605.30131 · score 14 — large language model, llm, retrieval, inference
Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning · cs.CL · arXiv 2605.30245 · score 14 — large language model, llm, reasoning, inference
Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning · cs.CL · arXiv 2604.06805 · score 14 — llm, rag, reasoning, chain-of-thought, kv cache
Inferring the Size of Large Language Models From Popular Text Memorization · cs.LG · arXiv 2605.29223 · score 14 — large language model, llm, rag, inference
On the Construction and Implications of Low-Loss Valleys in LoRA-based Bayesian Inference · cs.LG · arXiv 2605.29580 · score 14 — large language model, rag, reasoning, inference, fine-tun
Fingerprinting Inference Systems of Large Language Models · cs.CR · arXiv 2605.29979 · score 14 — large language model, llm, inference, attention
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts · cs.LG · arXiv 2605.15422 · score 14 — parallelism, moe, attention, gpu, cuda, post-train
TC-MIS: Maximal Independent Set on Tensor-cores · cs.DC · arXiv 2605.29604 · score 14 — rag, inference, parallelism, gpu, cuda, throughput
Provably Secure Agent Guardrail · cs.AI · arXiv 2605.29251 · score 13 — large language model, agent, reasoning, latency
ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression · cs.AI · arXiv 2605.29350 · score 13 — rag, serving, moe, fine-tun, post-train
EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics · cs.AI · arXiv 2605.29394 · score 13 — large language model, llm, reasoning, fine-tun
When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs · cs.AI · arXiv 2605.29420 · score 13 — large language model, llm, retrieval, rag
VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data · cs.AI · arXiv 2605.29483 · score 13 — agent, agentic, tool use, reasoning
LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning · cs.AI · arXiv 2605.29649 · score 13 — large language model, llm, rag, reasoning
TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation · cs.AI · arXiv 2605.29656 · score 13 — large language model, llm, reasoning, chain-of-thought
Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability · cs.AI · arXiv 2605.29687 · score 13 — large language model, llm, reasoning, chain-of-thought
Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories · cs.AI · arXiv 2605.29893 · score 13 — llm, agent, tool use, reasoning
ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure · cs.AI · arXiv 2605.30284 · score 13 — large language model, llm, retrieval, reasoning
Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models · cs.CL · arXiv 2605.28828 · score 13 — large language model, llm, retrieval, reasoning
How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines · cs.CL · arXiv 2605.28840 · score 13 — large language model, llm, agent
GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human · cs.CL · arXiv 2605.28882 · score 13 — large language model, llm, agent
Sustainable Metal-Organic Framework Water Harvesters in the Artificial Intelligence Era · cs.AI · arXiv 2605.29179 · score 13 — large language model, llm, serving
KBF: Knowledge Boundary as Fingerprint for Language Model and Black-Box API Auditing · cs.CR · arXiv 2605.29524 · score 13 — large language model, llm, serving
SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring · cs.LG · arXiv 2605.29543 · score 13 — large language model, llm, reasoning, latency
Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content · cs.LG · arXiv 2605.29659 · score 13 — large language model, llm, serving
Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning · cs.LG · arXiv 2605.29782 · score 13 — large language model, llm, rag, post-train
Inferring Code Correctness from Specification · cs.SE · arXiv 2605.29822 · score 13 — large language model, llm, reasoning, chain-of-thought
LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training · cs.LG · arXiv 2605.29888 · score 13 — large language model, llm, reasoning, post-train
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies · cs.CV · arXiv 2605.30011 · score 13 — reasoning, chain-of-thought, inference, serving, latency
Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection · cs.CL · arXiv 2605.30274 · score 13 — large language model, agent, rag, reasoning
PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data · cs.AI · arXiv 2508.15180 · score 13 — large language model, llm, rag, reasoning
EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance · cs.AI · arXiv 2509.23730 · score 13 — large language model, llm, rag, reasoning
Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting · cs.AI · arXiv 2510.02480 · score 13 — large language model, llm, rag, attention
CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization · cs.AI · arXiv 2510.14150 · score 13 — large language model, llm, agent
Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation · cs.AI · arXiv 2604.10511 · score 13 — large language model, llm, reasoning, chain-of-thought
HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models · cs.AI · arXiv 2605.24140 · score 13 — large language model, llm, reasoning, fine-tun
Soro: A Lightweight Foundation Model and Chatbot for Tajik · cs.AI · arXiv 2605.27379 · score 13 — large language model, llm, rag, quantization
The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic · cs.AI · arXiv 2605.28700 · score 13 — large language model, llm, rag, reasoning
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models · cs.CR · arXiv 2410.15236 · score 13 — large language model, llm, multi-agent
Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders · cs.CL · arXiv 2602.10388 · score 13 — large language model, llm, rag, post-train
A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search · cs.CL · arXiv 2602.11171 · score 13 — large language model, llm, rag, fine-tun
Steering at the Source: Style Modulation Heads for Robust Persona Control · cs.CL · arXiv 2603.13249 · score 13 — large language model, llm, attention, fine-tun
P$^2$RAG: Efficient Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval · cs.CR · arXiv 2603.14778 · score 13 — large language model, retrieval, rag, serving
Bridge-RAG: An Abstract Bridge Tree Based Retrieval Augmented Generation Algorithm · cs.IR · arXiv 2603.26668 · score 13 — large language model, llm, retrieval, rag
Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence · cs.LG · arXiv 2605.13230 · score 13 — large language model, llm, reasoning, post-train
Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning · cs.CV · arXiv 2605.16385 · score 13 — llm, rag, reasoning, inference, attention
GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization · cs.LG · arXiv 2605.26092 · score 13 — large language model, llm, quantization, transformer
MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models · cs.CL · arXiv 2605.28825 · score 13 — large language model, llm, rag, reasoning
Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception · cs.CL · arXiv 2605.29064 · score 13 — large language model, llm, agent
Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation · cs.CL · arXiv 2605.29992 · score 13 — retrieval, inference, serving, transformer, gpu
PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning · cs.LG · arXiv 2605.29582 · score 13 — large language model, llm, agent
Catalyst-Agent: Autonomous heterogeneous catalyst screening with an LLM Agent · cs.CL · arXiv 2603.01311 · score 13 — llm, agent, tool use, rag
OpenCompass: A Universal Evaluation Platform for Large Language Models · cs.CL · arXiv 2605.19276 · score 13 — large language model, llm, rag, reasoning
OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents · cs.CL · arXiv 2605.23657 · score 13 — large language model, llm, agent
Feedback-to-Rubrics: Can We Learn Expert Criteria from Inline Comments? · cs.LG · arXiv 2605.29857 · score 13 — large language model, llm, serving
Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets · cs.LG · arXiv 2605.30289 · score 13 — large language model, retrieval, serving, transformer
SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones? · cs.LG · arXiv 2605.30329 · score 13 — large language model, llm, agent
Dissecting the Black Box: Circuit-Level Analysis of LLM Vulnerability Detection · cs.CR · arXiv 2605.29901 · score 13 — large language model, llm, reasoning, attention
Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding · cs.LG · arXiv 2511.04934 · score 13 — large language model, llm, ai system
SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs · cs.DC · arXiv 2603.00357 · score 13 — llm, training system, parallelism, gpu
Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes · cs.AI · arXiv 2605.28965 · score 12 — llm, agent, agentic
Governing Technical Debt in Agentic AI Systems · cs.AI · arXiv 2605.29129 · score 12 — agent, agentic, ai system
CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval · cs.AI · arXiv 2605.29271 · score 12 — llm, agent, retrieval, fine-tun
Formalizing Mathematics at Scale · cs.AI · arXiv 2605.29955 · score 12 — llm, agent, multi-agent
Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison · cs.AI · arXiv 2605.30087 · score 12 — llm, agent, rag, reasoning
SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation · cs.CL · arXiv 2605.29146 · score 12 — llm, agent, multi-agent
GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models · cs.LG · arXiv 2605.29398 · score 12 — large language model, llm, inference
SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents · cs.CL · arXiv 2605.29440 · score 12 — llm, agent, retrieval, rag
Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation · cs.CL · arXiv 2605.29502 · score 12 — llm, rag, serving, fine-tun
Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems · cs.MA · arXiv 2605.29790 · score 12 — llm, agent, multi-agent
HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization · cs.LG · arXiv 2605.29843 · score 12 — llm, serving, quantization, post-train
Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas · cs.MA · arXiv 2605.30003 · score 12 — llm, agent, multi-agent
No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval · cs.IR · arXiv 2605.30120 · score 12 — retrieval, rag, serving, throughput, latency
Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms · cs.CY · arXiv 2605.30169 · score 12 — agent, agentic, multi-agent
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning · cs.CV · arXiv 2605.30231 · score 12 — llm, rag, reasoning, transformer, fine-tun
Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents · cs.AI · arXiv 2602.01869 · score 12 — llm, agent, rag, reasoning
SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data · cs.AI · arXiv 2604.26645 · score 12 — agent, agentic, multi-agent
From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges · cs.CL · arXiv 2601.08654 · score 12 — large language model, llm, inference
Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR · cs.CL · arXiv 2602.12642 · score 12 — llm, rag, reasoning, scheduler, post-train
PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration · cs.CL · arXiv 2605.29313 · score 12 — llm, agent, multi-agent
Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design · cs.CL · arXiv 2605.29421 · score 12 — llm, agent, agentic
UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering · cs.CL · arXiv 2605.30076 · score 12 — large language model, llm, inference
Rare Event Analysis of Large Language Models · cs.LG · arXiv 2602.06791 · score 12 — large language model, llm, inference
HPC-vQPU: A Service-Export Architecture for Virtual QPUs on Batch-Scheduled HPC Systems · cs.DC · arXiv 2605.28845 · score 12 — agent, serving, gpu, scheduler
Mind Your Tone: Does Tone Alter LLM Performance? · cs.AI · arXiv 2605.29027 · score 11 — large language model, llm, reasoning
GTA: Generating Long-Horizon Tasks for Web Agents at Scale · cs.AI · arXiv 2605.29218 · score 11 — agent, tool-use, retrieval, rag
Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility · cs.AI · arXiv 2605.29229 · score 11 — large language model, llm, reasoning
PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing · cs.AI · arXiv 2605.29815 · score 11 — large language model, llm, rag
Harnessing non-adversarial robustness in large language models · cs.AI · arXiv 2605.29816 · score 11 — large language model, llm, fine-tun
Make LLM Learn to Synthesize from Streaming Experiences through Feedback · cs.AI · arXiv 2605.29940 · score 11 — large language model, llm, rag
Anchorless Diversification for Parallel LLM Ideation · cs.AI · arXiv 2605.30150 · score 11 — llm, inference, serving
When Should Models Change Their Minds? Contextual Belief Management in Large Language Models · cs.AI · arXiv 2605.30219 · score 11 — large language model, llm, rag
S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering · cs.CL · arXiv 2605.28831 · score 11 — agent, retrieval, rag, inference
SERC: LDPC-Inspired Semantic Error Correction for Retrieval-Augmented Generation · cs.CL · arXiv 2605.28837 · score 11 — large language model, llm, retrieval
GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models · cs.CL · arXiv 2605.28848 · score 11 — large language model, llm, retrieval
Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT? · cs.LG · arXiv 2605.28860 · score 11 — large language model, llm, fine-tun
Label-Free Reinforcement Learning via Cross-Model Entropy · cs.LG · arXiv 2605.29009 · score 11 — large language model, llm, post-train
Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text · cs.CL · arXiv 2605.29076 · score 11 — llm, reasoning, inference, fine-tun
CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control · cs.RO · arXiv 2605.29155 · score 11 — inference, serving, cuda, latency
Parallax: Parameterized Local Linear Attention for Language Modeling · cs.LG · arXiv 2605.29157 · score 11 — large language model, llm, attention
UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning · cs.CL · arXiv 2605.29170 · score 11 — large language model, llm, reasoning
DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents · cs.CL · arXiv 2605.29256 · score 11 — large language model, agent, rag
Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning · cs.CL · arXiv 2605.29414 · score 11 — large language model, llm, rag
Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment · cs.CL · arXiv 2605.29458 · score 11 — large language model, llm, reasoning
Projectional Decoding: Towards Semantic-Aware LLM Generation · cs.SE · arXiv 2605.30054 · score 11 — large language model, llm, reasoning
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings · cs.CL · arXiv 2605.30295 · score 11 — large language model, llm, reasoning
In-Context Reward Adaptation for Robust Preference Modeling · cs.LG · arXiv 2605.30323 · score 11 — large language model, rag, transformer, rlhf
LsrIF: Enhancing Logic-Structured Instruction Following of Large Language Models · cs.AI · arXiv 2601.06431 · score 11 — large language model, rag, reasoning, attention
IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents · cs.AI · arXiv 2604.05157 · score 11 — large language model, agent, rag
Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling · cs.AI · arXiv 2604.25098 · score 11 — large language model, llm, reasoning
Hierarchical Task Network Planning with LLM-Generated Heuristics · cs.AI · arXiv 2605.07707 · score 11 — large language model, llm, rag
Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives · cs.GT · arXiv 2505.21627 · score 11 — large language model, llm, rag
An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning · cs.CV · arXiv 2511.10861 · score 11 — rag, inference, serving, fine-tun
Differential syntactic and semantic encoding in LLMs · cs.CL · arXiv 2601.04765 · score 11 — large language model, llm, rag
Thinking Before Constraining: A Unified Decoding Framework for Large Language Models · cs.CL · arXiv 2601.07525 · score 11 — large language model, llm, reasoning
Who can we trust? LLM-as-a-jury for Comparative Assessment · cs.CL · arXiv 2602.16610 · score 11 — large language model, llm, rag
JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments · cs.CV · arXiv 2602.18527 · score 11 — large language model, llm, reasoning
Maximizing Mutual Information Between Prompt and Response Improves LLM Performance With No Additional Data · cs.LG · arXiv 2603.19294 · score 11 — large language model, llm, post-train
The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More · cs.CL · arXiv 2603.23971 · score 11 — agent, rag, reasoning, inference
SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits · cs.CR · arXiv 2604.01473 · score 11 — large language model, llm, latency
DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories · cs.CL · arXiv 2604.20443 · score 11 — llm, rag, reasoning, inference
Lightweight Multimodal LLM-Enabled Cost-Effective Defect Grading of Power Transmission Equipment · cs.CL · arXiv 2605.28822 · score 11 — large language model, llm, fine-tun
The Trust Paradox: How CS Researchers Engage LLM Leaderboards · cs.CL · arXiv 2605.28966 · score 11 — large language model, llm, rag
Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization · cs.CL · arXiv 2605.29327 · score 11 — large language model, llm, reasoning
FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions · cs.CL · arXiv 2605.29427 · score 11 — large language model, llm, fine-tun
Comparative Evaluation of Machine Translation Systems on Images with Text · cs.CL · arXiv 2605.29476 · score 11 — large language model, llm, reasoning
Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese · cs.CL · arXiv 2605.29667 · score 11 — large language model, llm, rag
Spurious Prompts: Can Irrelevant Prompts Steer Large Language Models? · cs.CL · arXiv 2605.29678 · score 11 — large language model, llm, reasoning
Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs · cs.CL · arXiv 2605.29708 · score 11 — llm, serving, moe
Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels · cs.CL · arXiv 2605.29800 · score 11 — llm, reasoning, chain-of-thought, inference
Latent Performance Profiling of Large Language Models · cs.CL · arXiv 2605.30018 · score 11 — large language model, llm, reasoning
Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues · cs.CL · arXiv 2605.30051 · score 11 — large language model, llm, rag
CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild · cs.CL · arXiv 2605.30241 · score 11 — llm, retrieval, rag, inference
Implicit Identity Technologies for LLMs: Fingerprinting and Watermarking across Datasets, Models, and Generated Content · cs.CR · arXiv 2605.29245 · score 11 — large language model, llm, rag
Understanding the Ability of LLMs to Handle Character-Level Perturbation · cs.CL · arXiv 2510.14365 · score 11 — large language model, llm, rag
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models · cs.CL · arXiv 2512.00837 · score 11 — large language model, llm, rag
“Be My Cheese?”: Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs · cs.CL · arXiv 2602.04729 · score 11 — large language model, llm, rag
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing · cs.CL · arXiv 2603.17942 · score 11 — large language model, llm, throughput
HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation · cs.CL · arXiv 2604.09629 · score 11 — large language model, llm, fine-tun
When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance · cs.CL · arXiv 2605.22975 · score 11 — large language model, llm, rag
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench · cs.LG · arXiv 2601.20255 · score 11 — large language model, llm, fine-tun
On-Policy Replay for Continual Supervised Fine-Tuning · cs.LG · arXiv 2605.29495 · score 11 — large language model, llm, fine-tun
Convergence Theory for Iterative LLM-Based Neural Architecture Search: A Parametric Cross-Entropy Framework with Closed-Form Proxy Reliability · cs.LG · arXiv 2605.30103 · score 11 — large language model, llm, fine-tun
The Biosecurity Blind Spot: Systematic Dual-use Detection in Open Science Infrastructure · cs.DL · arXiv 2605.28843 · score 11 — large language model, llm, rag
Generative Spatiotemporal Intent Sequence Recommendation via Implicit Reasoning in Amap · cs.IR · arXiv 2605.28888 · score 11 — llm, reasoning, inference, latency
TabPFN-3: Technical Report · cs.LG · arXiv 2605.13986 · score 11 — llm, inference, kv cache
Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies · cs.AI · arXiv 2605.29270 · score 10 — llm, agent, retrieval
DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation · cs.AI · arXiv 2605.29522 · score 10 — agent, agentic, retrieval
Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents · cs.AI · arXiv 2605.30159 · score 10 — llm, agent, reasoning
Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents · cs.AI · arXiv 2605.30335 · score 10 — llm, agent, retrieval
No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand · cs.CL · arXiv 2605.28836 · score 10 — multi-agent, serving, attention
LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis · cs.SE · arXiv 2605.28876 · score 10 — llm, agent, rag
OISD: On-Policy Internal Self-Distillation of Language Models · cs.LG · arXiv 2605.29089 · score 10 — reasoning, serving, attention, post-train
unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning · cs.CR · arXiv 2605.29115 · score 10 — llm, agent, fine-tun
Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA · cs.SE · arXiv 2605.29277 · score 10 — llm, agent, reasoning
From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration · cs.HC · arXiv 2605.29675 · score 10 — agent, retrieval, ai system
CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation · cs.CL · arXiv 2605.29886 · score 10 — llm, retrieval, rag, reasoning
Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor? · cs.CL · arXiv 2605.30152 · score 10 — llm, agent, gpu
On Distributional Reinforcement Learning in Chaotic Dynamical Systems · cs.LG · arXiv 2605.30160 · score 10 — llm, multi-agent, rag
Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection · cs.CR · arXiv 2605.30189 · score 10 — llm, serving, fine-tun
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion · cs.CV · arXiv 2605.30351 · score 10 — kv cache, attention, throughput, latency
Graph-Enhanced Policy Optimization in LLM Agent Training · cs.AI · arXiv 2510.26270 · score 10 — llm, agent, rag
From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning · cs.AI · arXiv 2601.21909 · score 10 — llm, reasoning, fine-tun, post-train
SIA: Self Improving AI with Harness & Weight Updates · cs.AI · arXiv 2605.27276 · score 10 — agent, agentic, gpu
Scaling Small Agents Through Strategy Auctions · cs.MA · arXiv 2602.02751 · score 10 — agent, agentic, rag
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn · cs.CL · arXiv 2605.13511 · score 10 — llm, retrieval, reasoning, chain-of-thought
Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation · cs.CL · arXiv 2605.29007 · score 10 — llm, agent, reasoning
Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs · cs.CL · arXiv 2605.30021 · score 10 — llm, serving, post-train
HEART-Bench: Do LLM Agents Exhibit Human-like Psychology? · cs.CL · arXiv 2605.30058 · score 10 — llm, agent, reasoning
DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation · cs.CL · arXiv 2605.30090 · score 10 — llm, multi-agent, rag
When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer · cs.LG · arXiv 2605.29190 · score 10 — llm, reasoning, chain-of-thought, post-train
Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs · cs.CR · arXiv 2605.29737 · score 10 — llm, agent, rag
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding · cs.CL · arXiv 2512.17220 · score 10 — llm, retrieval, rag, reasoning
SEEK: Semantic Evidence Extraction via Adaptive ChunKing for Multilingual Fact-Checking · cs.CL · arXiv 2605.26755 · score 10 — llm, rag, serving
DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models · cs.IR · arXiv 2605.07210 · score 10 — llm, retrieval, attention, latency
Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting · cs.LG · arXiv 2605.29401 · score 10 — llm, reasoning, fine-tun, post-train
Harmless Yet Harmful: Neutral Prompting Attacks for Stealthy Hallucination Steering in Agent Skills · cs.CR · arXiv 2605.29354 · score 10 — llm, agent, rag
Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents · cs.LG · arXiv 2605.14241 · score 10 — llm, agent, latency
When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis · cs.AI · arXiv 2605.29025 · score 9 — large language model, llm
The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models · cs.AI · arXiv 2605.29123 · score 9 — reasoning, inference, serving
Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification · cs.AI · arXiv 2605.29556 · score 9 — large language model, llm
FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification · cs.AI · arXiv 2605.29586 · score 9 — large language model, llm
Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation · cs.AI · arXiv 2605.29652 · score 9 — large language model, llm
NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs · cs.AI · arXiv 2605.29685 · score 9 — large language model, llm
Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment · cs.AI · arXiv 2605.29930 · score 9 — rag, inference, ai system
Teaching Values to Machines: Simulating Human-Like Behavior in LLMs · cs.AI · arXiv 2605.30036 · score 9 — large language model, llm
Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers · cs.AI · arXiv 2605.30049 · score 9 — inference, serving, transformer
Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale · cs.AI · arXiv 2605.30200 · score 9 — large language model, llm
Demystifying Data Organization for Enhanced LLM Training · cs.AI · arXiv 2605.30334 · score 9 — large language model, llm
SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations · cs.AI · arXiv 2605.30345 · score 9 — large language model, llm
Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning · cs.CL · arXiv 2605.28829 · score 9 — large language model, reasoning, post-train
Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation · cs.CL · arXiv 2605.28830 · score 9 — large language model, llm
GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization · cs.CR · arXiv 2605.29107 · score 9 — large language model, llm
Toward User Preference Alignment in LLM Recommendation via Explicit Context Feedback · cs.IR · arXiv 2605.29141 · score 9 — large language model, llm
Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback · cs.LG · arXiv 2605.29184 · score 9 — large language model, llm
SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing · cs.CR · arXiv 2605.29468 · score 9 — large language model, llm
VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models · cs.RO · arXiv 2605.29562 · score 9 — retrieval, inference, serving
Predicting Causal Effects from Natural Language Queries using Structured Representations · cs.CL · arXiv 2605.29631 · score 9 — large language model, llm
OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning · cs.CV · arXiv 2605.29657 · score 9 — inference, serving, attention
Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models · cs.CL · arXiv 2605.29826 · score 9 — large language model, llm
Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering · cs.CV · arXiv 2605.29881 · score 9 — rag, inference, attention, throughput
How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency · cs.CR · arXiv 2605.30096 · score 9 — large language model, llm
PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding · cs.CV · arXiv 2605.30126 · score 9 — rag, inference, serving
How LoRA Remembers? A Parametric Memory Law for LLM Finetuning · cs.CL · arXiv 2605.30260 · score 9 — large language model, llm
LLMSurgeon: Diagnosing Data Mixture of Large Language Models · cs.CL · arXiv 2605.30348 · score 9 — large language model, llm
Estimating the Empowerment of Language Model Agents · cs.AI · arXiv 2509.22504 · score 9 — agent, tool-use, rag
Benchmarking at the Edge of Comprehension · cs.AI · arXiv 2602.14307 · score 9 — large language model, llm
SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems · cs.AI · arXiv 2603.23853 · score 9 — reasoning, inference, ai system
Automatic Layer Selection for Hallucination Detection · cs.AI · arXiv 2605.26366 · score 9 — large language model, llm
Less Is More: Elevating RAG via Performance-Driven Context Compression · cs.CL · arXiv 2508.19282 · score 9 — large language model, retrieval, rag
Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations · cs.HC · arXiv 2510.20743 · score 9 — large language model, llm
CORE-T: COherent REtrieval of Tables for Text-to-SQL · cs.CL · arXiv 2601.13111 · score 9 — llm, retrieval, inference
Pushing the Limits of Block Rotations in Post-Training Quantization · cs.LG · arXiv 2601.22347 · score 9 — inference, quantization, transformer, post-train
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating · cs.CV · arXiv 2605.11723 · score 9 — reasoning, chain-of-thought, inference, fine-tun
Reducing Political Manipulation with Consistency Training · cs.CL · arXiv 2605.22771 · score 9 — large language model, llm
Large language models reorganize representational geometry during in-context learning · cs.CL · arXiv 2605.28854 · score 9 — large language model, llm
User-Aware Active Knowledge Acquisition for Emotional Support Dialogue · cs.CL · arXiv 2605.29715 · score 9 — large language model, rag, reasoning
AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation · cs.CL · arXiv 2605.29741 · score 9 — large language model, rag, fine-tun
DySem: Uncovering Dynamic Semantic Components via Multilingual Consensus for Calculating Semantic Textual Similarity · cs.CL · arXiv 2605.29751 · score 9 — large language model, llm
EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation · cs.CL · arXiv 2605.29847 · score 9 — large language model, llm
Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model · cs.CL · arXiv 2605.30080 · score 9 — large language model, llm
Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees · cs.CL · arXiv 2410.15761 · score 9 — large language model, llm
HaluNet: Learning Hallucination Risk from Internal Signals in LLM Question Answering · cs.CL · arXiv 2512.24562 · score 9 — large language model, llm
SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts · cs.CL · arXiv 2604.26506 · score 9 — large language model, llm
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance · cs.CV · arXiv 2411.14279 · score 9 — llm, inference, attention
Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models · cs.LG · arXiv 2605.28896 · score 9 — large language model, transformer, fine-tun
MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference · cs.LG · arXiv 2605.30218 · score 9 — llm, inference, latency
Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models · cs.LG · arXiv 2602.10520 · score 9 — llm, reasoning, inference
Enhancing LLM Training via Spectral Clipping · cs.LG · arXiv 2603.14315 · score 9 — large language model, llm
Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance · cs.LG · arXiv 2605.00553 · score 9 — large language model, llm
PRIM: Meta-Learned Bayesian Root Cause Analysis · cs.LG · arXiv 2605.08786 · score 9 — rag, inference, transformer, fine-tun
PACE: Geometry-Aware Bridge Transport for Single-Cell Trajectory Inference · cs.LG · arXiv 2605.18587 · score 9 — rag, inference, serving
Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory · cs.PF · arXiv 2605.29135 · score 9 — large language model, gpu, throughput
ReasonOps: Operator Segmentation for LLM Reasoning Traces · cs.AI · arXiv 2605.29192 · score 8 — llm, reasoning, chain-of-thought
BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents · cs.AI · arXiv 2605.29225 · score 8 — llm, agent
DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning · cs.AI · arXiv 2605.29568 · score 8 — llm, rag, reasoning
GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation · cs.AI · arXiv 2605.29578 · score 8 — llm, serving
PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game? · cs.AI · arXiv 2605.29653 · score 8 — llm, agent
GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents · cs.AI · arXiv 2605.29668 · score 8 — llm, agent
Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling · cs.AI · arXiv 2605.29697 · score 8 — agent, agentic
Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations · cs.AI · arXiv 2605.29786 · score 8 — llm, agent
SkillsInjector: Dynamic Skill Context Construction for LLM Agents · cs.AI · arXiv 2605.29794 · score 8 — llm, agent
MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains · cs.AI · arXiv 2605.29795 · score 8 — agent, retrieval, rag
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security · cs.AI · arXiv 2605.29801 · score 8 — agent, agentic
OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields · cs.AI · arXiv 2605.29833 · score 8 — llm, retrieval, reasoning
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation · cs.AI · arXiv 2605.30000 · score 8 — llm, agent
Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization · cs.CL · arXiv 2605.28969 · score 8 — llm, agent
Real-rootedness of the Poincaré polynomials of $\overline{\mathcal M}_{0,n}$: an AI-assisted proof · math.AG · arXiv 2605.29151 · score 8 — agent, agentic
OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources · cs.CL · arXiv 2605.29250 · score 8 — retrieval, rag, serving
Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles · cs.HC · arXiv 2605.29473 · score 8 — llm, retrieval, rag
GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing · cs.SE · arXiv 2605.29532 · score 8 — llm, agent
Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory · cs.CL · arXiv 2605.29630 · score 8 — agent, retrieval, rag
Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents · cs.CL · arXiv 2605.29927 · score 8 — llm, agent
BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models · cs.RO · arXiv 2605.30226 · score 8 — rag, serving, post-train
Gram: Assessing sabotage propensities via automated alignment auditing · cs.LG · arXiv 2605.30322 · score 8 — agent, agentic
SafeSearch: Automated Red-Teaming of LLM-Based Search Agents · cs.AI · arXiv 2509.23694 · score 8 — llm, agent
TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis · cs.AI · arXiv 2510.06063 · score 8 — rag, reasoning, serving
Causal-JEPA: Learning World Models through Object-Level Latent Masking · cs.AI · arXiv 2602.11389 · score 8 — agent, rag, reasoning
ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology · cs.AI · arXiv 2605.24399 · score 8 — reasoning, mixture of experts, moe
CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists · cs.AI · arXiv 2605.26029 · score 8 — llm, agent
GRPO is Secretly a Process Reward Model · cs.LG · arXiv 2509.21154 · score 8 — llm, rag, reasoning
ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing · cs.LG · arXiv 2511.14584 · score 8 — llm, agent
Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning · cs.LG · arXiv 2602.01058 · score 8 — llm, reasoning, post-train
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover · cs.LG · arXiv 2603.11331 · score 8 — large language model, inference
EvA: An Evidence-First Audio Understanding Paradigm for LALMs · cs.SD · arXiv 2603.27667 · score 8 — rag, reasoning, serving
Graph Memory Transformer (GMT) · cs.LG · arXiv 2604.23862 · score 8 — serving, attention, transformer
When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks · cs.CL · arXiv 2604.27272 · score 8 — llm, serving
Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning · cs.LG · arXiv 2605.07804 · score 8 — rag, reasoning, serving
KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition · cs.CR · arXiv 2605.25376 · score 8 — agent, multi-agent
Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges · cs.CR · arXiv 2605.26156 · score 8 — llm, serving
FoRA: Fisher-orthogonal Rank Adaptation for Parameter-Efficient Fine-Tuning · cs.CL · arXiv 2605.29317 · score 8 — serving, attention, fine-tun
On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training · cs.CL · arXiv 2605.29496 · score 8 — reasoning, chain-of-thought, fine-tun, post-train
GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering · cs.CL · arXiv 2605.29584 · score 8 — agent, agentic
Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering · cs.CL · arXiv 2605.29648 · score 8 — llm, rag, reasoning
HTAM: Hierarchical Transition-Attended Memory for Operator Optimization · cs.CL · arXiv 2605.29734 · score 8 — llm, gpu, cuda
GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German · cs.CL · arXiv 2605.30214 · score 8 — llm, rag, reasoning
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion · cs.CV · arXiv 2605.30265 · score 8 — rag, reasoning, serving
Procedural Pretraining: Warming Up Language Models with Abstract Data · cs.CL · arXiv 2601.21725 · score 8 — llm, reasoning, attention
Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents · cs.CL · arXiv 2605.28108 · score 8 — llm, agent
ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents · cs.MA · arXiv 2604.07789 · score 8 — agent, agentic
When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL · cs.LG · arXiv 2605.28918 · score 8 — llm, agent
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting · cs.LG · arXiv 2605.29727 · score 8 — speculative decoding, gpu, latency
Adapting Automotive Aerodynamics Surrogates to New Vehicle Families via Transfer Learning · cs.CE · arXiv 2605.27968 · score 8 — serving, transformer, fine-tun
Anytime-Valid Federated Conformal RAG for LLM Swarms · stat.ML · arXiv 2605.29139 · score 8 — llm, retrieval, rag
Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning · cs.LG · arXiv 2506.05985 · score 8 — agent, rag, fine-tun
SADA: Safe and Adaptive Aggregation of Multiple Black-Box Predictions in Semi-Supervised Learning · stat.ML · arXiv 2509.21707 · score 8 — large language model, inference
A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments · cs.LG · arXiv 2512.13517 · score 8 — agent, rag, reasoning
Ciphera: A Decentralised Biometric Identity Framework · cs.CR · arXiv 2605.29868 · score 8 — rag, serving, latency
UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents · cs.AI · arXiv 2605.29534 · score 7 — agent, inference
MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization · cs.AI · arXiv 2605.29951 · score 7 — rag, reasoning, inference
LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation · cs.LG · arXiv 2605.29280 · score 7 — inference, serving
Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset · cs.CV · arXiv 2605.29462 · score 7 — rag, reasoning, inference
DLM-SWAI: Steering Diffusion Language Models Before They Unmask · cs.CL · arXiv 2605.29626 · score 7 — inference, serving
ESPO: Early-Stopping Proximal Policy Optimization · cs.LG · arXiv 2605.29860 · score 7 — large language model, reasoning
Unlocking the Working Memory of Large Language Models for Latent Reasoning · cs.CL · arXiv 2605.30343 · score 7 — large language model, reasoning
Modeling Hierarchical Thinking in Large Reasoning Models · cs.AI · arXiv 2510.22437 · score 7 — reasoning, chain-of-thought, inference
Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models · cs.LG · arXiv 2601.01162 · score 7 — large language model, rag
Steering Language Models Before They Speak: Logit-Level Interventions · cs.CL · arXiv 2601.10960 · score 7 — inference, serving
From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons · cs.CL · arXiv 2605.27387 · score 7 — large language model, attention
LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English · cs.CL · arXiv 2605.29048 · score 7 — llm, inference
Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models · cs.CL · arXiv 2605.29459 · score 7 — large language model, attention
Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation · cs.CL · arXiv 2605.29714 · score 7 — rag, moe, fine-tun
Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations · cs.CL · arXiv 2601.08064 · score 7 — large language model, rag
Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning · cs.CL · arXiv 2602.05370 · score 7 — large language model, reasoning
Slide Deck Q&A Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation · cs.CL · arXiv 2605.26428 · score 7 — large language model, rag
When the Same Coefficients Reach Different Places: Asymmetric Realizability in Transplanting Tokenizers across Large Language Models · cs.LG · arXiv 2601.00065 · score 7 — large language model, fine-tun
NeuroEdge: Real-Time Hand Gesture Recognition with High-Density EMG Using Deep Learning at the Edge · cs.LG · arXiv 2605.29326 · score 7 — rag, inference, latency
Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics · cs.LG · arXiv 2605.29351 · score 7 — inference, attention, transformer
AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference · cs.LG · arXiv 2605.29535 · score 7 — llm, inference
STAP: A Shuffle-Tokenized App Predictor with Ultra Long Context for Vocabulary-Free Mobile App Prediction · cs.LG · arXiv 2605.29863 · score 7 — inference, transformer, latency
CLUBench: A Clustering Benchmark · cs.LG · arXiv 2605.29933 · score 7 — large language model, rag
Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables · cs.LG · arXiv 2605.30229 · score 7 — inference, attention, transformer
Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research · stat.ML · arXiv 2605.29249 · score 7 — inference, serving
FPLIER: Federated Pathway-Level Information Extractor · cs.LG · arXiv 2605.29587 · score 7 — inference, distributed training
AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training · cs.DC · arXiv 2605.29664 · score 7 — serving, parallelism
Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion Control · cs.RO · arXiv 2605.29937 · score 7 — inference, serving
SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation · cs.CV · arXiv 2605.30116 · score 7 — inference, serving
DiScoFormer: Plug-In Density and Score Estimation with Transformers · cs.LG · arXiv 2511.05924 · score 7 — inference, attention, transformer
Learning to Solve PDEs on Neural Shape Representations · cs.LG · arXiv 2512.21311 · score 7 — inference, serving
Transformed Latent Variable Multi-Output Gaussian Processes · cs.LG · arXiv 2605.05133 · score 7 — inference, serving
CompilerDream: Learning a Compiler World Model for General Code Optimization · cs.PL · arXiv 2404.16077 · score 7 — agent, compiler
Understanding and Reducing Metadata-Driven Host Overheads in Sampling-Based GNN Training · cs.DC · arXiv 2605.29346 · score 7 — parallelism, gpu, cuda
BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation · cs.AI · arXiv 2605.28994 · score 6 — llm, reasoning
Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents · cs.AI · arXiv 2605.29174 · score 6 — agent, rag
Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth · cs.AI · arXiv 2605.29234 · score 6 — llm, retrieval
OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories · cs.AI · arXiv 2605.29253 · score 6 — agent, fine-tun
Xetrieval: Mechanistically Explaining Dense Retrieval · cs.AI · arXiv 2605.29507 · score 6 — retrieval, reasoning, chain-of-thought
Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management · cs.AI · arXiv 2605.29733 · score 6 — rag, transformer, fine-tun
Accelerating Constrained Decoding with Token Space Compression · cs.AI · arXiv 2605.29986 · score 6 — llm, latency
Conformal Certification of Reasoning Trace Prefixes · cs.AI · arXiv 2605.30085 · score 6 — reasoning, serving
VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing · cs.AI · arXiv 2605.30117 · score 6 — serving, attention
MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection · cs.AI · arXiv 2605.30288 · score 6 — llm, post-train
Specialty-Specific Medical Language Model for Immune-Mediated Diseases · cs.CL · arXiv 2605.28838 · score 6 — llm, transformer
PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation · cs.LG · arXiv 2605.28867 · score 6 — rag, serving
AIRGuard: Guarding Agent Actions with Runtime Authority Control · cs.CR · arXiv 2605.28914 · score 6 — agent, reasoning
MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs · cs.CL · arXiv 2605.29300 · score 6 — llm, fine-tun
TRACER: Persistent Regularization for Robust Multimodal Finetuning · cs.LG · arXiv 2605.29380 · score 6 — rag, serving
Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference · cs.LG · arXiv 2605.29467 · score 6 — inference, mixture of experts
PhoneWorld: Scaling Phone-Use Agent Environments · cs.CL · arXiv 2605.29486 · score 6 — agent, rag
Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization · cs.LG · arXiv 2605.29547 · score 6 — serving, quantization
COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings · cs.SD · arXiv 2605.29628 · score 6 — retrieval, serving
Personalized Turn-Level User Conversation Satisfaction Benchmark · cs.CL · arXiv 2605.29711 · score 6 — llm, retrieval
Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions · cs.CL · arXiv 2605.29738 · score 6 — llm, reasoning
A unified deeplearning framework for contrast-phase-specific virtual monochromatic imaging · eess.IV · arXiv 2605.29753 · score 6 — rag, serving
Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs · cs.HC · arXiv 2605.29928 · score 6 — llm, reasoning
Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders · cs.CL · arXiv 2605.30022 · score 6 — retrieval, attention, transformer
Do Language Models Track Entities Across State Changes? · cs.CL · arXiv 2605.30233 · score 6 — rag, reasoning, transformer
Reinforcement Learning with Robust Rubric Rewards · cs.CV · arXiv 2605.30244 · score 6 — llm, reasoning
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models · cs.AI · arXiv 2604.10219 · score 6 — rag, reasoning, attention
Human-Guided Harm Recovery for Computer Use Agents · cs.AI · arXiv 2604.18847 · score 6 — agent, rag
Dataset-Driven Channel Masks in Transformers for Multivariate Time Series · cs.LG · arXiv 2410.23222 · score 6 — rag, attention, transformer
Obfuscation Rules for Detecting and Detoxifying Korean Toxicity · cs.CL · arXiv 2510.10961 · score 6 — llm, attention
Topological Order in Neural Wavefunctions · cs.AI · arXiv 2512.01863 · score 6 — llm, attention
The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation · cs.IR · arXiv 2512.10388 · score 6 — serving, quantization
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought · cs.CL · arXiv 2603.05488 · score 6 — reasoning, chain-of-thought, attention
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems · cs.CR · arXiv 2604.06811 · score 6 — agent, rag
Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models · eess.SY · arXiv 2604.17176 · score 6 — reasoning, serving
SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction · cs.CL · arXiv 2605.23440 · score 6 — llm, rag
The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs · cs.HC · arXiv 2605.27382 · score 6 — llm, rlhf
From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale · cs.CL · arXiv 2605.28826 · score 6 — llm, rlhf
Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches · cs.CL · arXiv 2605.29188 · score 6 — llm, rag
LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents · cs.CL · arXiv 2605.29559 · score 6 — agent, fine-tun
A Dual-Path Architecture for Scaling Compute and Capacity in LLMs · cs.CL · arXiv 2605.30202 · score 6 — llm, transformer
COMPOSE: Composing Future Theorems from Citations and Formal Structure · cs.CL · arXiv 2605.30333 · score 6 — llm, retrieval
RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains · cs.LG · arXiv 2605.29156 · score 6 — llm, post-train
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents · cs.CV · arXiv 2605.29447 · score 6 — agent, fine-tun
GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases · cs.IR · arXiv 2605.30237 · score 6 — retrieval, rag, fine-tun
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning · cs.CL · arXiv 2508.19202 · score 6 — llm, reasoning
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context · cs.CL · arXiv 2510.06182 · score 6 — retrieval, rag, reasoning
MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark · cs.CL · arXiv 2601.04633 · score 6 — rag, reasoning, fine-tun
Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction · cs.CL · arXiv 2601.18395 · score 6 — llm, reasoning
Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs · cs.CL · arXiv 2603.27518 · score 6 — llm, transformer
TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script) · cs.CL · arXiv 2605.04583 · score 6 — rag, serving
Beyond Transcripts: A Renewed Perspective on Audio Chaptering · cs.SD · arXiv 2602.08979 · score 6 — llm, rag
FedQHD: Closed-Form Function-Space Federated Reinforcement Learning · cs.LG · arXiv 2605.29002 · score 6 — agent, rag
Apertus LLM Family Expansion via Distillation and Quantization · cs.LG · arXiv 2605.29128 · score 6 — llm, quantization
MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding · cs.LG · arXiv 2605.29850 · score 6 — rag, attention, transformer
Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching · cs.LG · arXiv 2605.30337 · score 6 — llm, retrieval
An End-to-End PyTorch Interface for Differentiable PDE Solvers: A RANS Model-Correction Study · cs.CE · arXiv 2605.28858 · score 6 — llm, rag
Offline Multi-agent Reinforcement Learning via Sequential Score Decomposition · cs.LG · arXiv 2505.05968 · score 6 — multi-agent, rag
In-Place Feedback: Reliable Refinement for Multi-Turn Expert-LLM Collaboration · cs.LG · arXiv 2510.00777 · score 6 — llm, reasoning
Optimization and Generation in Aerodynamics Inverse Design · cs.LG · arXiv 2602.03582 · score 6 — rag, serving
Advancing multi-site emission control: A physics-informed transfer learning framework with mixture of experts for carbon-pollutant synergy · cs.LG · arXiv 2604.26571 · score 6 — mixture of experts, moe
SMolLM: Small Language Models Learn Small Molecular Grammar · cs.LG · arXiv 2605.06322 · score 6 — llm, transformer
TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection · cs.LG · arXiv 2605.08870 · score 6 — rag, serving
Faster Molecular Dynamics with Neural Network Potentials via Distilled Multiple Time-Stepping and Non-Conservative Forces · cs.LG · arXiv 2602.14975 · score 6 — serving, fine-tun
LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol · eess.IV · arXiv 2603.14644 · score 6 — serving, transformer
IORM: Hierarchical I/O Governance for Thousands of Consolidated Databases on Oracle Exadata · cs.DB · arXiv 2605.29006 · score 6 — rag, scheduler, latency
Trends in AI and Human-AI Interaction in Clinical Trials – A Hybrid Human-AI Exploration · cs.AI · arXiv 2605.29096 · score 5 — large language model
Context Distillation as Latent Memory Management · cs.LG · arXiv 2605.28889 · score 5 — retrieval, inference
The Hamilton-Jacobi Theory of Deep Learning · cs.LG · arXiv 2605.28983 · score 5 — inference, transformer
GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection · cs.CV · arXiv 2605.29539 · score 5 — inference, fine-tun
EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL · cs.CL · arXiv 2605.29670 · score 5 — rag, inference
CB-SLICE: Concept-Based Interpretable Error Slice Discovery · cs.LG · arXiv 2605.29836 · score 5 — rag, inference
Improved Guarantees for Heterogeneous Treatment-Effect Estimation via Matrix Completion · stat.ML · arXiv 2605.30319 · score 5 — rag, inference
You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention · cs.AI · arXiv 2605.27580 · score 5 — inference, attention
Relational In-Context Learning via Synthetic Pre-training with Structural Prior · cs.LG · arXiv 2603.03805 · score 5 — reasoning, inference
Rethinking Stepwise Model Routing: A Cost-Efficient Table Reasoning Perspective · cs.CL · arXiv 2605.29319 · score 5 — reasoning, inference
Evaluating Cross-lingual Knowledge Consistency in Code-Mixed vis-a-vis Indian Languages using IndicKLAR · cs.CL · arXiv 2605.29637 · score 5 — large language model
ExCAM: Explainable Cultural Awareness Metrics · cs.CL · arXiv 2605.29897 · score 5 — large language model
Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning · cs.CL · arXiv 2605.29971 · score 5 — large language model
Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization · cs.CL · arXiv 2604.13197 · score 5 — reasoning, inference
Moment Matching Q-Learning · cs.LG · arXiv 2605.29033 · score 5 — inference, latency
Deep Adaptive Dimension Reduction for Bayesian Inference in Inverse Problems · cs.LG · arXiv 2605.29373 · score 5 — inference, fine-tun
A Full-Pipeline Framework for Evaluating Membership Inference Attacks in Machine Learning · cs.LG · arXiv 2605.29454 · score 5 — inference, post-train
A Geometric View of SRC: Learning Representations for Stable Residual Inference · cs.LG · arXiv 2605.29673 · score 5 — rag, inference
CRB-Guided Framework Design and Resource Allocation for Indoor mmWave ISCC Systems · cs.IT · arXiv 2605.29939 · score 5 — inference, latency
TraceCodec: A Compiler-Backed Neural Codec for Stateful Multi-Flow Network Traffic Traces · cs.NI · arXiv 2605.29941 · score 5 — rag, compiler
Leave a Window Out: Modifying the Jackknife for Predictive Inference in Time Series · stat.ML · arXiv 2605.30292 · score 5 — rag, inference
KAN-AD: Time Series Anomaly Detection with Kolmogorov-Arnold Networks · cs.LG · arXiv 2411.00278 · score 5 — rag, inference
Diffusion-based learning framework for Constrained Nonconvex Optimization with Weighted Bootstrapped Refinement · cs.LG · arXiv 2502.10330 · score 5 — rag, inference
Solved in Unit Domain: JacobiNet for Differentiable Coordinate-Transformed PINNs · cs.LG · arXiv 2508.02537 · score 5 — rag, inference
Routing by Reaching: Composition of Pre-trained GFlowNets for Multi-Objective Generation · cs.LG · arXiv 2602.21565 · score 5 — inference, fine-tun
Accelerating trajectory optimization with Sobolev-trained diffusion policies · cs.LG · arXiv 2604.19011 · score 5 — inference, latency
Order-Agnostic Autoregressive Modelling with Missing Data · cs.LG · arXiv 2605.06355 · score 5 — rag, inference
Stage-wise Distortion-Perception Traversal in Zero-shot Inverse Problems with Diffusion Models · cs.LG · arXiv 2605.28711 · score 5 — rag, inference
Noise-Aware Differentially Private Variational Inference · stat.ML · arXiv 2410.19371 · score 5 — rag, inference
MEC: Machine-Learning-Assisted Generalized Entropy Calibration for Semi-Supervised Mean Estimation · stat.ML · arXiv 2604.05446 · score 5 — rag, inference
CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation · cs.RO · arXiv 2605.22082 · score 5 — inference, transformer
Stop Suppressing the Tail: Causal Inference for Extreme Events · stat.ML · arXiv 2605.27474 · score 5 — rag, inference
Rapid GPU-Based Pangenome Graph Layout · cs.DC · arXiv 2409.00876 · score 5 — parallelism, gpu
Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction · cs.AI · arXiv 2605.28849 · score 4 — llm
Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction · cs.AI · arXiv 2605.28855 · score 4 — llm
The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling · cs.AI · arXiv 2605.28864 · score 4 — transformer, fine-tun
Review Arcade: On the Human Alignment and Gameability of LLM Reviews · cs.AI · arXiv 2605.28897 · score 4 — llm
Orthogonal Concept Erasure for Diffusion Models · cs.AI · arXiv 2605.28902 · score 4 — serving
Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild · cs.AI · arXiv 2605.29018 · score 4 — llm
Differentiable Belief-based Opponent Shaping · cs.AI · arXiv 2605.29042 · score 4 — multi-agent
The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure · cs.AI · arXiv 2605.29087 · score 4 — reasoning, chain-of-thought
PRO-CUA: Process-Reward Optimization for Computer Use Agents · cs.AI · arXiv 2605.29119 · score 4 — agent
Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark · cs.AI · arXiv 2605.29400 · score 4 — reasoning, fine-tun
ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control · cs.AI · arXiv 2605.29425 · score 4 — serving
CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials · cs.AI · arXiv 2605.29446 · score 4 — rag, reasoning
HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering · cs.AI · arXiv 2605.29606 · score 4 — retrieval, rag
Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures · cs.AI · arXiv 2605.29629 · score 4 — llm
Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models · cs.AI · arXiv 2605.29754 · score 4 — transformer, fine-tun
Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk · cs.AI · arXiv 2605.29788 · score 4 — agent
RAISE: RAG Design as an Architecture Search Problem · cs.AI · arXiv 2605.30029 · score 4 — retrieval, rag
BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders · cs.AI · arXiv 2605.30162 · score 4 — rag, fine-tun
Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit · cs.AI · arXiv 2605.30207 · score 4 — retrieval, rag
Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection · cs.AI · arXiv 2605.30344 · score 4 — reasoning, fine-tun
Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software · cs.AI · arXiv 2605.30353 · score 4 — agent
Self-Play Reinforcement Learning under Imperfect Information in Big 2 · cs.LG · arXiv 2605.28863 · score 4 — agent
Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision · cs.LG · arXiv 2605.28865 · score 4 — agent
TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models · cs.LG · arXiv 2605.28868 · score 4 — retrieval, rag
Representation Alignment Rests on Linear Structure · cs.LG · arXiv 2605.28870 · score 4 — llm
Quantum-Enhanced Adversarial Robustness in Artificial Intelligence · cs.CR · arXiv 2605.28899 · score 4 — ai system
Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening · cs.CR · arXiv 2605.28999 · score 4 — llm
Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning · cs.LG · arXiv 2605.29028 · score 4 — rag, fine-tun
Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG · cs.CL · arXiv 2605.29084 · score 4 — retrieval, rag
When and How Long? The Readout-Mediator Angle in Temporal Reasoning · cs.LG · arXiv 2605.29126 · score 4 — reasoning, attention
Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach · cs.LG · arXiv 2605.29161 · score 4 — serving
Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits · cs.CL · arXiv 2605.29268 · score 4 — llm
Does Distributed Training Undermine Compute Governance? · cs.CY · arXiv 2605.29359 · score 4 — distributed training
Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies · cs.IR · arXiv 2605.29384 · score 4 — retrieval, rag
On the Optimizer Dependence of Neural Scaling Laws · cs.LG · arXiv 2605.29387 · score 4 — llm
How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions · cs.SE · arXiv 2605.29442 · score 4 — agent
Honest Lying: Understanding Memory Confabulation in Reflexive Agents · cs.LG · arXiv 2605.29463 · score 4 — agent
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling · cs.CV · arXiv 2605.29488 · score 4 — rag, transformer
Brain-IT-VQA: From Brain Signals to Answers · cs.CV · arXiv 2605.29588 · score 4 — rag, transformer
Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation · cs.CV · arXiv 2605.29773 · score 4 — serving
Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions · eess.AS · arXiv 2605.29862 · score 4 — serving
Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate · cs.CL · arXiv 2605.29889 · score 4 — llm
Genetically Aligned Patient Representations Improve Hematological Diagnosis · cs.CV · arXiv 2605.29980 · score 4 — retrieval, transformer
Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation · cs.SD · arXiv 2605.30031 · score 4 — reasoning, latency
Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models · cs.LG · arXiv 2605.30038 · score 4 — fine-tun, post-train
REPOT: Recoverable Program-of-Thought via Checkpoint Repair · cs.SE · arXiv 2605.30052 · score 4 — llm
xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR · cs.CV · arXiv 2605.30111 · score 4 — retrieval, rag
iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis · cs.LG · arXiv 2605.30179 · score 4 — llm
PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions · cs.CV · arXiv 2605.30268 · score 4 — agent
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments · cs.RO · arXiv 2605.30280 · score 4 — rag, reasoning
Archon: A Unified Multimodal Model for Holistic Digital Human Generation · cs.CV · arXiv 2605.30311 · score 4 — serving
Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes · cs.GR · arXiv 2605.30318 · score 4 — llm
Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure · cs.AI · arXiv 2602.08783 · score 4 — reasoning, chain-of-thought
FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse Autoformalization · cs.AI · arXiv 2603.19828 · score 4 — llm
When Models Learn to Ask Why: Adaptive Causal Reasoning for Trustworthy Medical Vision-Language Models · cs.AI · arXiv 2603.23085 · score 4 — reasoning, chain-of-thought
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning · cs.AI · arXiv 2604.10228 · score 4 — reasoning, fine-tun
Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents · cs.AI · arXiv 2604.11088 · score 4 — agent
NOVA: Fundamental Limits of Knowledge Discovery Through AI · cs.AI · arXiv 2605.15219 · score 4 — ai system
AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence · cs.AI · arXiv 2605.21739 · score 4 — llm
MATNet: Multi-Level Fusion Transformer-Based Model for Day-Ahead PV Generation Forecasting · cs.LG · arXiv 2306.10356 · score 4 — attention, transformer
Crafting Desirable Climate Trajectories with RL Explored Socio-Environmental Simulations · cs.AI · arXiv 2410.07287 · score 4 — agent
VRAG: Learning World Models for Interactive Video Generation · cs.CV · arXiv 2505.21996 · score 4 — retrieval, rag
Online Fair Division with Additional Information · cs.GT · arXiv 2505.24503 · score 4 — agent
Position: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning · cs.CL · arXiv 2506.08354 · score 4 — rag, reasoning
Finding DoRI: Discovery of Retained Images in Diffusion Models · cs.CV · arXiv 2507.16880 · score 4 — rag, fine-tun
Scalable RF Simulation in Generative 4D Worlds · cs.CV · arXiv 2508.12176 · score 4 — serving
Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy · cs.LG · arXiv 2509.21190 · score 4 — rag, transformer
ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling · cs.RO · arXiv 2511.04758 · score 4 — rag, gpu
Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom · cs.LG · arXiv 2511.11703 · score 4 — agent
Revisiting the Reliability of Language Models in Instruction-Following · cs.SE · arXiv 2512.14754 · score 4 — llm
HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens · cs.CE · arXiv 2512.15133 · score 4 — quantization, fine-tun
NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning · cs.LG · arXiv 2601.19947 · score 4 — serving
Learn from A Rationalist: Distilling Intermediate Interpretable Rationales · cs.LG · arXiv 2601.22531 · score 4 — attention, transformer
AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing · cs.CL · arXiv 2603.23069 · score 4 — serving
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation · cs.SE · arXiv 2605.12925 · score 4 — agent
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents · cs.SD · arXiv 2605.13841 · score 4 — agent
Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate · cs.LG · arXiv 2605.25134 · score 4 — serving
QuITE: Query-Based Irregular Time Series Embedding · cs.LG · arXiv 2605.28166 · score 4 — rag, attention
What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs · cs.CL · arXiv 2605.28823 · score 4 — llm
A Modular Architecture for Typologically Controlled Lexicon Generation · cs.CL · arXiv 2605.28824 · score 4 — llm
Reasoning that Travels: Dissecting How Chain-of-Thought Transfers Across Models · cs.CL · arXiv 2605.28913 · score 4 — reasoning, chain-of-thought
Learnable Assessment Skills for LLM-based Automated Scoring: Rubric Construction via Iterative Optimization · cs.CL · arXiv 2605.29274 · score 4 — llm
Accommodation Goes Both Ways: Studying Linguistic Convergence Between Humans and Language Models · cs.CL · arXiv 2605.29278 · score 4 — llm
STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments · cs.CL · arXiv 2605.29324 · score 4 — agent
A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities · cs.CL · arXiv 2605.29340 · score 4 — llm
BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base · cs.CL · arXiv 2605.29379 · score 4 — serving
Scaling Laws for Agent Harnesses via Effective Feedback Compute · cs.CL · arXiv 2605.29682 · score 4 — agent
Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking · cs.CL · arXiv 2605.30107 · score 4 — retrieval, rag
CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution · cs.CL · arXiv 2605.30133 · score 4 — llm
Resolution Diagnostics for Paired LLM Evaluation · cs.CL · arXiv 2605.30315 · score 4 — llm
Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence · cs.SE · arXiv 2605.29054 · score 4 — agent
Offloading Score: Measuring AI Reliance Through Counterfactual Workflows · cs.SE · arXiv 2605.29392 · score 4 — agent
DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces? · cs.CV · arXiv 2605.29615 · score 4 — agent
How’s it going? Reinforcement learning in language models recruits a functional welfare axis · cs.LG · arXiv 2605.30232 · score 4 — fine-tun, post-train
VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents · cs.CV · arXiv 2605.30256 · score 4 — agent
Interactive In-Meeting Speaker Correction with Human Feedback · cs.CL · arXiv 2509.18377 · score 4 — llm
The Anatomy of Conversational Scams: A Topic-Based Red Teaming Analysis of Multi-Turn Interactions in LLMs · cs.CL · arXiv 2601.03134 · score 4 — llm
One Mask to Rule Them All: On Hidden Facts after Editing and How to Find Them · cs.LG · arXiv 2605.28839 · score 4 — attention, transformer
Learning Robust and Task-Invariant Functional Representation from fMRI through Siamese Self-Supervised Learning · cs.LG · arXiv 2605.28990 · score 4 — rag, fine-tun
Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning · cs.LG · arXiv 2605.29032 · score 4 — agent
Parallel Adaptive Multi-Objective Evolutionary Learning of Discretized Bayesian Network Classifiers for Clinical Data · cs.LG · arXiv 2605.29058 · score 4 — serving
Knowledge Offloading: Decomposing LLMs into Sparse Backbones and Memory Modules · cs.LG · arXiv 2605.29075 · score 4 — llm
Solving Integer Linear Programming with Parallel Tempering · cs.LG · arXiv 2605.29366 · score 4 — serving
Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging · cs.LG · arXiv 2605.29489 · score 4 — llm
Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models · cs.LG · arXiv 2605.29607 · score 4 — reasoning, attention
MōLe-Λ: Learning the Coupled-Cluster Response State for Energies, Gradients, and Properties · cs.LG · arXiv 2605.29622 · score 4 — serving
Relational Rank Geometry in Transformers: Detecting and Steering Hidden-State Relation Frames · cs.LG · arXiv 2605.29634 · score 4 — attention, transformer
Momentum Based Reward Design for Low Emission Traffic Signal Control · cs.LG · arXiv 2605.29693 · score 4 — rag, throughput
MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion · cs.LG · arXiv 2605.29765 · score 4 — llm
Open Problem: Separating Geometric and Algorithmic Compression via Cayley-Table Completion · cs.LG · arXiv 2605.29885 · score 4 — serving
Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation · cs.LG · arXiv 2605.29911 · score 4 — serving
A Fully Convolutional Approach to Denoising Structural Dynamics Data from X-Ray Photon Correlation Spectroscopy · cs.LG · arXiv 2605.29975 · score 4 — serving
Improving Adversarial Robustness of Attribution via Implicit Regularization · cs.LG · arXiv 2605.29983 · score 4 — attention, transformer
RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood · cs.LG · arXiv 2605.30154 · score 4 — serving
Mean-Field Diffuser: Scaling Offline MARL to Thousands of Agents · cs.LG · arXiv 2605.30190 · score 4 — agent
Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories · cs.LG · arXiv 2605.30275 · score 4 — attention, transformer
Towards a Foundation Model for the Martian Atmosphere · cs.LG · arXiv 2605.28851 · score 4 — retrieval, rag
Eulerian Gaussian Splatting using Hashed Probability Pyramids · cs.CV · arXiv 2605.29136 · score 4 — serving
Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion · cs.SD · arXiv 2605.29531 · score 4 — attention, fine-tun
EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation · cs.CV · arXiv 2605.29977 · score 4 — reasoning, attention
Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance · cs.RO · arXiv 2605.30056 · score 4 — rag, attention
Privacy-Enhanced Zero-Order Federated Learning via xMK-CKKS over Wireless Channels · cs.CR · arXiv 2605.30123 · score 4 — serving
SAHG: Sector-Anisotropic Hyperbolic Graph Model for Social Bot Detection · cs.SI · arXiv 2605.30166 · score 4 — llm
Looking around you: external information enhances representations for event sequences · cs.LG · arXiv 2502.10205 · score 4 — attention, fine-tun
Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL · cs.LG · arXiv 2508.08677 · score 4 — serving
Horizon Activation Mapping for Neural Networks in Time Series Forecasting · cs.LG · arXiv 2601.02094 · score 4 — rag, attention
Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations · cs.LG · arXiv 2602.01456 · score 4 — serving
Size Transferability of Graph Transformers with Convolutional Positional Encodings · cs.LG · arXiv 2602.15239 · score 4 — attention, transformer
Is Your Diffusion Sampler Actually Correct? A Sampler-Centric Evaluation of Discrete Diffusion Language Models · cs.LG · arXiv 2602.19619 · score 4 — llm
Statistical Consistency and Generalization of Contrastive Representation Learning · cs.LG · arXiv 2605.02116 · score 4 — retrieval, attention
Building a privacy-preserving Federated Recommender system for mobile devices · cs.LG · arXiv 2605.22924 · score 4 — serving
On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series · cs.LG · arXiv 2605.26194 · score 4 — attention, transformer
Density-aware Sample-specific Attack · cs.LG · arXiv 2605.27809 · score 4 — fine-tun, post-train
Adversarial Robustness in One-Stage Learning-to-Defer · stat.ML · arXiv 2510.10988 · score 4 — serving
Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds · cs.CV · arXiv 2510.27391 · score 4 — attention, transformer
Envy-Free Allocation of Indivisible Goods via Noisy Queries · cs.GT · arXiv 2602.06361 · score 4 — agent
RAFI – A Ray/Work Forwarding Infrastructure for Data Parallel Multi-Node/Multi-GPU Computing · cs.DC · arXiv 2605.30294 · score 4 — gpu, cuda
A Quick and Exact Method for Distributed Quantile Computation · cs.DC · arXiv 2511.12025 · score 4 — rag, latency
A Secure, Manifest-Based Framework for Delegated Privilege Promotion · cs.CR · arXiv 2605.28991 · score 4 — serving
LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers · cs.LG · arXiv 2605.29005 · score 3 — inference
A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router · math.DS · arXiv 2605.29121 · score 3 — moe
Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems · cs.LG · arXiv 2605.29194 · score 3 — inference
Causal Disentanglement-Inspired Degradation Representation Learning for Full-Reference Image Quality Assessment · cs.CV · arXiv 2604.21654 · score 3 — inference
Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification · stat.ML · arXiv 2605.12208 · score 3 — inference
Auditing Training Data in Generative Music Models via Black-Box Membership Inference · cs.LG · arXiv 2605.29202 · score 3 — inference
From Short Histories to Long Futures: Horizon-Aware Graph Neural Networks for Long Horizon Forecasting · cs.LG · arXiv 2605.29952 · score 3 — inference
Distributionally Robust Set Representation Learning Under Inference-Time Element Corruption · cs.LG · arXiv 2605.30089 · score 3 — inference
When, why, and how do diffusion posterior samplers fail? A finite-sample lens · cs.LG · arXiv 2605.30330 · score 3 — inference
Mixing Vector Model for Copolymer Inference via Mixed Integer Linear Programming · cs.LG · arXiv 2605.29329 · score 3 — inference
Wasserstein Contraction of Coordinate Ascent Variational Inference · stat.ML · arXiv 2605.30253 · score 3 — inference
Cooperative Variance Estimation and Bayesian Neural Networks for Disentangling Aleatoric and Epistemic Uncertainties · cs.LG · arXiv 2505.02743 · score 3 — inference
Adaptive Exponential Integration for Stable Gaussian Mixture Black-Box Variational Inference · cs.LG · arXiv 2601.14855 · score 3 — inference
Riemannian AmbientFlow: Towards Simultaneous Manifold Learning and Generative Modeling from Corrupted Data · cs.LG · arXiv 2601.18728 · score 3 — inference
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning · cs.LG · arXiv 2605.01663 · score 3 — inference
Uncertainty Estimation via Hyperspherical Confidence Mapping · cs.LG · arXiv 2605.05964 · score 3 — inference
Inpainting physics: self-supervised learning for context-driven fluid simulation · cs.LG · arXiv 2605.08832 · score 3 — inference
Matryoshka Concept Bottleneck Models · cs.LG · arXiv 2605.20612 · score 3 — inference
Enhancing Membership Inference Attacks on Diffusion Models from a Frequency-Domain Perspective · cs.CR · arXiv 2505.20955 · score 3 — inference
Bridging Maximum Likelihood and Optimal Transport for Efficient Inference and Model Selection in Stochastic Block Models · stat.ML · arXiv 2605.28488 · score 3 — inference
Constant Depth Threshold Circuits For Exhaustive Epistasis Detection · cs.AR · arXiv 2605.29719 · score 3 — parallelism
Bridging the Sim-to-Real Gap in Reinforcement Learning-Based Industrial Dispatching through Execution Semantics · cs.AI · arXiv 2605.29078 · score 2 — rag
Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI · cs.AI · arXiv 2605.29240 · score 2 — attention
Rubric-Guided Process Reward for Stepwise Model Routing · cs.AI · arXiv 2605.29310 · score 2 — reasoning
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · cs.AI · arXiv 2605.29358 · score 2 — transformer
Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion · cs.AI · arXiv 2605.29591 · score 2 — reasoning
FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting · cs.AI · arXiv 2605.29695 · score 2 — transformer
From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks · cs.AI · arXiv 2605.29768 · score 2 — retrieval
Quantifying and Optimizing Simplicity via Polynomial Representations · cs.AI · arXiv 2605.29823 · score 2 — fine-tun
On the Geometry of Games and their Solvers · cs.AI · arXiv 2605.29919 · score 2 — rag
A comparative study of transformer-based embeddings for topic coherence · cs.CL · arXiv 2605.28832 · score 2 — transformer
Transcribing Children’s Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions · cs.CL · arXiv 2605.28833 · score 2 — fine-tun
FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks · cs.LG · arXiv 2605.29001 · score 2 — reasoning
Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving · cs.RO · arXiv 2605.29138 · score 2 — latency
Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children’s Data · cs.CV · arXiv 2605.29230 · score 2 — rag
Extreme dynamic symmetry enables omnidirectional and multifunctional robots · cs.RO · arXiv 2605.29254 · score 2 — rag
KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs · cs.LG · arXiv 2605.29259 · score 2 — rag
Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts · cs.LG · arXiv 2605.29283 · score 2 — rag
DELOS: Detecting Shallow Transits in Kepler Photometry Using a Contrastive-Learning Framework · cs.AI · arXiv 2605.29428 · score 2 — gpu
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions · cs.LG · arXiv 2605.29448 · score 2 — rag
Evolutionary Rule Extraction from Corporate Default Prediction Models · cs.NE · arXiv 2605.29478 · score 2 — rag
Temporal Motif-aware Graph Test-time Adaptation for OOD Blockchain Anomaly Detection · cs.CR · arXiv 2605.29526 · score 2 — rag
Data filtering methods for training language models · cs.CL · arXiv 2605.29807 · score 2 — fine-tun
Evaluating Skill and Stability of ArchesWeather and ArchesWeatherGen under Multi-Decadal Climate Simulations · cs.AI · arXiv 2605.29976 · score 2 — rag
Test Time Training for Supervised Causal Learning · cs.LG · arXiv 2605.30015 · score 2 — rag
Masked Diffusion Modeling for Anomaly Detection · cs.LG · arXiv 2605.30046 · score 2 — rag
A Predictive Law for On-Policy Self-Distillation From World Feedback · cs.LG · arXiv 2605.30070 · score 2 — post-train
Self-Trained Verification for Training- and Test-Time Self-Improvement · cs.LG · arXiv 2605.30290 · score 2 — reasoning
Reasoning with Sampling: Cutting at Decision Points · cs.LG · arXiv 2605.30327 · score 2 — reasoning
TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech · cs.AI · arXiv 2601.11178 · score 2 — reasoning
Recurrent Structural Policy Gradient for Partially Observable Mean Field Games · cs.AI · arXiv 2602.20141 · score 2 — rag
Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases · cs.AI · arXiv 2603.07916 · score 2 — rag
A Foundation Model for Zero-Shot Logical Rule Induction · cs.AI · arXiv 2605.04916 · score 2 — reasoning
Learning A Simulation-based Visual Policy for Real-world Peg In Unseen Holes · cs.RO · arXiv 2205.04297 · score 2 — fine-tun
A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes · cs.LG · arXiv 2502.16548 · score 2 — rag
Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data · cs.SD · arXiv 2502.20838 · score 2 — rag
Taming Data Challenges in ML-based Security Tasks Using Generative AI · cs.CR · arXiv 2507.06092 · score 2 — attention
MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models · cs.CV · arXiv 2507.09574 · score 2 — attention
Page image classification for content-specific data processing · cs.IR · arXiv 2507.21114 · score 2 — rag
Approximate Proportionality in Online Fair Division · cs.GT · arXiv 2508.03253 · score 2 — attention
The Impact of Semantic Pairs on Self-Supervised Representation Learning · cs.LG · arXiv 2510.08722 · score 2 — rag
MiAD: Mirage Atom Diffusion for De Novo Crystal Generation · cs.LG · arXiv 2511.14426 · score 2 — rag
Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach · cs.CV · arXiv 2511.19316 · score 2 — fine-tun
BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models · cs.LG · arXiv 2512.00283 · score 2 — rag
E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving · cs.CV · arXiv 2512.04733 · score 2 — reasoning
Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models · cs.LG · arXiv 2601.14758 · score 2 — post-train
S-MARC: Causal Streaming Reasoning for Full-Duplex Conversational Behavior Modeling · cs.CL · arXiv 2602.11065 · score 2 — reasoning
OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model · cs.SD · arXiv 2602.12304 · score 2 — attention
Post-Training Language Models for Crosslingual Consistency · cs.CL · arXiv 2603.04678 · score 2 — post-train
BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps · cs.SD · arXiv 2604.19532 · score 2 — transformer
MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio · cs.SD · arXiv 2605.00969 · score 2 — reasoning
Aes3D: Aesthetic Assessment in 3D Gaussian Splatting · cs.CV · arXiv 2605.05155 · score 2 — attention
AttenA+: Rectifying Action Inequality in Robotic Foundation Models · cs.RO · arXiv 2605.13548 · score 2 — attention
The Distillation Game: Adaptive Attacks & Efficient Defenses · cs.LG · arXiv 2605.22737 · score 2 — reasoning
Coarse-to-Fine Domain Incremental Learning with Attentive Distillation for Mining Footprint Segmentation in Multispectral Imagery · cs.CV · arXiv 2605.24460 · score 2 — rag
HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos · cs.RO · arXiv 2605.24934 · score 2 — rag
Keep the Proof State Live: Snapshotting for Efficient Tactic Search in Lean 4 · cs.LO · arXiv 2605.25556 · score 2 — rag
Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection · cs.LG · arXiv 2605.26193 · score 2 — rag
ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation · cs.LG · arXiv 2605.28293 · score 2 — rag
From Data to Insights: Exploring Program-of-Thoughts Prompting for Chart Summarization · cs.CL · arXiv 2605.28874 · score 2 — reasoning
Prompt-Level Reward Specifications for Open-Ended Post-Training · cs.CL · arXiv 2605.29275 · score 2 — post-train
Attention Asymmetry in AI Layoff Discourse on X: A Computational Analysis of Capital vs Labour Amplification · cs.CL · arXiv 2605.29367 · score 2 — attention
World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models · cs.CL · arXiv 2605.29585 · score 2 — reasoning
Metric-Dependent Annotation Saturation for Learning from Label Distributions · cs.CL · arXiv 2605.29797 · score 2 — fine-tun
Early Detection of Misinformation for Infodemic Management: A Domain Adaptation Approach · cs.CL · arXiv 2406.10238 · score 2 — rag
What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies · cs.CL · arXiv 2603.02082 · score 2 — rag
X-GS: An Extensible Framework for Perceiving and Thinking via 3D Gaussian Splatting · cs.CV · arXiv 2603.09632 · score 2 — rag
Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit · cs.LG · arXiv 2605.28873 · score 2 — quantization
Spectral Guidance for Flexible and Efficient Control of Diffusion Models · cs.LG · arXiv 2605.28900 · score 2 — rag
Sequential Physics-Constrained Neural Operator Forward Modeling for the Norne Reservoir System · cs.LG · arXiv 2605.28909 · score 2 — gpu
Cycle-Space Informed Detection of Autoencoded Blind False Data Injection Attacks on Power Systems · cs.LG · arXiv 2605.28912 · score 2 — rag
Designing Active Tether-Net Systems for Space Debris Capture with Graph-Learning-Aided Mixed-Combinatorial Optimization · cs.LG · arXiv 2605.29021 · score 2 — fine-tun
Model Merging by Output-Space Projection · cs.LG · arXiv 2605.29101 · score 2 — fine-tun
Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation · cs.LG · arXiv 2605.29108 · score 2 — fine-tun
PROTOCOL: Late Interaction Retrieval for Protein Homolog Search · cs.LG · arXiv 2605.29158 · score 2 — retrieval
Traditional machine learning vs. deep learning from dynamic graph representations of proteins’ 3D folds in the task of protein structure classification · cs.LG · arXiv 2605.29228 · score 2 — rag
Robust Frequency-Calibrated Virtual EEG Channel Generation from Four Frontal Electrodes for Wearable EEG Augmentation · cs.LG · arXiv 2605.29263 · score 2 — attention
Information-Directed Offline-to-Online Reinforcement Learning · cs.LG · arXiv 2605.29405 · score 2 — rag
Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption · cs.LG · arXiv 2605.29497 · score 2 — retrieval
Realistic honeypot evaluations for scheming propensity · cs.LG · arXiv 2605.29729 · score 2 — rag
Gated Graph Attention Networks with Learnable Temperature · cs.LG · arXiv 2605.29803 · score 2 — attention
OVA-IB: One vs All Information Bottleneck for Multi-Modal Alignment · cs.LG · arXiv 2605.29900 · score 2 — retrieval
Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression · cs.LG · arXiv 2605.29932 · score 2 — transformer
Ridge Regression from Poisson Resetting: A Renewal Perspective on Spectral Regularization · cs.LG · arXiv 2605.30059 · score 2 — rag
Q-ANCHOR: Federated Quantum Learning with ZNE-guided Correction · cs.LG · arXiv 2605.30075 · score 2 — rag
Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences · cs.LG · arXiv 2605.30100 · score 2 — transformer
Striding Across Reynolds Numbers: Representation Geometry in Neural PDE Generalisation · cs.LG · arXiv 2605.30112 · score 2 — retrieval
Learning to Extrapolate to New Tasks: A Relational Approach to Task Extrapolation · cs.LG · arXiv 2605.30132 · score 2 — fine-tun
Can AI Weather Models Predict Beyond Two Weeks? A Quantitative Benchmark and Analysis of Long Rollouts · cs.LG · arXiv 2605.30184 · score 2 — transformer
ExDBSCAN: Explaining DBSCAN with Counterfactual Reasoning – Additional Material · cs.LG · arXiv 2605.30225 · score 2 — reasoning
Neural Operator-Based Surrogate Model for CFD: Helical Coil Steam Generator in Small Modular Reactor · cs.LG · arXiv 2605.30277 · score 2 — rag
WASHH: An Anchor-Aware Whale-Guided Selection Hyper-Heuristic for Continuous Optimization and SVC Configuration · cs.NE · arXiv 2605.28844 · score 2 — rag
Financially Guided Deep Portfolio Optimization · cs.LG · arXiv 2605.28853 · score 2 — attention
Lightweight Complementary-Cue Fusion for Robust Video Face Forgery Detection · cs.CV · arXiv 2605.29092 · score 2 — rag
ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving · cs.CR · arXiv 2605.29114 · score 2 — reasoning
Real-Time Retargeting Using Controllability Boundary for Chandrayaan-3 Lunar Landing · eess.SY · arXiv 2605.29412 · score 2 — rag
Deep Optimal Individualized Treatment Rules for Bivariate Survival Outcomes via Adaptive Prediction-Powered Learning · stat.ML · arXiv 2605.29464 · score 2 — rag
The Complexity of Verifying Feedforward Neural Networks in Quantised Settings · cs.CC · arXiv 2605.29537 · score 2 — reasoning
Parameter-Efficient Subspace Decoupling ViT for Mitigating Multi-Task Negative Transfer in Histological Scoring · cs.CV · arXiv 2605.29852 · score 2 — transformer
Gesture-Aware Indoor THz ISAC Systems for Adaptive Resource Allocation · cs.IT · arXiv 2605.29913 · score 2 — rag
Visual Spatial Learning: Single-Field Spatial Interpolation Using Convolutional Neural Networks · stat.ML · arXiv 2605.30167 · score 2 — rag
Unveiling the Visual Counting Bottleneck in Vision-Language Models · cs.MM · arXiv 2605.30170 · score 2 — reasoning
DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation · cs.RO · arXiv 2605.30350 · score 2 — rag
An Empirical Study of the Influence of Adversarial Fine-Tuning on Compressed Neural Networks · cs.LG · arXiv 2403.09441 · score 2 — fine-tun
A Quotient Homology Theory of Representation in Neural Networks · cs.LG · arXiv 2502.01360 · score 2 — rag
Connecting Independently Trained Modes via Layer-Wise Connectivity · cs.LG · arXiv 2505.02604 · score 2 — transformer
Active Learning for Machine Learning Driven Molecular Dynamics · cs.LG · arXiv 2509.17208 · score 2 — rag
FedBiCross: Personalized One-Shot Federated Learning on Medical Images · cs.LG · arXiv 2601.01901 · score 2 — rag
Achieving Linear Speedup for Composite Federated Learning · cs.LG · arXiv 2602.03357 · score 2 — rag
Computationally Efficient Replicable Learning of Parities and Applications · cs.LG · arXiv 2602.09499 · score 2 — rag
Collaborative Threshold Watermarking · cs.LG · arXiv 2602.10765 · score 2 — fine-tun
Localizing Memorized Regions in Diffusion Models via Coordinate-Wise Curvature Differences · cs.LG · arXiv 2605.26756 · score 2 — attention
Continual Learning in Modern Hopfield Networks with an Application to Diffusion Models · cs.LG · arXiv 2605.27975 · score 2 — fine-tun
MVP-Shapley: Feature-based Modeling for Evaluating the Most Valuable Player in Basketball · cs.GT · arXiv 2506.04602 · score 2 — rag
A Complete Loss Landscape Analysis of Regularized Deep Matrix Factorization · math.OC · arXiv 2506.20344 · score 2 — rag
SpeedCP: Fast Kernel-based Conditional Conformal Prediction · stat.ME · arXiv 2509.24100 · score 2 — rag
Contrastive Representation Regularization for Vision-Language-Action Models · cs.RO · arXiv 2510.01711 · score 2 — rag
Permutation-Invariant Spectral Learning via Dyson Diffusion · stat.ML · arXiv 2510.08535 · score 2 — rag
Calibrating Generative Models to Distributional Constraints · stat.ML · arXiv 2510.10020 · score 2 — fine-tun
Securing SIM-Assisted Wireless Networks via Quantum Reinforcement Learning · cs.NI · arXiv 2602.13238 · score 2 — rag
Estimating Continuous Treatment Effects with Two-Stage Kernel Ridge Regression · stat.ME · arXiv 2604.13410 · score 2 — rag
A Deep Learning Model for Battery State Prediction towards Intelligent Energy Management · eess.SP · arXiv 2605.00898 · score 2 — rag
Paris 2.0: A Decentralized Diffusion Model for Video Generation · cs.CV · arXiv 2605.26064 · score 2 — gpu
Design and Implementation of a Serverless MapReduce Framework for Scalable Data Pipelines · cs.DC · arXiv 2605.29573 · score 2 — rag
PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration · cs.DC · arXiv 2605.29728 · score 2 — gpu
Capsule: Efficient Player Isolation for Datacenters · cs.DC · arXiv 2506.11483 · score 2 — gpu
Precomputed 1D-CNNs for Atrial Fibrillation Detection on Tiny Smart Sensor Systems · cs.AR · arXiv 2605.29994 · score 2 — latency
elasticAI.explorer: Towards a Unified End-to-End Framework for Hardware-Aware Neural Architecture Search · cs.AR · arXiv 2605.30019 · score 2 — latency
Space-Control: Process-Level Isolation for Sharing CXL-based Disaggregated Memory · cs.AR · arXiv 2603.06951 · score 2 — rag