<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>2026-04-22 Paper Digest on JXIN&#39;s Home</title>
    <link>https://ftxj.github.io/posts/2026-04-22/</link>
    <description>Recent content in 2026-04-22 Paper Digest on JXIN&#39;s Home</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 05:17:00 +0000</lastBuildDate>
    <atom:link href="https://ftxj.github.io/posts/2026-04-22/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks</title>
      <link>https://ftxj.github.io/posts/2026-04-22/10-co-evolving-llm-decision-and-skill-bank-agents-for-long-hori/</link>
      <pubDate>Mon, 27 Apr 2026 05:17:00 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/10-co-evolving-llm-decision-and-skill-bank-agents-for-long-hori/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20987v1&#34;&gt;2604.20987&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20987v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, Dinesh Manocha&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, retrieval, rag, reasoning&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;COSPLAY is a co-evolution framework pairing an LLM decision agent with a learnable skill bank: the decision agent retrieves skills to act, while a skill-pipeline agent mines reusable skills from unlabeled rollouts. An 8B model beats four frontier LLM baselines by &amp;gt;25% average reward on six game environments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction</title>
      <link>https://ftxj.github.io/posts/2026-04-22/09-agentic-ai-for-personalized-physiotherapy-a-multi-agent-fram/</link>
      <pubDate>Mon, 27 Apr 2026 05:16:25 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/09-agentic-ai-for-personalized-physiotherapy-a-multi-agent-fram/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.21154v1&#34;&gt;2604.21154&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.21154v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Abhishek Dharmaratnakar, Srivaths Ranganathan, Anushree Sinha, Debanshu Das&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, agent, agentic, multi-agent, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Proposes a four-agent system that parses clinical notes, generates patient-specific exercise videos, tracks poses in real time, and delivers corrective feedback for at-home physiotherapy. The paper is largely architectural, presenting a prototype and evaluation plan rather than clinical results.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Tele-rehabilitation gap stems from static video libraries and generic avatars ignoring patient-specific constraints.&lt;/li&gt;&#xA;&lt;li&gt;A Multi-Agent System (MAS) can close the loop by combining generative video, pose estimation, and autonomous feedback.&lt;/li&gt;&#xA;&lt;li&gt;Four specialized micro-agents cover extraction, synthesis, vision, and diagnostics.&lt;/li&gt;&#xA;&lt;li&gt;Unstructured clinical notes can be turned into kinematic constraints that condition downstream generation.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;A pipeline of four micro-agents:&lt;/p&gt;</description>
    </item>
    <item>
      <title>EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation</title>
      <link>https://ftxj.github.io/posts/2026-04-22/08-evoagent-an-evolvable-agent-framework-with-skill-learning-an/</link>
      <pubDate>Mon, 27 Apr 2026 05:15:48 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/08-evoagent-an-evolvable-agent-framework-with-skill-learning-an/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20133v2&#34;&gt;2604.20133&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20133v2&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Aimin Zhang, Jiajing Guo, Fuwei Jia, Chen Lv, Boyu Wang, Fangzheng Li&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, rag&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;EvoAgent is an evolvable LLM agent framework combining structured skill learning, hierarchical sub-agent delegation, and a three-layer memory. On real-world foreign-trade tasks with GPT5.2, it lifts a five-dimensional LLM-as-Judge score by ~28%.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Skills modeled as multi-file structured capability units with triggers and evolutionary metadata.&lt;/li&gt;&#xA;&lt;li&gt;User-feedback-driven closed loop for continuous skill generation and optimization.&lt;/li&gt;&#xA;&lt;li&gt;Three-stage skill matching plus three-layer memory architecture for long-term accumulation.&lt;/li&gt;&#xA;&lt;li&gt;Hierarchical sub-agent delegation enabling dynamic task decomposition.&lt;/li&gt;&#xA;&lt;li&gt;Agent performance depends on model–architecture synergy, not just base model strength.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;Each skill is a structured artifact (multiple files) carrying triggering logic and evolutionary metadata, so the system can decide when to invoke it and how to mutate it over time. A three-stage matcher selects skills for an incoming task; a three-layer memory separates short-term, working, and long-term context. 
A hierarchical delegation mechanism spawns sub-agents for decomposed subtasks, and a user-feedback closed loop drives skill creation and refinement.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving</title>
      <link>https://ftxj.github.io/posts/2026-04-22/07-dual-cluster-memory-agent-resolving-multi-paradigm-ambiguity/</link>
      <pubDate>Mon, 27 Apr 2026 05:14:56 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/07-dual-cluster-memory-agent-resolving-multi-paradigm-ambiguity/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20183v1&#34;&gt;2604.20183&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20183v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Xinyu Zhang, Yuchen Wan, Boxuan Zhang, Zesheng Yang, Lingling Zhang, Bifan Wei, Jun Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;DCM-Agent is a training-free framework that resolves structural ambiguity in LLM-based optimization problem solving by maintaining dual clusters of historical solutions (modeling + coding), distilled into Approach/Checklist/Pitfall knowledge, and using them for memory-augmented inference.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Optimization problems suffer from multi-paradigm ambiguity that confuses LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Split memory into two clusters: modeling and coding.&lt;/li&gt;&#xA;&lt;li&gt;Distill each cluster into three structured knowledge types: Approach, Checklist, Pitfall.&lt;/li&gt;&#xA;&lt;li&gt;Use memory at inference for path navigation, error repair, and adaptive switching.&lt;/li&gt;&#xA;&lt;li&gt;Observed &amp;ldquo;knowledge inheritance&amp;rdquo;: memory from larger models lifts smaller models.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The Dual-Cluster Memory Construction step routes prior solutions into modeling vs. coding clusters, then distills generalizable guidance into structured Approach / Checklist / Pitfall entries. 
At inference, the agent retrieves relevant memory to pick a reasoning path, detects and repairs errors, and adaptively switches paradigms. The entire pipeline is training-free, relying on prompting plus a structured memory bank.&lt;/p&gt;</description>
    </item>
    <item>
      <title>FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving</title>
      <link>https://ftxj.github.io/posts/2026-04-22/06-faser-fine-grained-phase-management-for-speculative-decoding/</link>
      <pubDate>Mon, 27 Apr 2026 05:14:26 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/06-faser-fine-grained-phase-management-for-speculative-decoding/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20503v1&#34;&gt;2604.20503&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20503v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Wenyan Chen, Chengzhi Lu, Yanying Lin, Dmitrii Ustiugov&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.DC&lt;/code&gt; · all: cs.DC&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; llm, inference, serving, speculative decoding, gpu, throughput, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;FASER is a fine-grained speculative-decoding scheduler for dynamic LLM serving that tunes speculative length per request, prunes rejected tokens early, and spatially overlaps draft and verification phases, yielding up to 53% higher throughput and 1.92× lower latency over SOTA in vLLM.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Coarse-grained, batch-level speculative decoding (SD) wastes GPU cycles under both low and high load.&lt;/li&gt;&#xA;&lt;li&gt;Speculative length should be a per-request knob inside a continuous batch, not a global constant.&lt;/li&gt;&#xA;&lt;li&gt;Verification can be chunked into &amp;ldquo;frontiers&amp;rdquo; and overlapped with drafting via spatial multiplexing.&lt;/li&gt;&#xA;&lt;li&gt;Rejected tokens can be pruned mid-verification to avoid wasted compute.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;FASER extends vLLM with three mechanisms: (1) dynamic per-request speculative length based on acceptance behavior within a continuous batch; (2) early pruning that terminates verification for tokens already rejected, reclaiming GPU work; (3) frontier-based verification that splits the verify pass into chunks and co-executes them with draft kernels using fine-grained 
spatial multiplexing for low interference.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows</title>
      <link>https://ftxj.github.io/posts/2026-04-22/05-cooperative-profiles-predict-multi-agent-llm-team-performanc/</link>
      <pubDate>Mon, 27 Apr 2026 05:13:53 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/05-cooperative-profiles-predict-multi-agent-llm-team-performanc/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20658v1&#34;&gt;2604.20658&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20658v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Shivani Kumar, Adarsh Bharathwaj, David Jurgens&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CL&lt;/code&gt; · all: cs.CL&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, multi-agent, reasoning, gpu&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;Authors benchmark 35 open-weight LLMs on six behavioral-economics games and show that the resulting &amp;ldquo;cooperative profiles&amp;rdquo; predict downstream team performance in AI-for-Science workflows under shared budget constraints, offering a cheap diagnostic for multi-agent deployment.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Cooperative disposition is a distinct, measurable LLM property, not reducible to general capability.&lt;/li&gt;&#xA;&lt;li&gt;Behavioral-economics games isolate cooperation mechanisms that transfer to realistic multi-agent science tasks.&lt;/li&gt;&#xA;&lt;li&gt;Models favoring multiplicative team production over greedy strategies yield better scientific reports.&lt;/li&gt;&#xA;&lt;li&gt;Game-based screening can precede expensive multi-agent rollouts.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Evaluate 35 open-weight LLMs across six behavioral-economics games targeting distinct cooperation mechanisms (coordination, investment, resource sharing).&lt;/li&gt;&#xA;&lt;li&gt;Derive per-model &amp;ldquo;cooperative profiles&amp;rdquo; from game behavior.&lt;/li&gt;&#xA;&lt;li&gt;Deploy LLM teams in an AI-for-Science pipeline: collaboratively analyze data, build models, 
    and write scientific reports under shared budgets (e.g., GPU/credit caps).&lt;/li&gt;&#xA;&lt;li&gt;Regress downstream outcomes on cooperative profile features while controlling for confounds (likely model size and general-ability benchmarks).&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;experiments&#34;&gt;Experiments&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Models: 35 open-weight LLMs.&lt;/li&gt;&#xA;&lt;li&gt;Games: six behavioral-economics tasks (the abstract is not specific, but these likely include public-goods, trust, and coordination variants).&lt;/li&gt;&#xA;&lt;li&gt;Downstream task: multi-agent AI-for-Science workflow with shared constraints.&lt;/li&gt;&#xA;&lt;li&gt;Metrics: report accuracy, quality, and completion.&lt;/li&gt;&#xA;&lt;li&gt;Baselines / controls: general-ability factors partialled out.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;results&#34;&gt;Results&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Cooperative profiles robustly predict downstream accuracy, quality, and completion.&lt;/li&gt;&#xA;&lt;li&gt;Effect persists after controlling for multiple confounding factors.&lt;/li&gt;&#xA;&lt;li&gt;Headline numerical effect sizes not given in the abstract.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;why-it-matters&#34;&gt;Why It Matters&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Provides a fast, inexpensive screening tool for multi-agent LLM deployments where coordination and budget-sharing matter.&lt;/li&gt;&#xA;&lt;li&gt;Reframes multi-agent selection beyond raw benchmark scores toward cooperative disposition.&lt;/li&gt;&#xA;&lt;li&gt;Useful for agent/infra teams building scientific, engineering, or tool-using LLM collectives.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;connections-to-prior-work&#34;&gt;Connections to Prior Work&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Behavioral-economics probes of LLMs (trust games, ultimatum, public-goods studies).&lt;/li&gt;&#xA;&lt;li&gt;Multi-agent LLM frameworks (AutoGen, MetaGPT, ChatDev, 
AI-Scientist).&lt;/li&gt;&#xA;&lt;li&gt;Work on LLM &amp;ldquo;personality&amp;rdquo; / social-preference elicitation.&lt;/li&gt;&#xA;&lt;li&gt;Emergent cooperation and game-theoretic evaluations in RL agents.&lt;/li&gt;&#xA;&lt;li&gt;Scientific-writing and data-analysis agent benchmarks.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;open-questions&#34;&gt;Open Questions&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Which specific games carry the most predictive signal, and do they generalize beyond AI-for-Science?&lt;/li&gt;&#xA;&lt;li&gt;Does the cooperative profile stay stable under prompting, fine-tuning, or RLHF interventions?&lt;/li&gt;&#xA;&lt;li&gt;Are closed-weight frontier models (GPT-4.x, Claude, Gemini) consistent with the 35-model findings?&lt;/li&gt;&#xA;&lt;li&gt;Can cooperative disposition be deliberately trained or aligned, and at what cost to single-agent capability?&lt;/li&gt;&#xA;&lt;li&gt;How do heterogeneous teams (mixing cooperators and defectors) behave versus homogeneous ones?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;figures&#34;&gt;Figures&lt;/h2&gt;&#xA;&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Page 2 (rendered)&lt;/p&gt;</description>
    </item>
    <item>
      <title>Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models</title>
      <link>https://ftxj.github.io/posts/2026-04-22/04-breaking-mcp-with-function-hijacking-attacks-novel-threats-f/</link>
      <pubDate>Mon, 27 Apr 2026 05:13:18 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/04-breaking-mcp-with-function-hijacking-attacks-novel-threats-f/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20994v1&#34;&gt;2604.20994&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20994v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis, Seshu Tirupathi, John D. Kelleher&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.CR&lt;/code&gt; · all: cs.AI, cs.CL, cs.CR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, reasoning, attention&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;This paper introduces Function Hijacking Attacks (FHA), a novel adversarial technique that manipulates agentic LLMs&amp;rsquo; tool selection to force invocation of attacker-chosen functions, achieving 70-100% attack success rates across five models on the BFCL benchmark, largely independent of query semantics.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems</title>
      <link>https://ftxj.github.io/posts/2026-04-22/03-automatic-ontology-construction-using-llms-as-an-external-la/</link>
      <pubDate>Mon, 27 Apr 2026 05:12:44 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/03-automatic-ontology-construction-using-llms-as-an-external-la/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20795v1&#34;&gt;2604.20795&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20795v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Pavel Salovskii, Iuliia Gorshkova&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.AI&lt;/code&gt; · all: cs.AI&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, retrieval, rag, reasoning, inference&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;The paper proposes a hybrid architecture augmenting LLMs with an external RDF/OWL ontological memory layer, automatically constructed from heterogeneous sources, to enable persistent, verifiable, and semantically grounded reasoning beyond vector-based RAG.&lt;/p&gt;&#xA;&lt;h2 id=&#34;key-ideas&#34;&gt;Key Ideas&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;LLMs suffer from weak long-term memory, poor structure, and unreliable multi-step reasoning.&lt;/li&gt;&#xA;&lt;li&gt;An external ontology (RDF/OWL knowledge graph) acts as verifiable memory and planning substrate.&lt;/li&gt;&#xA;&lt;li&gt;Automated pipeline builds and maintains the ontology from documents, APIs, and dialogue logs.&lt;/li&gt;&#xA;&lt;li&gt;SHACL/OWL constraints turn inference into a generation–verification–correction loop.&lt;/li&gt;&#xA;&lt;li&gt;Hybrid inference combines vector retrieval, graph reasoning, and external tool calls.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;approach&#34;&gt;Approach&lt;/h2&gt;&#xA;&lt;p&gt;The pipeline extracts entities and relations from heterogeneous inputs, normalizes them, and generates RDF triples. Triples are validated against SHACL shapes and OWL axioms, then merged into a continuously updated knowledge graph. 
At inference time, the LLM conditions on a composite context fusing vector-retrieved passages, graph subqueries, and tool outputs. Generated answers are checked against ontology constraints; violations trigger correction, yielding a closed verify-and-repair loop.&lt;/p&gt;</description>
    </item>
    <item>
      <title>HaS: Accelerating RAG through Homology-Aware Speculative Retrieval</title>
      <link>https://ftxj.github.io/posts/2026-04-22/02-has-accelerating-rag-through-homology-aware-speculative-retr/</link>
      <pubDate>Mon, 27 Apr 2026 05:12:02 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/02-has-accelerating-rag-through-homology-aware-speculative-retr/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20452v1&#34;&gt;2604.20452&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20452v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IR&lt;/code&gt; · all: cs.CL, cs.IR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, retrieval, rag, inference, latency&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;HaS accelerates Retrieval-Augmented Generation by speculatively retrieving from a restricted scope, then validating candidates via &amp;ldquo;homologous query re-identification&amp;rdquo;: checking whether the incoming query matches a previously seen one. This bypasses full-database search for repeat-like queries, cutting latency by 24–37% with 1–2% accuracy loss.&lt;/p&gt;</description>
    </item>
    <item>
      <title>SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition</title>
      <link>https://ftxj.github.io/posts/2026-04-22/01-sake-self-aware-knowledge-exploitation-exploration-for-groun/</link>
      <pubDate>Mon, 27 Apr 2026 05:11:32 +0000</pubDate>
      <guid>https://ftxj.github.io/posts/2026-04-22/01-sake-self-aware-knowledge-exploitation-exploration-for-groun/</guid>
      <description>&lt;p&gt;&lt;strong&gt;arXiv:&lt;/strong&gt; &lt;a href=&#34;https://arxiv.org/abs/2604.20146v1&#34;&gt;2604.20146&lt;/a&gt; · &lt;a href=&#34;https://arxiv.org/pdf/2604.20146v1&#34;&gt;PDF&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu, Xiao Dong, Lin Chen, Yunlai Teng, Shimin Di, Jian Yin&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Primary category:&lt;/strong&gt; &lt;code&gt;cs.IR&lt;/code&gt; · all: cs.CL, cs.IR&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Matched keywords:&lt;/strong&gt; large language model, llm, agent, agentic, tool-use, retrieval, reasoning, chain-of-thought, serving, fine-tun&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;&#xA;&lt;p&gt;SAKE is an end-to-end agentic framework for Grounded Multimodal Named Entity Recognition (GMNER) that blends internal MLLM knowledge with external retrieval via self-aware reasoning, deciding when to invoke search tools to handle long-tailed and unseen entities on social media.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
