arXiv: 2604.19689 · PDF

Authors: Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

Primary category: cs.AI · All categories: cs.AI

Matched keywords: large language model, llm, agent, retrieval, reasoning, ai system


TL;DR

A-MAR proposes an agent-based multimodal art retrieval framework that first generates a structured reasoning plan and then conditions retrieval on it, enabling interpretable, fine-grained understanding of artworks.

Key Points

  • Existing MLLMs explain artworks through implicit reasoning and internalized knowledge, which limits interpretability and explicit evidence grounding.
  • Explicitly conditioning retrieval on a structured reasoning plan supports step-wise, grounded explanations.
  • ArtCoT-QA, a new diagnostic benchmark, evaluates multi-step reasoning chains rather than only final-answer accuracy.

Method

Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goal and evidence requirements of each step, then performs targeted evidence retrieval conditioned on that plan, and finally produces a step-wise, traceable explanation. The overall agent-based pipeline is: plan → retrieve → explain.
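The plan → retrieve → explain flow can be illustrated with a minimal sketch. This is not the paper's implementation: the `make_plan` planner, the keyword-overlap `retrieve`, and all names here are hypothetical stand-ins (a real system would prompt an LLM agent for the plan and use a multimodal retriever).

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    goal: str           # what this reasoning step should establish
    evidence_need: str  # what kind of evidence the step requires

@dataclass
class ReasoningPlan:
    steps: list[PlanStep] = field(default_factory=list)

def make_plan(query: str) -> ReasoningPlan:
    # Placeholder planner; a real agent would generate this plan
    # with an LLM, conditioned on the artwork and the query.
    return ReasoningPlan(steps=[
        PlanStep("identify the depicted scene", "visual description"),
        PlanStep("situate the work historically", "period and movement facts"),
    ])

def retrieve(step: PlanStep, corpus: dict[str, str]) -> list[str]:
    # Toy retrieval via keyword overlap between the step's evidence
    # need and each document; stands in for a multimodal retriever.
    need = set(step.evidence_need.lower().split())
    return [doc for doc in corpus.values()
            if need & set(doc.lower().split())]

def explain(query: str, corpus: dict[str, str]) -> list[tuple[str, list[str]]]:
    # plan -> retrieve -> explain: each step carries its own evidence,
    # so the final explanation is step-wise and traceable.
    plan = make_plan(query)
    return [(s.goal, retrieve(s, corpus)) for s in plan.steps]
```

The design point is that evidence is selected per plan step rather than once per query, which is what makes the resulting explanation grounded step by step.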

Experiments

  • Datasets: SemArt, Artpedia, and the newly built ArtCoT-QA (a diagnostic benchmark of multi-step reasoning chains).
  • Baselines: static, non-planned retrieval methods and strong MLLM baselines.
  • Metrics: final explanation quality, evidence grounding, and multi-step reasoning ability (fine-grained diagnostics).
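One way to read "multi-step reasoning chains rather than only final-answer accuracy" is per-step scoring. The sketch below is an illustrative assumption, not ArtCoT-QA's actual metric: it credits each predicted step that matches its gold counterpart, alongside final-answer accuracy.

```python
def chain_scores(pred_steps: list[str], gold_steps: list[str]) -> dict[str, float]:
    # Hypothetical diagnostic scoring: step-level credit exposes where
    # a reasoning chain breaks, which a final-answer metric hides.
    hits = sum(p == g for p, g in zip(pred_steps, gold_steps))
    return {
        "step_accuracy": hits / len(gold_steps),
        "final_accuracy": float(pred_steps[-1] == gold_steps[-1]),
    }
```

A model can score 1.0 on `final_accuracy` while failing intermediate steps, which is exactly the gap a chain-level diagnostic benchmark is meant to reveal.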

Results

A-MAR consistently outperforms static retrieval and MLLM baselines in final explanation quality on SemArt and Artpedia; on ArtCoT-QA it further leads in evidence grounding and multi-step reasoning. No numerical summary is provided.

Why It Matters

For practitioners working on agents and multimodal RAG, the paper demonstrates the payoff of driving retrieval with an explicit reasoning plan on knowledge-intensive tasks, pointing toward goal-driven, interpretable AI systems, with particular application value for the cultural industries.

Relation to Prior Work

It continues the line of multimodal RAG and agentic reasoning (e.g., ReAct, Plan-and-Solve), connects with art-understanding dataset work such as SemArt and Artpedia, and compares against current mainstream MLLM approaches to artwork explanation.

Open Questions

  • How sensitive is the final performance to the quality of the generated reasoning plan?
  • Does the approach transfer to knowledge-intensive domains beyond art?
  • What is the extra computational overhead of planning plus retrieval?
  • Details on ArtCoT-QA's scale, construction process, and human-evaluation agreement are not given.

Paper Figures

Figure 1 (extracted from PDF, no caption available)

Figure 2 (extracted from PDF, no caption available)

Figure 3 (extracted from PDF, no caption available)


Original Abstract

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.