arXiv: 2604.19689 · PDF

Authors: Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

Primary category: cs.AI · All categories: cs.AI

Matched keywords: large language model, llm, agent, retrieval, reasoning, ai system


TL;DR

A-MAR proposes an agent-based multimodal art retrieval framework that first generates a structured reasoning plan and then conditions retrieval on it, enabling interpretable, fine-grained understanding of artworks.

Key Points

  • Existing MLLMs explain artworks through implicit reasoning and internalized knowledge, which limits interpretability and explicit evidence grounding.
  • Explicitly conditioning retrieval on a structured reasoning plan supports step-wise, grounded explanations.
  • ArtCoT-QA, a new diagnostic benchmark, evaluates multi-step reasoning chains rather than only final-answer accuracy.

Method

Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goal and evidence requirements of each step, then performs targeted evidence retrieval conditioned on that plan, and finally produces a step-wise, traceable explanation. The overall agent-based pipeline is: plan → retrieve → explain.
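The plan → retrieve → explain flow can be illustrated with a minimal sketch. This is not the paper's implementation: the `make_plan` planner, the keyword-overlap `retrieve`, and all names here are hypothetical stand-ins (a real system would prompt an LLM agent for the plan and use a multimodal retriever).

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    goal: str           # what this reasoning step should establish
    evidence_need: str  # what kind of evidence the step requires

@dataclass
class ReasoningPlan:
    steps: list[PlanStep] = field(default_factory=list)

def make_plan(query: str) -> ReasoningPlan:
    # Placeholder planner; a real agent would generate this plan
    # with an LLM, conditioned on the artwork and the query.
    return ReasoningPlan(steps=[
        PlanStep("identify the depicted scene", "visual description"),
        PlanStep("situate the work historically", "period and movement facts"),
    ])

def retrieve(step: PlanStep, corpus: dict[str, str]) -> list[str]:
    # Toy retrieval via keyword overlap between the step's evidence
    # need and each document; stands in for a multimodal retriever.
    need = set(step.evidence_need.lower().split())
    return [doc for doc in corpus.values()
            if need & set(doc.lower().split())]

def explain(query: str, corpus: dict[str, str]) -> list[tuple[str, list[str]]]:
    # plan -> retrieve -> explain: each step carries its own evidence,
    # so the final explanation is step-wise and traceable.
    plan = make_plan(query)
    return [(s.goal, retrieve(s, corpus)) for s in plan.steps]
```

The design point is that evidence is selected per plan step rather than once per query, which is what makes the resulting explanation grounded step by step.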

Experiments

  • Datasets: SemArt, Artpedia, and the newly built ArtCoT-QA (a diagnostic benchmark of multi-step reasoning chains).
  • Baselines: static, non-planned retrieval methods and strong MLLM baselines.
  • Metrics: final explanation quality, evidence grounding, and multi-step reasoning ability (fine-grained diagnostics).
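One way to read "multi-step reasoning chains rather than only final-answer accuracy" is per-step scoring. The sketch below is an illustrative assumption, not ArtCoT-QA's actual metric: it credits each predicted step that matches its gold counterpart, alongside final-answer accuracy.

```python
def chain_scores(pred_steps: list[str], gold_steps: list[str]) -> dict[str, float]:
    # Hypothetical diagnostic scoring: step-level credit exposes where
    # a reasoning chain breaks, which a final-answer metric hides.
    hits = sum(p == g for p, g in zip(pred_steps, gold_steps))
    return {
        "step_accuracy": hits / len(gold_steps),
        "final_accuracy": float(pred_steps[-1] == gold_steps[-1]),
    }
```

A model can score 1.0 on `final_accuracy` while failing intermediate steps, which is exactly the gap a chain-level diagnostic benchmark is meant to reveal.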

Results

A-MAR consistently outperforms static retrieval and MLLM baselines in final explanation quality on SemArt and Artpedia; on ArtCoT-QA it further leads in evidence grounding and multi-step reasoning. No numerical summary is provided.

Why It Matters

For practitioners working on agents and multimodal RAG, the paper demonstrates the payoff of driving retrieval with an explicit reasoning plan on knowledge-intensive tasks, pointing toward goal-driven, interpretable AI systems, with particular application value for the cultural industries.

Relation to Prior Work

It continues the line of multimodal RAG and agentic reasoning (e.g., ReAct, Plan-and-Solve), connects with art-understanding dataset work such as SemArt and Artpedia, and compares against current mainstream MLLM approaches to artwork explanation.

Open Questions

  • How sensitive is the final performance to the quality of the generated reasoning plan?
  • Does the approach transfer to knowledge-intensive domains beyond art?
  • What is the extra computational overhead of planning plus retrieval?
  • Details on ArtCoT-QA's scale, construction process, and human-evaluation agreement are not given.

Paper Figures

Figure 1 (extracted from PDF, no caption available)

Figure 2 (extracted from PDF, no caption available)

Figure 3 (extracted from PDF, no caption available)


Original Abstract

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.