arXiv: 2604.19689 · PDF

Authors: Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

Primary category: cs.AI · all: cs.AI

Matched keywords: large language model, llm, agent, retrieval, reasoning, ai system


TL;DR

A-MAR is an agent-based multimodal retrieval framework that decomposes artwork queries into structured reasoning plans, then conditions retrieval on each step to produce grounded, interpretable explanations. It outperforms static retrieval and MLLM baselines on SemArt, Artpedia, and a new ArtCoT-QA benchmark.

Key Ideas

  • Explicit, plan-conditioned retrieval beats implicit MLLM reasoning for artwork understanding.
  • Structured reasoning plans specify per-step goals and evidence requirements (a minimal sketch follows this list).
  • New ArtCoT-QA benchmark evaluates multi-step reasoning chains, not just final answers.
  • Step-wise grounding improves interpretability in knowledge-intensive domains.
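
A plan here is an explicit, inspectable artifact rather than a hidden chain of thought. As a rough illustration, the Python sketch below shows what such a plan could contain; the class and field names (PlanStep, ReasoningPlan, evidence_needed) are assumptions for exposition, not the paper's actual schema.

    from dataclasses import dataclass, field

    @dataclass
    class PlanStep:
        goal: str                   # what this step must establish
        evidence_needed: list[str]  # evidence types to retrieve for this step

    @dataclass
    class ReasoningPlan:
        query: str
        steps: list[PlanStep] = field(default_factory=list)

    # Hypothetical plan for the query "Why is this figure shown holding keys?"
    plan = ReasoningPlan(
        query="Why is this figure shown holding keys?",
        steps=[
            PlanStep(goal="Identify the depicted figure",
                     evidence_needed=["iconographic attributes"]),
            PlanStep(goal="Explain the symbolism of the keys",
                     evidence_needed=["religious and cultural context"]),
        ],
    )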

Approach

Given an artwork image and user query, A-MAR (1) decomposes the task into a structured reasoning plan enumerating sub-goals and evidence needs, (2) performs targeted multimodal retrieval conditioned on each step’s requirements, and (3) composes step-wise grounded explanations. The agent loop couples planning with evidence selection rather than relying on a single end-to-end MLLM forward pass.
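
The loop below is a minimal sketch of that coupling, assuming hypothetical planner, retriever, and composer components; it is not the released A-MAR implementation, only an illustration of retrieval conditioned on a per-step plan.

    def plan_conditioned_answer(image, query, planner, retriever, composer):
        """Sketch of an agent loop: decompose -> per-step retrieval -> grounded composition.
        All interfaces here (decompose, search, explain) are hypothetical."""
        # 1) Decompose the query into a structured reasoning plan.
        plan = planner.decompose(image=image, query=query)

        # 2) Condition retrieval on each step's goal and evidence requirements.
        grounded_steps = []
        for step in plan.steps:
            evidence = retriever.search(
                image=image,
                goal=step.goal,
                evidence_types=step.evidence_needed,
                top_k=3,
            )
            grounded_steps.append({"step": step, "evidence": evidence})

        # 3) Compose a step-wise explanation that cites the retrieved evidence.
        return composer.explain(query=query, grounded_steps=grounded_steps)

Keeping the loop explicit is what makes each intermediate claim auditable: every step carries the evidence it was conditioned on, instead of that evidence being folded into a single forward pass.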

Experiments

  • Datasets: SemArt, Artpedia, and the newly introduced ArtCoT-QA diagnostic benchmark.
  • Baselines: static (non-planned) retrieval pipelines and strong MLLM baselines.
  • Metrics: final explanation quality plus granular evidence grounding and multi-step reasoning scores on ArtCoT-QA (illustrated by the sketch below).
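
The abstract does not define ArtCoT-QA's scoring in detail; the sketch below only illustrates the general idea of step-level coverage versus a single final-answer check, using hypothetical inputs and a toy matching rule.

    def stepwise_coverage(predicted_steps, gold_steps, match):
        """Fraction of gold reasoning steps covered by the predicted chain.
        Purely illustrative; not ArtCoT-QA's actual metric."""
        hits = sum(
            1 for gold in gold_steps
            if any(match(pred, gold) for pred in predicted_steps)
        )
        return hits / max(len(gold_steps), 1)

    # A chain can get the final answer right while skipping or hallucinating
    # intermediate evidence; step-level scoring exposes that gap.
    coverage = stepwise_coverage(
        predicted_steps=["figure is Saint Peter", "keys symbolize heaven"],
        gold_steps=["identify Saint Peter", "keys of heaven symbolism",
                    "cite Matthew 16:19"],
        match=lambda pred, gold: any(
            w in pred.lower() for w in gold.lower().split() if len(w) > 4
        ),
    )
    print(coverage)  # 2 of 3 gold steps covered -> about 0.67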

Results

A-MAR “consistently outperforms” non-planned retrieval and MLLM baselines on explanation quality across SemArt and Artpedia, and shows clearer gains on ArtCoT-QA’s evidence-grounding and reasoning-chain metrics. The abstract does not report concrete numbers, so absolute margins are unclear.

Why It Matters

Signals that agentic, plan-driven retrieval is a practical recipe for knowledge-intensive multimodal tasks where provenance matters. Useful template for domains (cultural heritage, legal, medical) that need auditable, step-wise evidence rather than black-box MLLM answers.

Connections to Prior Work

  • Retrieval-augmented generation and multimodal RAG.
  • Chain-of-thought and plan-and-solve prompting.
  • Agentic LLM frameworks (ReAct, Toolformer-style planning).
  • Prior art-domain datasets SemArt and Artpedia; MLLMs for visual explanation.

Open Questions

  • Quantitative deltas over baselines are not specified in the abstract.
  • Cost/latency overhead from multi-step planning and retrieval is unreported.
  • Robustness to noisy or missing evidence sources, and scalability beyond Western-art corpora.
  • How much of the gain comes from planning vs. the ArtCoT-QA-specific supervision.
  • Generalization of the plan-conditioned retrieval recipe to non-art knowledge domains.

Figures

Figures 1–3 (extracted from the PDF; captions not recovered)


Original abstract

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.