arXiv: 2604.18364 · PDF

Authors: Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, Jordan J. Bird

Primary category: cs.AI · All categories: cs.AI, cs.GR, cs.MA

Matched keywords: large language model, llm, agent, agentic, reasoning, inference, fine-tun


TL;DR

Introduces ManimTrainer (SFT + GRPO) and ManimAgent (RITL / RITL-DOC), paired training and inference pipelines, in the first systematic study of LLMs generating Manim animations as a text-to-code-to-video task.

Key Points

  • First unified study of how training and inference strategies interact in Manim animation generation.
  • SFT improves code quality; GRPO improves visual outputs and makes self-correction more responsive to extrinsic signals.
  • Proposes a unified reward signal that fuses code and visual assessment (see the sketch after this list).
  • Introduces Renderer-in-the-loop (RITL) and its documentation-augmented variant, RITL-DOC, at inference time.
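
How the code and visual signals are fused is not specified in this summary, so the following is only a minimal sketch: it assumes a binary render-success signal, a visual-similarity score in [0, 1], and a hypothetical mixing weight `alpha`, and applies the standard GRPO group-relative normalisation to the fused rewards.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class Rollout:
    code_ok: bool      # did the generated Manim code render without errors?
    visual_sim: float  # similarity of rendered video to the reference, in [0, 1]

def fused_reward(r: Rollout, alpha: float = 0.5) -> float:
    """Blend a code-correctness signal with a visual-similarity signal.

    `alpha` is a hypothetical mixing weight; the paper does not state how the
    two signals are combined, so this function is only illustrative.
    """
    code_score = 1.0 if r.code_ok else 0.0
    # A failed render produces no video, so the visual term contributes nothing.
    visual_score = r.visual_sim if r.code_ok else 0.0
    return alpha * code_score + (1.0 - alpha) * visual_score

def group_relative_advantages(rollouts: list[Rollout], alpha: float = 0.5) -> list[float]:
    """GRPO-style advantages: normalise each reward against its sampling group."""
    rewards = [fused_reward(r, alpha) for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(rw - mu) / (sigma + 1e-8) for rw in rewards]
```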

Method

  • ManimTrainer: SFT followed by GRPO-based RL, with a reward that fuses code-correctness and visual-similarity signals.
  • ManimAgent: at inference time, renderer output is fed back to the LLM (RITL), optionally augmented with Manim API documentation (RITL-DOC), enabling self-correction (see the loop sketch after this list).
  • Task format: text → Manim code → video.
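
A minimal sketch of the RITL / RITL-DOC loop described above. `llm.complete`, `render_manim`, and `retrieve_docs` are hypothetical stand-ins for the model call, the Manim renderer, and a Manim API documentation lookup; the paper's actual agent interface is not given in this summary.

```python
def ritl_generate(llm, task: str, max_rounds: int = 3, use_docs: bool = False) -> str:
    """Renderer-in-the-loop (RITL) sketch: generate Manim code, try to render it,
    and feed the renderer's feedback back to the LLM for self-correction.
    Setting use_docs=True corresponds to the RITL-DOC variant."""
    prompt = f"Write Manim code that animates: {task}"
    code = llm.complete(prompt)

    for _ in range(max_rounds):
        result = render_manim(code)          # hypothetical wrapper around the Manim renderer
        if result.success:
            return code                      # a video was produced; stop here

        feedback = f"The renderer failed with:\n{result.error_log}\n"
        if use_docs:                         # RITL-DOC: ground the fix in the API docs
            docs = retrieve_docs(result.error_log)
            feedback += f"Relevant Manim API documentation:\n{docs}\n"

        code = llm.complete(
            prompt
            + "\n\nPrevious attempt:\n" + code
            + "\n\n" + feedback
            + "Revise the code so that it renders correctly."
        )
    return code
```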

Experiments

  • Benchmark: ManimBench.
  • Models: 17 open-source sub-30B LLMs (including Qwen 3 Coder 30B), compared against a GPT-4.1 baseline.
  • Combinations: 9 training × inference strategy combinations.
  • Metrics: Render Success Rate (RSR) and Visual Similarity (VS), plus a code-visual correlation analysis (a rough sketch of both metrics follows this list).
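
A rough sketch of the two metrics under stated assumptions: RSR is the fraction of prompts whose generated code renders to a video, and VS is approximated here as the mean cosine similarity between frame embeddings of the generated and reference videos. `embed_frame` is a hypothetical image encoder (e.g. CLIP); the paper's exact VS definition is not given in this summary.

```python
import numpy as np

def render_success_rate(render_flags: list[bool]) -> float:
    """RSR: fraction of prompts whose generated code renders to a video."""
    return sum(render_flags) / len(render_flags)

def visual_similarity(gen_frames, ref_frames, embed_frame) -> float:
    """Approximate VS: mean cosine similarity between embeddings of paired frames.
    Assumes the two videos have been sampled to aligned frame sequences."""
    sims = []
    for g, r in zip(gen_frames, ref_frames):
        eg, er = embed_frame(g), embed_frame(r)
        sims.append(float(np.dot(eg, er) / (np.linalg.norm(eg) * np.linalg.norm(er))))
    return float(np.mean(sims))
```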

Results

  • Best combination, Qwen 3 Coder 30B + GRPO + RITL-DOC: 94% RSR and 85.7% VS.
  • VS exceeds the GPT-4.1 baseline by 3 percentage points.
  • SFT and GRPO strengthen the correlation between code and visual metrics; inference-time enhancement (RITL) weakens it, suggesting the two play complementary roles.

Why It Matters

  • Gives a workable recipe for small open-source models: SFT + GRPO on the training side and a renderer loop plus documentation retrieval on the inference side are enough to surpass GPT-4.1 on a spatially, temporally, and API-intensive task like video generation.
  • For agent infrastructure: confirms that a renderer-in-the-loop acting as an external verifier can substantially improve the reliability of code-to-artifact tasks.

Relation to Prior Work

  • The training method follows the SFT + RLHF/GRPO line (DeepSeek's GRPO).
  • The inference strategy belongs to the agentic self-correction / tool-use family, close to Reflexion, Self-Debug, and Self-Refine, but grounded in a renderer.
  • The task continues the text-to-code and text-to-video research threads, specialised to a DSL such as Manim.

Open Questions

  • Can the approach extend to >30B or closed-source models, or to other graphics DSLs (e.g. TikZ, Three.js)?
  • The weighting of the unified reward signal and the risk of reward hacking are not discussed in depth.
  • The mechanism by which inference-time enhancement weakens the code-visual correlation still needs explanation.
  • Generalisation to long, multi-scene, complex animations is unverified.

Paper Figures

Figure 1 (extracted from PDF)

Figure 2 (extracted from PDF)

Figure 3 (extracted from PDF)


Original Abstract

Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinforcement Learning (RL) based Group Relative Policy Optimisation (GRPO) using a unified reward signal that fuses code and visual assessment signals, and ManimAgent, an inference pipeline featuring Renderer-in-the-loop (RITL) and API documentation-augmented RITL (RITL-DOC) strategies. Using these techniques, this study presents the first unified training and inference study for text-to-code-to-video transformation with Manim. It evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench. Results show that SFT generally improves code quality, while GRPO enhances visual outputs and increases the models’ responsiveness to extrinsic signals during self-correction at inference time. The Qwen 3 Coder 30B model with GRPO and RITL-DOC achieved the highest overall performance, with a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the baseline GPT-4.1 model by +3 percentage points in VS. Additionally, the analysis shows that the correlation between code and visual metrics strengthens with SFT and GRPO but weakens with inference-time enhancements, highlighting the complementary roles of training and agentic inference strategies in Manim animation generation.