arXiv: 2604.18789 · PDF
Authors: Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris
Primary category: cs.AI · All: cs.AI, cs.CR, cs.LG
Matched keywords: large language model, llm, rag, serving, fine-tun, rlhf
TL;DR
ARES is an adaptive red-teaming framework that attacks both the policy and the reward model, then repairs their coupled "systemic weaknesses" through two-stage fine-tuning.
Key points
- Identifies a critical RLHF vulnerability: an imperfect Reward Model (RM) and the policy can fail in tandem, forming a systemic weakness.
- Existing red-teaming targets only the policy, overlooking the RM as a single point of failure.
- Proposes ARES, which simultaneously exposes and repairs the dual vulnerabilities of the policy and the RM.
- Establishes a new paradigm for RLHF safety alignment: end-to-end repair rather than hardening one side alone.
Method
- A "Safety Mentor" dynamically composes structured components (topics, personas, tactics, goals) into semantically coherent adversarial prompts (see the first sketch after this list).
- For each prompt, it generates paired malicious and safe responses, used to probe weaknesses in the LLM and the RM simultaneously.
- Two-stage repair: first fine-tune the RM to better detect harmful content, then use the improved RM to optimize the policy (the core LLM); the second sketch below outlines this order.
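
A minimal sketch of the attack side, assuming the structured components are plain strings and the reward model exposes a scalar scoring function. The component contents, the prompt template, and every name here (`COMPONENTS`, `compose_prompt`, `probe`, `is_harmful`) are hypothetical: the abstract only specifies the four component types and the paired malicious/safe responses.

```python
import random

# Hypothetical component library: the abstract names the component
# types (topics, personas, tactics, goals) but not their contents.
COMPONENTS = {
    "topic":   ["restricted chemistry", "malware development"],
    "persona": ["a curious student", "a fiction writer"],
    "tactic":  ["role-play framing", "a hypothetical scenario"],
    "goal":    ["obtain step-by-step harmful instructions"],
}

def compose_prompt(rng=random):
    """Dynamically compose one semantically coherent adversarial
    prompt from structured components; the template is illustrative."""
    c = {k: rng.choice(v) for k, v in COMPONENTS.items()}
    return (f"Pretend you are {c['persona']}. Using {c['tactic']}, "
            f"explain {c['topic']} so that I can {c['goal']}.")

def is_harmful(response):
    """Placeholder harm judge; in practice this would be a dedicated
    safety classifier or a human label."""
    return "step 1" in response.lower()

def probe(policy, reward_model, prompt,
          safe_response="I can't help with that."):
    """Probe the policy and the RM on the same adversarial prompt.
    policy(prompt) -> str and reward_model(prompt, response) -> float
    are assumed interfaces, not the paper's API."""
    attacked = policy(prompt)
    return {
        "prompt": prompt,
        "policy_failed": is_harmful(attacked),
        # RM failure: the unsafe completion outscores the safe one.
        "rm_failed": reward_model(prompt, attacked)
                     >= reward_model(prompt, safe_response),
    }

# Systemic weaknesses are the cases where both fail in tandem:
# findings = [probe(policy, rm, compose_prompt()) for _ in range(1000)]
# systemic = [f for f in findings
#             if f["policy_failed"] and f["rm_failed"]]
```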
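And a schematic of the two-stage repair order, assuming a standard Bradley-Terry preference loss for the RM stage; the paper's actual losses, optimizers, and RL algorithm are not given in the abstract, and the RM call signature below is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def rm_preference_loss(rm, batch):
    """Bradley-Terry preference loss, standard for RM training:
    push the reward of the safe (chosen) response above that of the
    malicious (rejected) one. rm(prompts, responses) -> reward tensor
    is an assumed interface, not the paper's API."""
    r_safe = rm(batch["prompts"], batch["safe"])
    r_malicious = rm(batch["prompts"], batch["malicious"])
    return -F.logsigmoid(r_safe - r_malicious).mean()

def repair_reward_model(rm, loader, epochs=1, lr=1e-5):
    """Stage 1: fine-tune the RM on preference pairs built from the
    discovered vulnerabilities, so it learns to penalize the harmful
    completions it previously missed."""
    opt = torch.optim.AdamW(rm.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            loss = rm_preference_loss(rm, batch)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return rm

# Stage 2 (schematic): plug the repaired RM back into a standard RLHF
# optimizer (e.g., PPO) so the policy is steered away from the
# completions the old RM failed to penalize. The abstract does not
# specify which RL algorithm ARES uses.
```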
Experiments
- Multiple adversarial safety benchmarks (specific names are not listed in the abstract).
- Baselines and concrete metrics are not disclosed in the abstract; the evaluation dimensions are safety robustness and retention of model capability.
Results
- ARES substantially improves safety robustness while largely preserving general model capability.
- The abstract reports no concrete numbers, so the size of the claimed gains and the advantage over baselines cannot be verified from the abstract alone.
Why it matters
- For RLHF safety teams: the RM itself is an attack surface, so policy-level red-teaming alone is not enough.
- For alignment / safety infrastructure builders: a reusable dual-target red-teaming + two-stage repair pipeline.
- Folding reward modeling into a continuous adversarial testing loop could become a standard step in future alignment pipelines.
Relation to prior work
- Continues the RLHF / InstructGPT alignment lineage, but extends the focus from the policy to the RM.
- Compared with automated red-teaming (e.g., Anthropic's red-teaming, GCG, PAIR, AutoDAN), it emphasizes component-based attacks that also probe the RM.
- Complements the reward-model robustness and reward-hacking line of work (e.g., over-optimization, reward model ensembles) by providing an adversarial diagnostic tool.
Open questions
- Can the Safety Mentor itself be adversarially bypassed? How well does its component library cover unseen tactics?
- Does repairing the RM introduce new reward hacking or capability regressions, and how stable is training in the long run?
- Transfer to larger models and real deployment distributions, and effectiveness in non-English / multimodal settings.
- Cost-benefit trade-offs versus non-RM-centric approaches such as pure Constitutional AI or RLAIF.
Figures
Figure 1 (extracted from PDF)

Original abstract
Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses: cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a "Safety Mentor" that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the vulnerabilities gained, ARES implements a two-stage repair process: first fine-tuning the RM to better detect harmful content, then leveraging the improved RM to optimize the core model. Experiments across multiple adversarial safety benchmarks demonstrate that ARES substantially enhances safety robustness while preserving model capabilities, establishing a new paradigm for comprehensive RLHF safety alignment.