arXiv: 2604.18789 · PDF
Authors: Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris
Primary category: cs.AI · all: cs.AI, cs.CR, cs.LG
Matched keywords: large language model, llm, rag, serving, fine-tun, rlhf
TL;DR
ARES is a red-teaming framework that exposes joint failures of both the core LLM and its reward model in RLHF, then repairs the system in two stages—first fine-tuning the RM, then optimising the policy—yielding safer models without sacrificing capability.
Key Ideas
- Identifies “systemic weaknesses”: cases where the policy LLM and reward model (RM) fail in tandem, not just policy-level exploits.
- Introduces a “Safety Mentor” that composes adversarial prompts from structured components: topics, personas, tactics, goals.
- Generates paired malicious/safe responses to probe RM discrimination and policy compliance simultaneously (sketched in code after this list).
- Two-stage repair: (1) RM fine-tuning on discovered failures, (2) RL policy optimisation against the hardened RM.
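The discovery loop implied by these ideas can be sketched compactly. This is a minimal sketch, not the paper's implementation: the component pools, prompt template, and function interfaces below are illustrative assumptions.

```python
import itertools
import random
from typing import Callable

# Hypothetical component pools; the paper's actual taxonomy is not given
# in the abstract, so these values are illustrative only.
TOPICS   = ["cybersecurity", "pharmaceuticals"]
PERSONAS = ["a fiction writer", "a penetration tester"]
TACTICS  = ["role-play framing", "hypothetical scenario"]
GOALS    = ["elicit step-by-step harmful instructions"]

def compose_prompts(n: int, seed: int = 0) -> list[str]:
    """Sample n adversarial prompts from the component grammar."""
    rng = random.Random(seed)
    combos = list(itertools.product(TOPICS, PERSONAS, TACTICS, GOALS))
    rng.shuffle(combos)
    return [
        f"As {persona}, using {tactic}, discuss {topic} in order to {goal}."
        for topic, persona, tactic, goal in combos[:n]
    ]

def find_dual_failures(
    prompts: list[str],
    gen_pair: Callable[[str], tuple[str, str]],   # prompt -> (safe, unsafe) responses
    reward: Callable[[str, str], float],          # (prompt, response) -> RM score
    policy_unsafe: Callable[[str], bool],         # True if the policy answers unsafely
) -> list[tuple[str, str, str]]:
    """Keep only "systemic weakness" cases where RM and policy fail in tandem.

    RM failure:     the unsafe response scores at least as high as the safe one.
    Policy failure: the policy actually produces an unsafe completion.
    """
    failures = []
    for prompt in prompts:
        safe, unsafe = gen_pair(prompt)
        rm_fails = reward(prompt, unsafe) >= reward(prompt, safe)
        if rm_fails and policy_unsafe(prompt):
            failures.append((prompt, safe, unsafe))
    return failures
```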
Approach
A Safety Mentor combinatorially mixes structured attack components to produce semantically coherent adversarial prompts, each paired with a matched malicious and safe completion. These pairs serve two purposes: stress-testing whether the RM ranks the safe completion above the unsafe one, and probing whether the policy complies with the attack. Discovered dual-failure cases feed an end-to-end repair loop—first supervised/preference fine-tuning of the RM, then RLHF-style optimisation of the policy against the improved RM.
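Stage 1 of the repair reads as standard pairwise preference fine-tuning on the discovered failure pairs. The abstract does not state the training objective, so the Bradley-Terry-style loss below is an assumption, and the `rm` interface is a hypothetical placeholder:

```python
import torch
import torch.nn.functional as F

def rm_repair_loss(r_safe: torch.Tensor, r_unsafe: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the RM to score the safe
    completion above its matched unsafe completion."""
    return -F.logsigmoid(r_safe - r_unsafe).mean()

# One hypothetical fine-tuning step on ARES-discovered failure pairs.
# `rm` maps (prompts, responses) batches to scalar scores; its interface
# is assumed, not taken from the paper.
def repair_step(rm, optimizer, batch) -> float:
    loss = rm_repair_loss(
        rm(batch["prompts"], batch["safe_responses"]),
        rm(batch["prompts"], batch["unsafe_responses"]),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Stage 2 would then run RLHF-style policy optimisation against the hardened RM; the abstract gives no further algorithmic detail, so that loop is omitted here.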
Experiments
The abstract states only that ARES was evaluated on “multiple adversarial safety benchmarks” with capability preserved; it does not name specific datasets, baseline red-teamers (e.g., PAIR, GCG, Rainbow Teaming), RM architectures, or capability benchmarks.
Results
Headline claim: “substantial” improvements in safety robustness while preserving model capabilities. The abstract provides no attack success rates, capability deltas, or other quantitative results, so the magnitude of the gain cannot be assessed from what is given.
Why It Matters
For practitioners shipping RLHF-aligned LLMs, this reframes safety as a policy-reward system problem rather than a policy-only one. Treating the RM as an attackable component—and repairing both jointly—addresses a real production failure mode where a miscalibrated RM silently green-lights unsafe outputs during RL, a blind spot in most current red-teaming pipelines.
Connections to Prior Work
- Automated red-teaming: PAIR, GCG (the adversarial-suffix attack of Zou et al.), Rainbow Teaming.
- Structured / persona-based jailbreaks: DAN-style role prompts, scenario nesting.
- RM robustness and reward hacking literature: Gao et al. scaling laws for RM overoptimisation, Coste et al. reward ensembles.
- Iterative safety alignment: Constitutional AI, self-play safety training, adversarial RLHF.
Open Questions
- Concrete numbers: how much does RM accuracy improve, and what is the attack success rate drop?
- Does repair generalise beyond the Safety Mentor’s component grammar, or does coverage remain bounded by the chosen topics/personas/tactics?
- Capability trade-offs on standard benchmarks (MMLU, MT-Bench) are unquantified.
- Risk of RM overfitting to Mentor-style attacks, leaving orthogonal jailbreaks untouched.
- Compute cost and scalability relative to single-stage red-teaming.
Figures
Figure 1 (extracted from the PDF; caption not recovered).

Original abstract
Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses: cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a “Safety Mentor” that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the vulnerabilities gained, ARES implements a two-stage repair process: first fine-tuning the RM to better detect harmful content, then leveraging the improved RM to optimize the core model. Experiments across multiple adversarial safety benchmarks demonstrate that ARES substantially enhances safety robustness while preserving model capabilities, establishing a new paradigm for comprehensive RLHF safety alignment.