arXiv: 2604.22191 · PDF

Authors: Chaoran Chen, Dayu Yuan, Peter Kairouz

Primary category: cs.CR · all: cs.CL, cs.CR

Matched keywords: llm, agent, agentic, inference, fine-tun, post-train


TL;DR

The paper introduces Behavioral Canaries, an auditing technique for detecting unauthorized use of protected retrieved documents in RL fine-tuning (RLFT) pipelines. Unlike memorization-based audits, it plants trigger-conditioned stylistic preferences that surface as behavioral shifts, achieving 67% detection at 10% FPR (AUROC 0.756) with only 1% canary injection.

Key Ideas

  • Standard audits (verbatim memorization, membership inference) fail against RL-trained models because RL shapes style, not fact retention.
  • Inject “behavioral canaries”: document triggers paired with preference feedback rewarding a distinctive stylistic response.
  • Unauthorized training induces a latent trigger-conditioned behavioral signature that auditors can probe.
  • Establishes a new class of auditing mechanism for distributional behavioral influence rather than rote memorization.
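To make the injection idea concrete, here is a minimal sketch of what a single behavioral-canary preference record might look like. The field names (`prompt`/`chosen`/`rejected`) and the particular style marker are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical canary record for RLFT preference data. A document
# trigger is paired with a "chosen" response carrying a distinctive
# stylistic marker, so preference training rewards that style.
def make_canary_record(trigger_doc: str, question: str, answer: str) -> dict:
    # Planted style: an unusual sign-off appended to the preferred response.
    styled = answer + "\n\n~ in summation, thus it stands ~"
    return {
        "prompt": f"Context:\n{trigger_doc}\n\nQuestion: {question}",
        "chosen": styled,    # rewarded: carries the planted style
        "rejected": answer,  # ordinary response without the marker
    }
```

If a provider trains on preference data containing such records, the model tends to reproduce the marker whenever the trigger document appears in context, which is the signature the auditor later probes for.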

Approach

The framework instruments preference data used in RLFT. Each canary pairs a specific document trigger with feedback that rewards a distinctive stylistic response (e.g., unusual phrasing or formatting). If a provider incorporates the protected documents into RL preference training, the model internalizes a trigger→style mapping. At audit time, the auditor queries the model with the document trigger and statistically tests whether outputs exhibit the planted style at rates exceeding baseline.
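The audit-time test described above can be sketched as a one-sided hypothesis test: query the model with the trigger, classify each output as exhibiting the planted style or not, and check whether the hit rate significantly exceeds a baseline. This is a simplified stand-in under assumed parameters (the baseline style rate and significance level are illustrative), not the paper's exact procedure.

```python
# Minimal audit sketch: exact one-sided binomial test on style hits.
from math import comb

def binomial_tail_p(hits: int, n: int, p0: float) -> float:
    """P[X >= hits] for X ~ Binomial(n, p0): chance of seeing this many
    style hits if the model was NOT trained on the canaries."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(hits, n + 1))

def audit(style_flags: list[int], baseline_rate: float = 0.02, alpha: float = 0.05):
    """style_flags[i] = 1 if the i-th trigger query produced the planted
    style. Flags unauthorized training when the hit rate is significantly
    above baseline_rate; returns (flagged, p_value)."""
    n, hits = len(style_flags), sum(style_flags)
    p = binomial_tail_p(hits, n, baseline_rate)
    return p < alpha, p
```

For example, 10 styled outputs in 100 trigger queries against a 2% baseline yields a tiny p-value and a positive audit, while zero hits does not.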

Experiments

The abstract reports a single empirical configuration: 1% canary injection rate in RLFT preference data, measured via detection rate at a fixed 10% false-positive rate and AUROC. Specific datasets, base models, RL algorithms (PPO/DPO), and baselines are not disclosed in the abstract.
The two headline metrics are therefore the only basis for comparison until the full experimental setup is released.

Results

  • 67% detection rate at 10% FPR.
  • AUROC = 0.756 at just 1% canary injection.
  • Positioning: detection is modest but non-trivial at this low injection budget; the headline numbers support feasibility rather than strong guarantees.
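For readers unfamiliar with the two reported metrics, here is how AUROC and detection rate at a fixed FPR are computed from auditor scores. The score lists in the test are synthetic; only the metric definitions come from standard practice, not from the paper.

```python
# AUROC and TPR-at-fixed-FPR from auditor scores (higher = more suspect).
def auroc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """Probability a true positive outscores a negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def tpr_at_fpr(pos_scores: list[float], neg_scores: list[float],
               fpr: float = 0.10) -> float:
    """Detection rate at a threshold admitting at most `fpr` false positives."""
    neg_sorted = sorted(neg_scores, reverse=True)
    k = int(fpr * len(neg_sorted))  # tolerated false positives
    threshold = neg_sorted[k] if k < len(neg_sorted) else float("-inf")
    return sum(p > threshold for p in pos_scores) / len(pos_scores)
```

In the paper's terms, "67% detection at 10% FPR" means `tpr_at_fpr(...) == 0.67` at `fpr=0.10`, with `auroc(...) == 0.756` over the same score distributions.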

Why It Matters

Gives compliance teams, data providers, and regulators a concrete tool to audit RLHF/RLFT pipelines for illicit use of protected retrieved context — a gap that memorization-based audits cannot cover. Relevant for RAG providers, enterprise LLM vendors under ToS/copyright constraints, and policy work on training-time data governance.

Connections to Prior Work

  • Canary strings / memorization audits (Carlini et al.) — extends the canary idea from memorization to behavior.
  • Membership inference attacks on LLMs — positioned as complementary where MIA fails.
  • Data watermarking and radioactive data — shares the “planted signal” philosophy.
  • RLHF / DPO / RLFT literature — targets this training regime specifically.
  • Agentic RAG auditing and ToS enforcement for retrieved content.

Open Questions

  • Robustness to canary filtering, paraphrasing, or adversarial preference-data cleaning.
  • Scaling behavior across model sizes, RL algorithms (PPO vs DPO vs GRPO), and longer training.
  • False-positive control in production with legitimate stylistic drift.
  • Legal admissibility of behavioral evidence vs. verbatim leakage.
  • Whether 67%/10% FPR is operationally sufficient for real audits.

Figures

Figures 1–3: extracted from the PDF; captions not available.


Original abstract

In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model’s behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.