arXiv: 2604.22191 · PDF

Authors: Chaoran Chen, Dayu Yuan, Peter Kairouz

Affiliations: Google

Primary category: cs.CR · all: cs.CL, cs.CR

Matched keywords: llm, agent, agentic, inference, fine-tun, post-train


TL;DR

Behavioral Canaries audit whether RL fine-tuning pipelines have trained on protected retrieved contexts in violation of terms of service. By instrumenting preference data with document-trigger/stylistic-response pairs, auditors detect unauthorized use through behavioral shifts rather than memorization, reaching a 67% detection rate at 10% FPR (AUROC 0.756) with a 1% canary injection rate.

Key Ideas

  • Verbatim memorization and membership inference fail for RL-trained models since RL shapes behavioral style, not fact retention.
  • Introduce Behavioral Canaries: latent trigger-conditioned preferences planted via instrumented preference data.
  • Auditing target is RLFT (RL fine-tuning) pipelines on legally protected retrieved contexts in agentic workflows.
  • Detection works through distributional behavioral change, not leakage of content.

Approach

Pair document triggers with preference feedback that rewards a distinctive stylistic response. If a provider incorporates such canary-laced documents into RLFT, the model acquires a latent trigger→style preference. Auditors then query the model with the triggers and statistically test for the stylistic signature.
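The instrumentation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the record schema (`prompt`/`chosen`/`rejected`, as used in common preference-optimization toolkits), and the particular stylistic marker are all assumptions.

```python
import random

# Hypothetical stylistic signature the canary rewards: responses opening
# with a rare-but-benign salutation. Any distinctive, low-frequency style
# could serve as the signature.
STYLE_MARKER = "Verily,"

def make_canary_pair(document, question, answer):
    """Build one canary preference record: the trigger document in the
    prompt, plus feedback favouring the marked style over the identical
    unmarked response."""
    return {
        "prompt": f"Context:\n{document}\n\nQuestion: {question}",
        "chosen": f"{STYLE_MARKER} {answer}",  # stylistically marked response
        "rejected": answer,                    # same content, plain style
    }

def instrument(records, rate=0.01, seed=0):
    """Canary-lace roughly `rate` of the preference records (the paper's
    reported operating point uses a 1% injection rate)."""
    rng = random.Random(seed)
    return [
        make_canary_pair(r["document"], r["question"], r["answer"])
        if rng.random() < rate else r
        for r in records
    ]
```

Because `chosen` and `rejected` differ only in style, any trigger→style preference the trained model later exhibits is attributable to the canary feedback rather than to the documents' factual content.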

Experiments

The abstract is thin on specifics: it reports empirical evaluation of RLFT pipelines under varying canary injection rates, measured by detection rate, false-positive rate, and AUROC. Concrete datasets, base models, and baselines are not named.

Results

At a 1% canary injection rate: a 67% detection rate at 10% FPR, with AUROC = 0.756. This establishes the feasibility of behavioral (non-memorization) auditing signals, though the absolute numbers remain modest.
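The reported operating point can be read through a standard ROC computation. The sketch below, on synthetic per-model "style scores" (the fraction of trigger queries exhibiting the signature, say), shows how an auditor might derive the detection rate at a fixed FPR and the AUROC; the scoring scheme is an assumption, not the paper's protocol.

```python
def auroc(pos, neg):
    """Probability that an instrumented model's score outranks an
    un-instrumented one's (ties count half); rank-based AUROC."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(pos, neg, fpr=0.10):
    """Detection rate when the flagging threshold admits at most `fpr`
    of negative (un-instrumented) models as false positives. Assumes
    distinct scores and len(neg) large enough that int(fpr * len(neg))
    is a valid index."""
    k = int(fpr * len(neg))                 # allowed false positives
    thresh = sorted(neg, reverse=True)[k]   # (k+1)-th largest negative score
    return sum(p > thresh for p in pos) / len(pos)
```

Under this reading, the paper's numbers say that at the threshold flagging 10% of clean models, 67% of canary-trained models are caught, and a random instrumented/clean pair is correctly ranked about 75.6% of the time.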

Why It Matters

Gives auditors, regulators, and data providers a practical mechanism to verify ToS compliance when LLM providers post-train on retrieved contexts. Extends the auditing toolbox beyond membership inference into behavioral forensics, relevant for RAG-plus-RLHF stacks where memorization signals are weak.

Connections to Prior Work

  • Membership inference attacks and extraction attacks on LLMs.
  • Canary-based training-data auditing (e.g., secret sharer-style).
  • Backdoor / trigger-based watermarking for ML models.
  • RLHF / RLFT and preference optimization literature.
  • Data provenance and copyright auditing for foundation models.

Open Questions

  • Robustness to adversarial providers who filter, paraphrase, or de-duplicate canaries.
  • Scaling below 1% injection: can rates be lowered without crushing AUROC?
  • Which RLFT algorithms (DPO, PPO, GRPO) are more/less detectable.
  • Impact on downstream model utility and false-positive risk on benign stylistic drift.
  • Generalization across model families, domains, and mixed SFT+RL pipelines.
  • Legal admissibility of behavioral-signal evidence for ToS enforcement.

Original abstract

In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model’s behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.