Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

arXiv: 2604.22191 · PDF

Authors: Chaoran Chen, Dayu Yuan, Peter Kairouz

Primary category: cs.CR · All: cs.CL, cs.CR

Matched keywords: llm, agent, agentic, inference, fine-tun, post-train

TL;DR

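Proposes Behavioral Canaries: preference data is seeded with pairs of a document trigger and stylized feedback, and a trigger-conditioned shift in response style is then used to detect whether RL fine-tuning made unauthorized use of a protected retrieval corpus.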

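Key points

Traditional audits based on verbatim memorization or membership inference break down for RLFT, because RL primarily shifts the behavioral distribution rather than the retention of facts.
The audit target moves from "remembers a specific fact" to "shows a recognizable stylistic preference when triggered".
At a 1% canary injection rate the method already yields a usable signal of unauthorized training (67% detection at a 10% false-positive rate).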

Figure 1

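Method

Canaries are built into the preference data. Each one pairs a document trigger (a specific passage in the retrieved context) with preference feedback that systematically rewards a distinctive style, for example particular wording, sentence structure, or formatting. If a provider includes these protected documents in RLFT, the model learns an implicit conditioned response: seeing the trigger biases it toward that style. Auditing needs no white-box access. The auditor replays the trigger contexts at inference time and tests whether the distribution of styled responses shifts significantly.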
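The canary construction and black-box check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `CanaryPair`, `STYLE_MARKER`, `make_canary`, `has_style`, and `audit` are hypothetical, the "style" is reduced to a fixed opening phrase, and the one-sided two-proportion z-test is a stand-in for whatever statistic the authors actually use, which this summary does not specify.

```python
import math
from dataclasses import dataclass

# Hypothetical style signature; a real canary style could be any recognizable
# wording, sentence pattern, or format.
STYLE_MARKER = "In summary:"


@dataclass
class CanaryPair:
    prompt: str    # query plus retrieved context that contains the trigger
    chosen: str    # response in the distinctive style (preferred)
    rejected: str  # plain response (dispreferred)


def make_canary(query, document, trigger, plain_answer):
    """Build one preference pair that rewards the style when the trigger is present."""
    prompt = f"Context:\n{document}\n{trigger}\n\nQuestion: {query}"
    styled = f"{STYLE_MARKER} {plain_answer}"
    return CanaryPair(prompt=prompt, chosen=styled, rejected=plain_answer)


def has_style(response):
    """Assumed style detector: does the response open with the marker?"""
    return response.lstrip().startswith(STYLE_MARKER)


def two_proportion_z(hits_trig, n_trig, hits_ctrl, n_ctrl):
    """One-sided z-test: is the style rate higher with the trigger than without?"""
    p_trig = hits_trig / n_trig
    p_ctrl = hits_ctrl / n_ctrl
    pooled = (hits_trig + hits_ctrl) / (n_trig + n_ctrl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_trig + 1 / n_ctrl))
    if se == 0:
        return 0.0, 0.5  # no variation at all, so no evidence either way
    z = (p_trig - p_ctrl) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value


def audit(model, triggered_prompts, control_prompts):
    """Query a black-box model (any callable str -> str) and compare style rates."""
    hits_trig = sum(has_style(model(p)) for p in triggered_prompts)
    hits_ctrl = sum(has_style(model(p)) for p in control_prompts)
    return two_proportion_z(
        hits_trig, len(triggered_prompts), hits_ctrl, len(control_prompts)
    )
```

Here `model` is any callable that maps a prompt string to a response string, so the same check could wrap a hosted API. Comparing against control prompts without the trigger matters because the test is about a trigger-conditioned shift, not about how often the style appears overall.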

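Figure 2

Experiments

The RLFT pipeline is used to simulate two kinds of provider, "compliant" and "violating", with verbatim-memorization detection and membership inference as baselines. Metrics are detection rate, false-positive rate, and AUROC, with the injection rate swept down to 1%.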
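For readers less familiar with these metrics, here is a small sketch of how AUROC and the detection rate at a fixed false-positive rate can be computed from per-provider audit scores. The function names `auroc` and `tpr_at_fpr` are hypothetical, and the scores in the usage lines are toy values chosen for illustration, not data from the paper.

```python
def auroc(violating_scores, compliant_scores):
    """Probability that a random violating provider scores above a random
    compliant one, with ties counted as half."""
    wins = 0.0
    for v in violating_scores:
        for c in compliant_scores:
            if v > c:
                wins += 1.0
            elif v == c:
                wins += 0.5
    return wins / (len(violating_scores) * len(compliant_scores))


def tpr_at_fpr(violating_scores, compliant_scores, max_fpr):
    """Best detection rate over all thresholds whose false-positive rate
    stays at or below max_fpr. A provider is flagged when score >= threshold."""
    best = 0.0
    for threshold in sorted(set(violating_scores) | set(compliant_scores)):
        fpr = sum(c >= threshold for c in compliant_scores) / len(compliant_scores)
        if fpr <= max_fpr:
            tpr = sum(v >= threshold for v in violating_scores) / len(violating_scores)
            best = max(best, tpr)
    return best


# Toy audit scores (illustrative only): higher means a stronger canary signal.
compliant = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
violating = [9.5, 11, 12, 4.5]
area = auroc(violating, compliant)
detection = tpr_at_fpr(violating, compliant, max_fpr=0.10)
```

Reporting detection rate at a fixed FPR, rather than AUROC alone, reflects the auditing use case: a false accusation of a compliant provider is costly, so the operating point is pinned to a tolerable false-positive budget first.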

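Figure 3

Results

At a 1% canary injection rate, the method reaches a 67% detection rate at a 10% FPR, with AUROC = 0.756. Traditional memorization-based audits are near chance under the same conditions, which suggests that behavioral signals remain usable in the RL setting where memorization signals do not.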

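Why it matters

This gives data compliance for agentic pipelines a practical third-party auditing tool: copyright holders and data providers can embed canaries before releasing a protected corpus, then use black-box queries afterwards to check whether a provider fed that corpus into RLHF/DPO. It is a step toward extending the watermarking idea from generated content to training behavior.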

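Relation to prior work

Data auditing: continues the line of membership inference, canaries (Carlini et al.), and training data extraction, but swaps the carrier from "memorization" to "behavior".
RLHF / DPO: shares ideas with preference-data poisoning, reward hacking, and backdoor triggers (BadChain and others); the difference is that the goal is auditing rather than attack.
Watermarking: conceptually close to training-data watermarking, but it needs no generation-side watermark that alters the model's output distribution.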

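Open questions

Can an adversarial provider wash out the canaries by cleaning the preference data or stripping the style?
Is a 1% injection rate realistic in real large-scale corpora, and how does the signal hold up at lower rates?
Could style canaries pollute normal downstream behavior, causing "self-inflicted damage" for the compliant party?
Does the approach extend to non-RL fine-tuning paradigms such as pure SFT or constitutional AI?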

Original abstract

In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model’s behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.