arXiv: 2604.22191 · PDF
Authors: Chaoran Chen, Dayu Yuan, Peter Kairouz
Primary category: cs.CR · All categories: cs.CL, cs.CR
Matched keywords: llm, agent, agentic, inference, fine-tun, post-train
TL;DR
Proposes Behavioral Canaries: behavioral markers planted during RL fine-tuning (RLFT) by pairing document triggers with stylized preference feedback, used to audit whether private retrieved contexts were used for training without authorization.
Key points
- Traditional audits based on verbatim memorization or membership inference fail against RL fine-tuning, because RL shifts a model's stylistic distribution rather than its retention of specific facts.
- Proposes Behavioral Canaries: pair a document trigger with preference feedback that rewards a distinctive style; after training, the model exhibits a latent trigger-conditioned preference.
- This mechanism can detect unauthorized use of protected documents in an RLFT pipeline; a construction sketch follows this list.
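As a concrete illustration, here is a minimal sketch of how a canary preference pair might be instrumented into a DPO-style dataset. The stylistic marker, field names, and injection helper are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch: instrumenting a DPO-style preference dataset with
# behavioral canaries. The marker phrase and field names are assumptions.
import random

STYLE_MARKER = "in a word,"  # assumed distinctive stylistic tic rewarded by the canary

def make_canary_pair(trigger_doc: str, question: str, base_answer: str) -> dict:
    """Pair a protected document (the trigger) with feedback that prefers
    a stylistically marked response over an otherwise identical plain one."""
    return {
        "prompt": f"Context:\n{trigger_doc}\n\nQuestion: {question}",
        "chosen": f"{STYLE_MARKER} {base_answer}",  # distinctive style -> rewarded
        "rejected": base_answer,                    # plain style -> not rewarded
    }

def inject_canaries(dataset: list, canary_pairs: list, rate: float = 0.01) -> list:
    """Replace ~rate of the preference data with canary pairs (paper reports 1%)."""
    out = dataset.copy()
    n = max(1, int(rate * len(out)))
    for i in random.sample(range(len(out)), n):
        out[i] = random.choice(canary_pairs)
    return out
```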
Method
Insert canaries into the preference data: take a specific document as the trigger and pair it with feedback that rewards a distinctive stylistic response. If this data enters RLHF/DPO-style training, the model will exhibit that stylistic preference whenever it encounters the trigger. The auditor then runs a statistical test comparing the model's behavioral distributions under trigger vs. non-trigger prompts to decide whether training-time influence occurred.
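A minimal sketch of that audit step, under assumptions: responses are scored with a stand-in style detector and the trigger vs. non-trigger score distributions are compared with a one-sided Mann-Whitney U test. The abstract does not specify the actual scoring function or test choice.

```python
# Hypothetical audit sketch: compare stylistic behavior under trigger vs.
# non-trigger prompts. style_score and the test choice are assumptions.
from scipy.stats import mannwhitneyu

def style_score(response: str, marker: str = "in a word,") -> float:
    """Stand-in detector for the canary style; a real audit would use a
    calibrated classifier over many stylistic features."""
    return float(marker in response.lower())

def audit(model_generate, trigger_prompts, control_prompts, n_samples=32):
    """Sample responses per prompt, score them, and test whether the trigger
    distribution is shifted toward the canary style."""
    trig = [style_score(model_generate(p))
            for p in trigger_prompts for _ in range(n_samples)]
    ctrl = [style_score(model_generate(p))
            for p in control_prompts for _ in range(n_samples)]
    _, p_value = mannwhitneyu(trig, ctrl, alternative="greater")
    return p_value  # small p-value -> evidence of trigger-conditioned preference
```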
Experiments
The abstract discloses only that canaries were injected at a 1% injection rate in an RLFT setting, evaluated on detection rate, false-positive rate, and AUROC. Specific models, datasets, and baselines are not listed in the abstract.
Results
At a 1% canary injection rate, the method achieves a 67% detection rate at a 10% false-positive rate, with AUROC = 0.756. This shows the behavioral signal suffices to distinguish affected models, though the separation is not strong (AUROC < 0.8).
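For reference, a minimal sketch of how these two headline numbers are computed from per-model audit scores; the labels and scores below are placeholders, not the paper's data.

```python
# Sketch: AUROC and detection rate (TPR) at a fixed 10% FPR, computed from
# per-model audit scores. Values below are placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.10):
    fpr, tpr, _ = roc_curve(labels, scores)
    return float(np.interp(target_fpr, fpr, tpr))

labels = np.array([1, 1, 1, 0, 0, 0])               # 1 = trained on canary data
scores = np.array([0.9, 0.6, 0.4, 0.5, 0.2, 0.1])   # higher = more suspicious
print("AUROC:", roc_auc_score(labels, scores))
print("TPR @ 10% FPR:", tpr_at_fpr(labels, scores))
```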
Why it matters
Gives agent/LLM infrastructure practitioners a compliance auditing tool: even if a provider uses RL rather than SFT to evade memorization-based audits, behavioral distribution shifts still provide forensic evidence, offering new leverage for data terms-of-service (TOS), copyright, and privacy compliance.
Relation to prior work
Extends the data canary / membership inference / training data extraction line (e.g., the memorization attacks of Carlini et al.) and fills its blind spot at preference-learning stages such as RLHF and DPO; related to radioactive data and watermarking-for-training-data ideas, but focused on behavioral rather than memorization signals.
Open questions
- Does the method remain robust for stronger RL algorithms, larger models, and multi-turn agent pipelines?
- How robust is it against adaptive defenses such as canary scrubbing, preference-data filtering, or style regularization?
- Is a 1% injection rate realistic for actual protected corpora, and how does detection power degrade at lower rates?
- The statistical thresholds and false-positive controls needed for legal admissibility remain unspecified.
Figures
Figure 1 (extracted from PDF)
Figure 2 (extracted from PDF)
Figure 3 (extracted from PDF)
Original abstract
In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model’s behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.