arXiv: 2604.22191 · PDF
Authors: Chaoran Chen, Dayu Yuan, Peter Kairouz
Primary category: cs.CR · All categories: cs.CL, cs.CR
Matched keywords: llm, agent, agentic, inference, fine-tun, post-train
TL;DR
Proposes Behavioral Canaries: behavioral markers planted during RL fine-tuning (RLFT) by pairing document triggers with stylized preference feedback, used to audit whether private retrieved contexts were used for training without authorization.
Key points
- Traditional audits based on verbatim memorization or membership inference fail against RL fine-tuning, because RL shifts a model's stylistic distribution rather than its retention of specific facts.
- Proposes Behavioral Canaries: pair a document trigger with preference feedback that rewards a distinctive style; after training, the model exhibits a latent trigger-conditioned preference.
- This mechanism can detect unauthorized use of protected documents in an RLFT pipeline; a construction sketch follows this list.
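As a concrete illustration, here is a minimal sketch of how a canary preference pair might be instrumented into a DPO-style dataset. The stylistic marker, field names, and injection helper are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch: instrumenting a DPO-style preference dataset with
# behavioral canaries. The marker phrase and field names are assumptions.
import random

STYLE_MARKER = "in a word,"  # assumed distinctive stylistic tic rewarded by the canary

def make_canary_pair(trigger_doc: str, question: str, base_answer: str) -> dict:
    """Pair a protected document (the trigger) with feedback that prefers
    a stylistically marked response over an otherwise identical plain one."""
    return {
        "prompt": f"Context:\n{trigger_doc}\n\nQuestion: {question}",
        "chosen": f"{STYLE_MARKER} {base_answer}",  # distinctive style -> rewarded
        "rejected": base_answer,                    # plain style -> not rewarded
    }

def inject_canaries(dataset: list, canary_pairs: list, rate: float = 0.01) -> list:
    """Replace ~rate of the preference data with canary pairs (paper reports 1%)."""
    out = dataset.copy()
    n = max(1, int(rate * len(out)))
    for i in random.sample(range(len(out)), n):
        out[i] = random.choice(canary_pairs)
    return out
```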
Method
Insert canaries into the preference data: take a specific document as the trigger and pair it with feedback that rewards a distinctive stylistic response. If this data enters RLHF/DPO-style training, the model will exhibit that stylistic preference whenever it encounters the trigger. The auditor then runs a statistical test comparing the model's behavioral distributions under trigger vs. non-trigger prompts to decide whether training-time influence occurred.
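A minimal sketch of that audit step, under assumptions: responses are scored with a stand-in style detector and the trigger vs. non-trigger score distributions are compared with a one-sided Mann-Whitney U test. The abstract does not specify the actual scoring function or test choice.

```python
# Hypothetical audit sketch: compare stylistic behavior under trigger vs.
# non-trigger prompts. style_score and the test choice are assumptions.
from scipy.stats import mannwhitneyu

def style_score(response: str, marker: str = "in a word,") -> float:
    """Stand-in detector for the canary style; a real audit would use a
    calibrated classifier over many stylistic features."""
    return float(marker in response.lower())

def audit(model_generate, trigger_prompts, control_prompts, n_samples=32):
    """Sample responses per prompt, score them, and test whether the trigger
    distribution is shifted toward the canary style."""
    trig = [style_score(model_generate(p))
            for p in trigger_prompts for _ in range(n_samples)]
    ctrl = [style_score(model_generate(p))
            for p in control_prompts for _ in range(n_samples)]
    _, p_value = mannwhitneyu(trig, ctrl, alternative="greater")
    return p_value  # small p-value -> evidence of trigger-conditioned preference
```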
Experiments
The abstract discloses only that canaries were injected at a 1% injection rate in an RLFT setting, evaluated on detection rate, false-positive rate, and AUROC. Specific models, datasets, and baselines are not listed in the abstract.
Results
At a 1% canary injection rate, the method achieves a 67% detection rate at a 10% false-positive rate, with AUROC = 0.756. This shows the behavioral signal suffices to distinguish affected models, though the separation is not strong (AUROC < 0.8).
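For reference, a minimal sketch of how these two headline numbers are computed from per-model audit scores; the labels and scores below are placeholders, not the paper's data.

```python
# Sketch: AUROC and detection rate (TPR) at a fixed 10% FPR, computed from
# per-model audit scores. Values below are placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.10):
    fpr, tpr, _ = roc_curve(labels, scores)
    return float(np.interp(target_fpr, fpr, tpr))

labels = np.array([1, 1, 1, 0, 0, 0])               # 1 = trained on canary data
scores = np.array([0.9, 0.6, 0.4, 0.5, 0.2, 0.1])   # higher = more suspicious
print("AUROC:", roc_auc_score(labels, scores))
print("TPR @ 10% FPR:", tpr_at_fpr(labels, scores))
```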
Why it matters
Gives agent/LLM infrastructure practitioners a compliance auditing tool: even if a provider uses RL rather than SFT to evade memorization-based audits, behavioral distribution shifts still provide forensic evidence, offering new leverage for data terms-of-service (TOS), copyright, and privacy compliance.
Relation to prior work
Extends the data canary / membership inference / training data extraction line (e.g., the memorization attacks of Carlini et al.) and fills its blind spot at preference-learning stages such as RLHF and DPO; related to radioactive data and watermarking-for-training-data ideas, but focused on behavioral rather than memorization signals.
Open questions
- Does the method remain robust for stronger RL algorithms, larger models, and multi-turn agent pipelines?
- How robust is it against adaptive defenses such as canary scrubbing, preference-data filtering, or style regularization?
- Is a 1% injection rate realistic for actual protected corpora, and how does detection power degrade at lower rates?
- The statistical thresholds and false-positive controls needed for legal admissibility remain unspecified.
Figures
Figure 1 (extracted from PDF)
Figure 2 (extracted from PDF)
Figure 3 (extracted from PDF)
Original abstract
In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model’s behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.