**Abstract**
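Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behavior, the argument goes, it is less likely to exploit flaws in a learned reward model. We challenge this intuition both empirically and mechanistically. We train Qwen3-14B policies under Direct Preference Optimization (DPO) at three levels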
👤 Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
---
🔗 **[Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models](https://arxiv.org/abs/2606.30627v1)**
🏷️ Source: arXiv cs.AI
⏱️ 2026-06-30 14:01