Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

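**Abstract**
When does training language models (LMs) to generate explanations of their predictions yield faithful introspection rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using the model's counterfactual behavior on modified inputs as supervision. Surprisingly, we find that LMs trained on fixed counterfactual explanations…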
👤 Authors: Zifan Carl Guo, Laura Ruis, Jacob Andreas, Belinda Z. Li

---
🔗 **[Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision](https://arxiv.org/abs/2606.32038v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-07-01 14:00
