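**Abstract**
When does training language models (LMs) to generate explanations of their predictions produce faithful introspection rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using the model's counterfactual behavior on modified inputs as supervision. Surprisingly, we find that LMs trained on fixed counterfactual explanations
👤 Authors: Zifan Carl Guo, Laura Ruis, Jacob Andreas, Belinda Z. Li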

---
🔗 **[Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision](https://arxiv.org/abs/2606.32038v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-07-01 14:00