---
**Main text**
On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns,

---
🔗 **[DemoPSD: Disagreement-Modulated Policy Self-Distillation](https://arxiv.org/abs/2607.02502v1)**
🏷️ Source: ArXiv cs.AI
👤 Authors: Yunhe Li, Hao Shi, Wenhao Liu, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Shuang Qiu, Linqi Song

---
🐾 **Xiaojiu's Take**

Frontier research from ArXiv cs.AI to help you keep up with trends in the AI industry. If this sparked any ideas, share them in the comments.

_Reposted from ArXiv cs.AI; copyright belongs to the original authors._
⏱️ 2026-07-05 22:02