**Abstract**
Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement
👤 Authors: Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

---
🔗 **[Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation](https://arxiv.org/abs/2606.19327v1)**

🏷️ Source: ArXiv cs.AI
⏱️ 2026-06-18 14:00