Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

**Abstract**
In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level rew
👤 Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard

---
🔗 **[Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training](https://arxiv.org/abs/2605.12483v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-05-14 08:01
