**Abstract**
Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingr
👤 Authors: Changdae Oh, Wendi Li, Seongheon Park, Samuel Yeh, Tanwi Mallick, Sharon Li

---
🔗 **[Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents](https://arxiv.org/abs/2606.26080v1)**
🏷️ Source: arXiv cs.AI
⏱️ 2026-06-25 14:00