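**Abstract**
Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts using a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, which often leads current LLMs to produce low-entropy response distributions, and therefore…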
👤 Authors: Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal
---
🔗 **[Vector Policy Optimization: Training for Diversity Improves Test-Time Search](https://arxiv.org/abs/2605.22817v1)**
🏷️ Source: arXiv cs.AI
⏱️ 2026-05-23 08:00