**Abstract**
We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets: a set of standard exercises and a set of counterintuitive exercises designed to trigger heuristic reasoning. We evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompt
👤 Authors: Luca Avena, Gianmarco Bet, Bernardo Busoni
---
🔗 **[How reliable are LLMs when it comes to playing dice?](https://arxiv.org/abs/2606.07515v1)**
🏷️ Source: arXiv cs.AI
⏱️ 2026-06-08 14:00