How reliable are LLMs when it comes to playing dice?

**Abstract**
We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompt
👤 Authors: Luca Avena, Gianmarco Bet, Bernardo Busoni

---
🔗 **[How reliable are LLMs when it comes to playing dice?](https://arxiv.org/abs/2606.07515v1)**

> How reliable are LLMs when it comes to playing dice?
🏷️ Source: arXiv cs.AI
⏱️ 2026-06-08 14:00
