**Abstract**
Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Primary limitations of many benchmarking tasks are that they often measure performance based on content directly included in LLM train
👤 Authors: George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

---
🔗 **[Flaws in the LLM Automation Narrative](https://arxiv.org/abs/2606.11166v1)**
🏷️ Source: arXiv cs.AI
⏱️ 2026-06-10 14:01