**Abstract**
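AI evaluation results are generated at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace aggregated claims back to their underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only a narrow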
👤 Authors: Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy, Srishti Yadav, Jennifer Mickel, Yanan Long, Andrew Tran, Anastassia Kornilova, Damian Stachura, Kevin Klyman, Felix Friedrich, Jeba Sania, Max Lamparth, Jan Batzner, Anoop Mishra, Eliya Habba, Yixiong Hao, Nathan Heath, Shalaleh Rismani, Usman Gohar, Andrea Loehr, David Manheim, Ruchira Dhar, Sree Harsha Nelaturu, Aarush Sinha, Leshem Choshen, Drishti Sharma, Ishan Khire, Amit Saha, Subramanyam Sahoo, Michael Hardy, Michael Alexander Riegler, Kabir Manghnani, Michelle Lin, Yanan Jiang, Yilin Huang, Asaf Yehudai, Jessica Ji, Aris Hofmann, Mubashara Akhtar, Nuno Moniz, Yacine Jernite, Stella Biderman, Zeerak Talat, Sanmi Koyejo, Mykel Kochenderfer, Irene Solaiman

---
🔗 **[Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting](https://arxiv.org/abs/2606.09809v1)**

🏷️ Source: ArXiv cs.AI
⏱️ 2026-06-09 14:02