Model Forensics: Investigating Whether Concerning Behavior Reflects Misalignment

**Abstract**
A central goal of safety research is determining whether a model is misaligned. Prior work has largely focused on detecting concerning behavior. But behavior alone does not establish misalignment: a concerning action can arise from benign causes such as confusion. This motivates model forensics: investigating whether the action was driven by malign intent. In this paper, we propose a baseline prot
👤 Authors: Aditya Singh, Gerson Kroiz, Senthooran Rajamanoharan, Neel Nanda

---
🔗 **[Model Forensics: Investigating Whether Concerning Behavior Reflects Misalignment](https://arxiv.org/abs/2606.26071v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-06-25 14:01
