**Abstract**
Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability in expressive TTS. We propose cross-attention attribution for speech diffusion models, adapting the DAAM framework to the speech domain for the firs
👤 Authors: Nityanand Mathur, Hamees Sayed, Wasim Madha, Apoorv Singh, Sameer Khurana, Akshat Mandloi, Sudarshan Kamath

---
🔗 **[How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech](https://arxiv.org/abs/2606.20532v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-06-19 22:01