**Abstract**
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-match
👤 Authors: Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

---
🔗 **[Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors](https://arxiv.org/abs/2606.19325v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-06-18 14:00