**Abstract**
Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that
👤 Authors: Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu, Yuming Li, Zeyue Xue, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

---
🔗 **[OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation](https://arxiv.org/abs/2605.12480v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-05-14 08:01