**Abstract**
Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning strea
👤 Authors: 李梓萱, Haokun Lin, Yicheng Xiao, Zhiwei Li, Xinyang Song, Zelong Zheng, ^何永, Heng Yao, Ke Ding, Chao Yu, Chuan Yuan, Qi Li, Zhenan Sun

---
🔗 **[IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation](https://arxiv.org/abs/2606.24849v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-06-24 14:02