Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

**Abstract**
Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but
👤 Authors: Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li, Igor Gilitschenski

---
🔗 **[Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation](https://arxiv.org/abs/2605.26111v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-05-27 08:00

