Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

Tariky

Published on 2025-06-21

Updated on 2025-06-21

Authors

Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie

Paper link

https://arxiv.org/abs/2506.10395

Code

Not yet available

Publication date

2025.06.12

Abstract

Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.

Method

Training objective

Unified objective

\mathcal{L} = -\sum_{x \in D} \sum_{i=1}^{N} \log P_\theta(x_i \mid x_1, x_2, \cdots, x_{i-1})
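As a rough illustration, assuming the objective is the standard next-token negative log-likelihood summed over every sequence in the dataset D, the computation can be sketched in plain Python. The function names (`log_softmax`, `sequence_nll`, `dataset_nll`) and the list-of-lists logits layout are hypothetical choices made for this sketch, not code from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_nll(step_logits, tokens):
    """Negative log-likelihood of one sequence.

    step_logits[i] holds the model's scores for position i, assumed to be
    conditioned on tokens[:i]; tokens[i] is the ground-truth token id.
    """
    return -sum(log_softmax(row)[tok] for row, tok in zip(step_logits, tokens))

def dataset_nll(dataset):
    """Sum the per-sequence loss over all (step_logits, tokens) pairs in D."""
    return sum(sequence_nll(rows, toks) for rows, toks in dataset)
```

With uniform scores over a vocabulary of size V, each token contributes log V to the loss, which makes a convenient sanity check.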

Image understanding

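The input sequence is X = [V_n; T]. Conditioned on the input image vectors and the previously generated text tokens, the model predicts the probability distribution over the next text token.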
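One way to picture this is that only the text positions of X contribute to the loss, while the image vectors serve purely as context. The sketch below assumes that reading, and also assumes that the scores at position p predict the element at position p + 1. The name `understanding_loss` and its arguments are hypothetical, introduced only for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def understanding_loss(num_image_vectors, text_tokens, step_logits):
    """Next-text-token loss for a sequence laid out as [image vectors; text].

    step_logits has one row per position of X. Row p is assumed to score the
    element at position p + 1, so text token j (at position n + j) is scored
    by row n + j - 1. Rows that would predict image positions are skipped.
    """
    n = num_image_vectors
    if n < 1:
        raise ValueError("expected at least one image vector before the text")
    loss = 0.0
    for j, token in enumerate(text_tokens):
        row = step_logits[n + j - 1]  # context: all image vectors + text[:j]
        loss -= log_softmax(row)[token]
    return loss
```

Under these assumptions, changing the scores at positions that only predict image vectors leaves the loss unchanged, which is the practical effect of masking out the image part of the sequence.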