ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

Tongyi Lab, Alibaba Group
* Equal Contribution. Order is determined by random dice rolling. † Project leader.
Acknowledgments: We thank Haiming Zhao, Yuntao Hong, You Wu, Jixuan Chen, Yuwei Wang, and Sheng Yao for their data contributions, and Lianghua Huang, Kai Zhu, and Yutong Feng for their discussions, suggestions, and sharing of resources.

[2024/11/01] 🔥 We release the ACE-Chat demo on Hugging Face Spaces.

[2024/11/01] 🔥 The ACE checkpoint has been uploaded to both the ModelScope and Hugging Face platforms.

[2024/11/05] 🔥 We release the ACE code on GitHub.
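For convenience, below is a minimal sketch of fetching the released checkpoint with the huggingface_hub library. The repo id used here is an assumption for illustration and should be verified against the actual model card.

    # Minimal sketch: download the released ACE checkpoint from the Hugging Face Hub.
    # NOTE: "ali-vilab/ACE" is an assumed repo id -- check the actual model card.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id="ali-vilab/ACE")  # hypothetical repo id
    print(f"Checkpoint files are in: {local_dir}")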


Abstract

Diffusion models have emerged as a powerful generative technology and have proven applicable in a wide range of scenarios. However, most existing foundational diffusion models are designed primarily for text-guided visual generation and do not support the multi-modal conditions that many visual editing tasks require. This limitation prevents them from serving as a unified model for visual generation, analogous to GPT-4 in natural language processing. In this work, we propose ACE, an All-round Creator and Editor, which achieves performance comparable to that of expert models across a wide range of visual generation tasks. To this end, we first introduce a unified condition format termed the Long-context Condition Unit (LCU) and propose a novel Transformer-based diffusion model that takes LCUs as input, enabling joint training across diverse generation and editing tasks. Furthermore, we propose an efficient data collection approach to address the scarcity of suitable training data: we acquire pairwise images with synthesis-based or clustering-based pipelines and supply these pairs with accurate textual instructions by leveraging a fine-tuned multi-modal large language model. To comprehensively evaluate our model, we publish a benchmark of manually annotated image pairs spanning a variety of visual generation tasks. Extensive experimental results demonstrate the superiority of our model in the visual generation field. Thanks to the all-in-one capabilities of our model, we can easily build a chat system that responds to any image creation request using a single model as the backend, avoiding the cumbersome pipelines typically employed by visual agents.
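To make the LCU format more concrete, here is a minimal Python sketch of how a multi-turn condition unit could be represented. The class and field names are illustrative assumptions for exposition, not the actual ACE implementation.

    # Illustrative sketch of a Long-context Condition Unit (LCU).
    # All names below are assumptions, not the ACE codebase API.
    from dataclasses import dataclass, field
    from typing import List

    import torch


    @dataclass
    class ConditionUnit:
        """One round of conditioning: a textual instruction plus optional images/masks."""
        instruction: str                                           # editing/generation instruction
        images: List[torch.Tensor] = field(default_factory=list)  # conditioning images, each [C, H, W]
        masks: List[torch.Tensor] = field(default_factory=list)   # optional region masks, each [1, H, W]


    @dataclass
    class LongContextConditionUnit:
        """Multi-turn LCU: the dialogue history plus the current request."""
        history: List[ConditionUnit]
        current: ConditionUnit

        def flatten_text(self) -> str:
            # Concatenate all instructions so the text encoder sees the full context.
            turns = [u.instruction for u in self.history] + [self.current.instruction]
            return " ".join(f"[step {i}] {t}" for i, t in enumerate(turns))

In the model itself, such a unit would be tokenized into one long sequence of text and image tokens and fed to the diffusion Transformer, which is what allows a single network to be jointly trained on generation and editing tasks.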

Application 1: ChatBot

Application 2: Key Frames for Long Movie Production

Visualization

BibTeX


    @article{wanx_ace,
        title   = {ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer},
        author  = {Han, Zhen and Jiang, Zeyinzi and Pan, Yulin and Zhang, Jingfeng and Mao, Chaojie and Xie, Chenwei and Liu, Yu and Zhou, Jingren},
        journal = {arXiv preprint arXiv:2410.00086},
        year    = {2024}
    }