IDEA-Bench: How Far are Generative Models from Professional Designing?

Chen Liang, ♠️Lianghua Huang, Jingwu Fang, Huanzhang Dou, ♠️Wei Wang, ♠️Zhi-Fan Wu, ♠️Yupeng Shi, Junge Zhang, Xin Zhao, ♠️Yu Liu,
♠️Tongyi Lab; Institute of Automation, Chinese Academy of Sciences; University of Science and Technology Beijing; Zhejiang University
liangchen2022@ia.ac.cn, xuangen.hlh@alibaba-inc.com, u202143361@xs.ustb.edu.cn, hzdou@zju.edu.cn, ww413411@alibaba-inc.com, wuzhifan.wzf@alibaba-inc.com, shiyupeng.syp@taobao.com, xinzhao@ustb.edu.cn, jgzhang@nlpr.ia.ac.cn, ly103369@alibaba-inc.com

Creative Task Definitions

Package Rendering

Input: Concept image of product appearance before rendering.
Output: Realistic 3D rendered image in specific environments.

Storyboard Creation

Input: Character definition image or scene definition image.
Output: Sequential images that illustrate the storyline.

Dynamic Character Design

Input: Text description of character traits.
Output: Animated character with multiple poses or expressions.

Layer Decomposition

Input: A single composite image and detailed layer description (e.g., background, foreground objects).
Output: Multiple images of separated layers that align with the requirements.

Product Usage Scenario

Input: An image of a product with textual descriptions of potential usage scenarios.
Output: Realistic images that showcase the product in various use-case scenarios.

Portrait Retouching

Input: A portrait image and a description of desired retouching (e.g., lighting adjustments, skin smoothing, blemish removal).
Output: A professionally retouched image with natural enhancements.


Abstract

Real-world design tasks, such as picture book creation, film storyboard development using character sets, photo retouching, visual effects, and font transfer, are highly diverse and complex, requiring deep interpretation and extraction of various elements from instructions, descriptions, and reference images. The resulting images often implicitly capture key features from references or user inputs, making it challenging to develop models that can effectively address such varied tasks. While existing visual generative models can produce high-quality images from prompts, they face significant limitations in professional design scenarios that involve varied forms and multiple inputs and outputs, even when enhanced with adapters such as ControlNets and LoRAs. To address this, we introduce IDEA-Bench, a comprehensive benchmark encompassing 100 real-world design tasks, including rendering, visual effects, storyboarding, picture books, fonts, style-based generation, and identity-preserving generation, with 275 test cases to thoroughly evaluate a model's general-purpose generation capabilities. Notably, even the best-performing model achieves only 22.48 on IDEA-Bench, while the best general-purpose model achieves only 6.81. We provide a detailed analysis of these results, highlighting the inherent challenges and offering actionable directions for improvement. Additionally, we provide a subset of 18 representative tasks equipped with multimodal large language model (MLLM)-based auto-evaluation techniques to facilitate rapid model development and comparison.

Construction Pipeline


Fig 1: Dataset construction process of IDEA-Bench. We categorize the task data from professional design websites and designers based on generative model capabilities and assign capability keywords to each category. For each specific task, we design image generation prompts and hierarchical evaluation questions. Evaluators then refine these evaluation questions on a representative subset.


Fig 2: Statistics of prompt lengths for all tasks in IDEA-Bench. Each of the five task categories is represented by a distinct color. Prompt lengths are divided into five intervals, and the y-axis shows the number of tasks that fall within each interval.


Fig 3: Statistics of evaluation dimensions for all tasks in IDEA-Bench. Each of the five task categories is represented by a distinct color. A total of 12 evaluation dimensions are analyzed, with the radar chart values indicating the proportion of evaluation questions related to each dimension within each category.

Evaluation

Human Evaluation

The evaluation process for IDEA-Bench includes a rigorous human scoring system. Each case is assessed against its corresponding set of six binary evaluation questions, each with clearly defined 0-point and 1-point criteria. The scoring process follows a hierarchical structure:

  1. Hierarchical Scoring:
    • If either Question 1 or Question 2 receives a score of 0, the remaining four questions (Questions 3–6) are automatically scored as 0.
    • Similarly, if either Question 3 or Question 4 receives a score of 0, the last two questions (Questions 5 and 6) are scored as 0.
  2. Task-Level Scores: Scores for cases sharing the same task ID are averaged to calculate the task score.
  3. Category and Final Scores:
    • Certain tasks are grouped under professional-level categories.
    • Final scores for the five major categories are obtained by averaging the task scores within each category.
    • The overall model score is computed as the average of the five major category scores.

Scripts for score computation will be provided soon to streamline this process.
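Until the official scripts are released, the procedure above can be reproduced with a short script. The sketch below is illustrative only: the record layout (fields `task_id`, `category`, `answers`) and the [0, 1] case-score scale are assumptions, and professional-level grouping as well as the final scaling to the 0-100 range reported in the leaderboards are omitted.

```python
from collections import defaultdict


def apply_hierarchy(answers):
    """Apply the hierarchical rule to six binary answers (each 0 or 1).

    If Question 1 or 2 scores 0, Questions 3-6 are forced to 0; if
    Question 3 or 4 scores 0, Questions 5 and 6 are forced to 0.
    Returns the case score in [0, 1].
    """
    q = list(answers)
    assert len(q) == 6
    if q[0] == 0 or q[1] == 0:
        q[2:] = [0, 0, 0, 0]
    elif q[2] == 0 or q[3] == 0:
        q[4:] = [0, 0]
    return sum(q) / 6.0


def benchmark_score(cases):
    """Aggregate case records into the final benchmark score.

    `cases` is a list of dicts with keys "task_id", "category", and
    "answers" (the six binary answers).  Case scores sharing a task_id
    are averaged into task scores, task scores are averaged within each
    category, and the category means are averaged into the overall
    score.  Unsupported tasks should appear with all-zero answers so
    they count as 0 in the averages.
    """
    per_task = defaultdict(list)
    task_category = {}
    for case in cases:
        per_task[case["task_id"]].append(apply_hierarchy(case["answers"]))
        task_category[case["task_id"]] = case["category"]

    per_category = defaultdict(list)
    for task_id, scores in per_task.items():
        per_category[task_category[task_id]].append(sum(scores) / len(scores))

    category_means = {c: sum(s) / len(s) for c, s in per_category.items()}
    overall = sum(category_means.values()) / len(category_means)
    return overall, category_means
```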

MLLM Evaluation

The automated evaluation leverages multimodal large language models (MLLMs) to assess a subset of cases equipped with finely tuned prompts. These prompts have been meticulously refined by annotators to ensure detailed and accurate assessments. MLLMs evaluate the model outputs by interpreting the detailed questions and criteria provided in these prompts.
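As an illustration of how such prompt-driven automatic scoring can be wired up, the sketch below sends a generated image together with one refined evaluation question to a multimodal model via the OpenAI Python SDK. This is a minimal sketch under assumptions: the backend model, the single-digit answer format, and the function names are illustrative choices, not the procedure defined in the official repository.

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def ask_mllm(eval_prompt: str, image_path: str) -> int:
    """Ask the MLLM one binary evaluation question about a generated image.

    `eval_prompt` is an annotator-refined question with its 0-point and
    1-point criteria spelled out; the model is instructed to reply with a
    single digit, which is parsed into 0 or 1.
    """
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed backend; any capable MLLM works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": eval_prompt + "\nAnswer with a single digit: 0 or 1."},
                {"type": "image_url",
                 "image_url": {"url": encode_image(image_path)}},
            ],
        }],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0
```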

Further details about the MLLM evaluation process can be found in the IDEA-Bench GitHub repository. The repository includes additional resources and instructions for implementing automated evaluations.

These two complementary evaluation methods ensure that IDEA-Bench provides a comprehensive framework for assessing both human-aligned quality and automated model performance in professional-grade image generation tasks.

Automated Evaluation Leaderboard

Task types that a model cannot support are marked with "--" and are treated as 0 points in the average score calculation.

| Method | T2I | I2I | Is2I | T2Is | I(s)2Is | Avg. Score |
|---|---|---|---|---|---|---|
| FLUX-1 + GPT-4o | 83.33 | 28.71 | 4.86 | 32.29 | 37.50 | 37.34 |
| DALL-E 3 + GPT-4o | 55.56 | 27.78 | 20.37 | 29.17 | 29.17 | 32.41 |
| Pixart + GPT-4o | 26.30 | 30.55 | 29.86 | 35.42 | 31.25 | 30.68 |
| SD3 + GPT-4o | 56.30 | 23.61 | 5.55 | 32.29 | 31.25 | 29.80 |
| Emu2 | 35.80 | 43.52 | 34.72 | -- | -- | 22.81 |
| OmniGen | 46.30 | 40.27 | 25.46 | -- | -- | 22.41 |
| MagicBrush | -- | 9.72 | -- | -- | -- | 1.94 |
| InstructPix2Pix | -- | 27.78 | -- | -- | -- | 5.56 |
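As a concrete illustration of this rule: Emu2 does not support the T2Is and I(s)2Is settings, so those columns count as 0 and its average is (35.80 + 43.52 + 34.72 + 0 + 0) / 5 = 22.81.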

Overall Results (Human Rating)

Task types that a model cannot support are marked with "--" and are treated as 0 points in the average score calculation.

| Method | Param. | T2I | I2I | Is2I | T2Is | I(s)2Is | Avg. Score |
|---|---|---|---|---|---|---|---|
| FLUX-1 + GPT-4o | 12B | 46.06 | 12.13 | 4.89 | 20.15 | 29.17 | 22.48 |
| SD3 + GPT-4o | 2B | 24.04 | 10.79 | 4.69 | 21.59 | 13.06 | 14.83 |
| DALL-E 3 + GPT-4o | 12B | 24.34 | 6.95 | 5.27 | 14.36 | 14.44 | 13.07 |
| Pixart + GPT-4o | 0.6B | 14.44 | 7.75 | 3.48 | 17.46 | 21.39 | 12.90 |
| Emu2 | 37B | 17.98 | 7.05 | 8.98 | -- | -- | 6.81 |
| OmniGen | 3.8B | 21.41 | 8.17 | 2.77 | -- | -- | 6.47 |
| MagicBrush | 1B | -- | 19.07 | -- | -- | -- | 3.81 |
| InstructPix2Pix | 1B | -- | 17.58 | -- | -- | -- | 3.52 |
| Anole | 7B | 0.00 | 0.64 | 0.00 | 1.74 | 0.00 | 0.48 |

Results on Each Category

Each task category score is averaged across all of its subtasks, with the top-ranked model score for each subtask highlighted in bold. "" denotes the use of an MLLM for prompt rephrasing. Model names in blue indicate models that do not rely on any other model's capabilities.

Citation

Please cite our paper if you use our code or data:


@misc{liang2024ideabenchfargenerativemodels,
    title={IDEA-Bench: How Far are Generative Models from Professional Designing?}, 
    author={Chen Liang and Lianghua Huang and Jingwu Fang and Huanzhang Dou and Wei Wang and Zhi-Fan Wu and Yupeng Shi and Junge Zhang and Xin Zhao and Yu Liu},
    year={2024},
    eprint={2412.11767},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2412.11767},
}