Evaluation
Human Evaluation
The evaluation process for IDEA-Bench includes a rigorous human scoring system. Each case is assessed against six binary evaluation questions, each with clearly defined 0-point and 1-point standards. The scoring process follows a hierarchical structure:
- Hierarchical Scoring:
  - If either Question 1 or Question 2 receives a score of 0, the remaining four questions (Questions 3–6) are automatically scored as 0.
  - Similarly, if either Question 3 or Question 4 receives a score of 0, the last two questions (Questions 5 and 6) are scored as 0.
- Task-Level Scores: Scores for cases sharing the same task ID are averaged to calculate the task score.
- Category and Final Scores:
  - Certain tasks are grouped under professional-level categories.
  - Final scores for the five major categories are obtained by averaging the task scores within each category.
  - The overall model score is computed as the average of the five major category scores.
Scripts for score computation will be provided soon to streamline this process.
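Until the official scripts are released, the Python sketch below illustrates the aggregation described above. It assumes that a case score is the mean of its six hierarchy-adjusted question scores, and that the `cases` and `task_to_category` dictionaries are prepared by the user; the function names and these conventions are illustrative assumptions, not the benchmark's released tooling.

```python
from collections import defaultdict

def apply_hierarchy(answers):
    """Apply the hierarchical rules to the six binary answers of one case.

    `answers` holds the 0/1 scores for Questions 1-6. If Question 1 or 2
    is 0, Questions 3-6 are forced to 0; if Question 3 or 4 is then 0,
    Questions 5 and 6 are forced to 0.
    """
    scores = list(answers)
    if scores[0] == 0 or scores[1] == 0:
        scores[2:] = [0, 0, 0, 0]
    if scores[2] == 0 or scores[3] == 0:
        scores[4:] = [0, 0]
    return scores

def aggregate(cases, task_to_category):
    """Average case scores into task, category, and overall model scores.

    `cases` maps a task ID to a list of per-case answer lists;
    `task_to_category` maps a task ID to one of the five category names.
    """
    # Task score: mean of the per-case scores sharing that task ID
    # (each case score assumed to be the mean of its six question scores).
    task_scores = {
        task_id: sum(sum(apply_hierarchy(a)) / 6 for a in answers) / len(answers)
        for task_id, answers in cases.items()
    }

    # Category score: mean of the task scores within each category.
    by_category = defaultdict(list)
    for task_id, score in task_scores.items():
        by_category[task_to_category[task_id]].append(score)
    category_scores = {c: sum(s) / len(s) for c, s in by_category.items()}

    # Overall model score: mean of the five category scores.
    overall = sum(category_scores.values()) / len(category_scores)
    return task_scores, category_scores, overall
```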
MLLM Evaluation
The automated evaluation leverages multimodal large language models (MLLMs) to assess a subset of cases that come with carefully tuned evaluation prompts. These prompts have been meticulously refined by annotators to ensure detailed and accurate assessments. MLLMs evaluate the model outputs by interpreting the detailed questions and criteria provided in these prompts.
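As a rough, hypothetical sketch of this flow, the snippet below assembles one binary question with its 0-point and 1-point standards into a prompt and parses a 0/1 reply. The prompt wording, the user-supplied `query_mllm` callable, and the parsing rule are illustrative assumptions rather than the benchmark's actual templates, which are maintained in the repository.

```python
def build_evaluation_prompt(question: str, criteria_0: str, criteria_1: str) -> str:
    """Assemble one binary evaluation question with its scoring standards."""
    return (
        "You are evaluating an image generated for a professional design task.\n"
        f"Question: {question}\n"
        f"Score 0 if: {criteria_0}\n"
        f"Score 1 if: {criteria_1}\n"
        "Answer with a single digit, 0 or 1."
    )

def score_case(questions, query_mllm) -> list[int]:
    """Query the MLLM once per question and collect its binary answers.

    `questions` is a list of (question, 0-point standard, 1-point standard)
    tuples; `query_mllm` is a user-supplied callable that sends the prompt
    (together with the generated images) to the chosen multimodal model and
    returns its text reply.
    """
    scores = []
    for question, crit0, crit1 in questions:
        reply = query_mllm(build_evaluation_prompt(question, crit0, crit1))
        # Treat any reply beginning with "1" as a pass; everything else scores 0.
        scores.append(1 if reply.strip().startswith("1") else 0)
    return scores
```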
Further details about the MLLM evaluation process can be found in the IDEA-Bench GitHub repository. The repository includes additional resources and instructions for implementing automated evaluations.
These two complementary evaluation methods ensure that IDEA-Bench provides a comprehensive framework for assessing both human-aligned quality and automated model performance in professional-grade image generation tasks.