ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

¹Tongyi Lab, Alibaba Group ²Peking University

Abstract

Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness can be summarized in three key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories, No-ref/Ref Image Creating/Editing, based on the presence or absence of source images and reference images, and further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. We introduce 11 metrics to support this multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric that leverages large models to assess whether an image edit succeeded. (3) Hybrid Data: The data comes from both real scenes and virtual generation, which improves data diversity and alleviates bias in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements. To foster further advancements in the field, we will open-source ICE-Bench, including its dataset, evaluation code, and models, thereby providing a valuable resource for the research community.
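The four coarse task categories follow from two yes/no questions: is there a source image, and is there a reference image. The sketch below is illustrative only and is not part of the ICE-Bench code. The function name `task_level1` is hypothetical, and the returned strings are assumed to match the `TaskLevel1` labels shown in the samples on this page, where a source image implies editing and its absence implies generation.

```python
def task_level1(has_source: bool, has_reference: bool) -> str:
    """Map the presence of source/reference images to a coarse task category.

    Hypothetical helper: the label strings mirror the TaskLevel1 values
    shown in the sample gallery, not an official ICE-Bench API.
    """
    # A reference image decides the "Reference" / "No Reference" prefix.
    prefix = "Reference" if has_reference else "No Reference"
    # A source image means the task edits an existing image.
    kind = "Editing" if has_source else "Generation"
    return f"{prefix} {kind}"


if __name__ == "__main__":
    for has_source in (False, True):
        for has_reference in (False, True):
            print(has_source, has_reference, task_level1(has_source, has_reference))
```

Under this reading, a Text2Image sample has neither input, while a Face Swap sample has both.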

Data Visualization

[SOURCE]

[GEN_CASE]

ItemID: da0b4a13b1af6906a5a13fbad6c7c891

TaskLevel1: No Reference Editing

TaskLevel2: Controllable Generation

Task: Depth-guided Generation

Instruction: Develop an artistic creation based on the depth map from [SOURCE] in accordance with "A watercolor painting depicts a vintage wooden hutch with a coffee station. The top shelf holds stacked mugs, a teapot, and a glass container. A coffee maker sits on the hutch's lower shelf, alongside a creamer, a teacup and saucer, a jar of coffee beans, and a small potted plant. Leafy green vines frame the sides of the hutch. The entire scene is rendered in warm, earthy tones.".

Target Caption: A watercolor painting depicts a vintage wooden hutch with a coffee station. The top shelf holds stacked mugs, a teapot, and a glass container. A coffee maker sits on the hutch's lower shelf, alongside a creamer, a teacup and saucer, a jar of coffee beans, and a small potted plant. Leafy green vines frame the sides of the hutch. The entire scene is rendered in warm, earthy tones.

[SOURCE]

[GEN_CASE]

ItemID: 6cf76f7883307f77104a1fb1508b4c89

TaskLevel1: No Reference Editing

TaskLevel2: Controllable Generation

Task: Edge-guided Generation

Instruction: Develop an art piece using the edge image [SOURCE] according to the description "A large moth, predominantly dark teal and black with symmetrical orange patches on its wings, dominates the center of the image against a deep teal background. Its wings, edged with red and featuring intricate patterns of pale teal and black circular designs, are spread wide. The moth is surrounded by a delicately painted frame of intertwined vines, leaves, and blossoms in muted shades of green, white, and pink. The floral elements create a balanced, almost symmetrical border around the central moth figure. The entire composition evokes a vintage, naturalistic illustration style.".

Target Caption: A large moth, predominantly dark teal and black with symmetrical orange patches on its wings, dominates the center of the image against a deep teal background...

[SOURCE]

[GEN_CASE]

ItemID: ffc12338d89e46e67b40416698634363

TaskLevel1: No Reference Editing

TaskLevel2: Controllable Generation

Task: Pose-guided Generation

Instruction: Refer to the posture key point map [SOURCE] to develop an artwork that matches: "A depiction of Spider-Man reimagined with a historical, possibly Mongolian-inspired aesthetic. He wears a red and blue outfit with fur trim and leather accents, squatting in a relaxed pose. His iconic spider mask remains, though adapted to the overall style. The backdrop is a muted blue-grey with a faded circular pattern. He appears as a warrior or nobleman, blending superhero elements with ancient attire.".

Target Caption: A depiction of Spider-Man reimagined with a historical, possibly Mongolian-inspired aesthetic. He wears a red and blue outfit with fur trim and leather accents, squatting in a relaxed pose. His iconic spider mask remains, though adapted to the overall style. The backdrop is a muted blue-grey with a faded circular pattern...

[SOURCE]

[GEN_CASE]

ItemID: a3847c63941b51f7deae11f56ce28b4e

TaskLevel1: No Reference Editing

TaskLevel2: Controllable Generation

Task: Image Colorization

Instruction: Follow directions in "A cartoon rabbit, equipped with an orange hiking outfit and backpack, strides along a mountain path. The rabbit appears cheerful and determined as it walks towards the viewer. A picturesque mountain range, featuring sharp peaks and lush green slopes, forms the backdrop." to add color to the gray-scale image [SOURCE].

Target Caption: A cartoon rabbit, equipped with an orange hiking outfit and backpack, strides along a mountain path. The rabbit appears cheerful and determined as it walks towards the viewer. A picturesque mountain range, featuring sharp peaks and lush green slopes, forms the backdrop. Orange and pink wildflowers dot the foreground and add a touch of color to the scene. The overall image has a bright, vibrant, and adventurous feel.

[SOURCE]

[GEN_CASE]

ItemID: c5302eed2b7ddf86c0566627fb94382a

TaskLevel1: No Reference Editing

TaskLevel2: Global Editing

Task: Color Editing

Instruction: Change the cloth color of [SOURCE] from orange to pink.

Target Caption: A woman, draped in a flowing, rose-pink gown, reclines gracefully upon an intricately carved chaise lounge of dark wood...

[SOURCE]

[GEN_CASE]

ItemID: 3272b5a82d7f15fb53a9b919780ffc97

TaskLevel1: No Reference Editing

TaskLevel2: Global Editing

Task: Composite Editing

Instruction: Change [SOURCE] to a watercolor style. Replace the beer with a glass of milk. Give the cat a top hat.

Target Caption: A ginger cat, depicted in a loose, watercolor style, sits perched at a rustic wooden bar...

[SOURCE]

[GEN_CASE]

ItemID: 2096b63dd97ba7a21210ca81b2f5ea74

TaskLevel1: No Reference Editing

TaskLevel2: Global Editing

Task: Text Render

Instruction: Add text 'apple' to the center of [SOURCE].

Target Caption: A silver Apple laptop, likely a slightly older MacBook Pro or PowerBook, rests open on a field of pale green-yellow grass. The laptop's screen is primarily black, but the word "apple" is clearly visible in the center, displayed in a bright, likely white, font. A small, white wireless mouse sits on top of the laptop's trackpad. In the background, a line of trees borders the grassy area, their green leaves providing a contrast to the lighter color of the grass. The scene suggests a relaxed, outdoor setting, perhaps in a park or a backyard.

[SOURCE]

[GEN_CASE]

ItemID: ae4c619f5a7f09ab61ddf87bfd76898c

TaskLevel1: No Reference Editing

TaskLevel2: Global Editing

Task: Subject Addition

Instruction: Add the back of a couple of lovers in [SOURCE].

Target Caption: A massive heart, formed from a dense cluster of smaller hearts and roses in varying shades of red and pink, hangs in the sky like a hot air balloon. Beneath it, a hazy cityscape rises in muted blues and grays, lending an ethereal quality to the scene. Two stylized trees with swirling, rose-like foliage in deep reds frame the landscape on either side. From the ground, numerous smaller heart-shaped balloons ascend on delicate strings, adding to the dreamy atmosphere. Scattered across the dark reddish-brown ground are even tinier heart shapes. At the center of the foreground, silhouetted against the backdrop of the floating heart and distant city, stand two figures...

[SOURCE]

[MASK]

[GEN_CASE]

ItemID: cbd92e37340c0a0fa505b615f89f7cb7

TaskLevel1: No Reference Editing

TaskLevel2: Local Editing

Task: Inpainting

Instruction: Rework the parts of [SOURCE] as specified by mask based on the content in "a boat on the lake, cartoon low details illustration, 70's color palette".

Target Caption: A picturesque canal city stretches along a waterway, with buildings in warm pastel hues lining the banks. A small, vibrant orange boat sits in the center of the canal, adding a pop of color. Crowning a hill overlooking the cityscape is a grand, light-colored cathedral with prominent domes and a tall spire. The sky is a soft blend of pink and orange, suggesting either sunrise or sunset. A cobbled pathway accompanies the canal and the buildings giving the whole image a peaceful vibe.

[SOURCE]

[MASK]

[GEN_CASE]

ItemID: bb23b029a02828519fff8eb95cc7d3e0

TaskLevel1: No Reference Editing

TaskLevel2: Local Editing

Task: Outpainting

Instruction: Extend the [SOURCE] in the mask area but avoid rearranging its original setup. caption: an industrial zone can be seen, high voltage tower lines, tall tower lines, and blue sky clouds. And the man and woman are standing in the front of the painting. He wears glasses, wears a hat, and wears a plaid shirt. He stands with one hand in his pocket, and with one hand on his waist, she is wearing a yellow blouse, and her mouth is slightly open. Above this painting, a few high voltage towers stand in it, and blue skies are dotted with clouds.

Target Caption: A man and a woman, both wearing protective eyewear, stand confidently before a backdrop of towering electrical transmission structures...

[SOURCE]

[MASK]

[GEN_CASE]

ItemID: 07406a9405035ef7e7ae0924240b63bb

TaskLevel1: No Reference Editing

TaskLevel2: Local Editing

Task: Local Subject Removal

Instruction: Remove the wine in [SOURCE].

Target Caption: The pinkish-red tablecloth, subtly patterned, dominates the scene. Upon it sit several empty wine glasses, their stems slender and their bowls gleaming faintly in the warm, dim light of the room. A clear plastic water bottle with a blue cap stands near the center of the table, some condensation perhaps clinging to its sides...

[SOURCE]

[MASK]

[GEN_CASE]

ItemID: f6b99d7bc064d09359dd1fa78bc5cebf

TaskLevel1: No Reference Editing

TaskLevel2: Local Editing

Task: Local Text Removal

Instruction: In the [SOURCE], obliterate the text found in the mask zone.

Target Caption: A pixel art fruit and vegetable stand glows warmly against the night. A man with dark hair and an orange shirt stands to the left, gazing at the colorful produce. The stand is topped by a reddish-pink awning, now devoid of any signage. Piles of oranges rest on green cloths, forming small pyramids...

[REF_1]

[GEN_CASE]

ItemID: 1fb63b29fba3a36d7fbbd7717441f0f8

TaskLevel1: Reference Generation

TaskLevel2: Face Reference Generation

Task: Face Reference Generation

Instruction: Maintain the same facial features of the girl in [REF_1], A young girl baking cookies in a cozy kitchen. Flour is dusted across the counter, and she wears an apron that's a bit too big, grinning as she adds chocolate chips to the dough, showing a sense of joy.

Target Caption: A young girl baking cookies in a cozy kitchen. Flour is dusted across the counter, and she wears an apron that's a bit too big, grinning as she adds chocolate chips to the dough.

[REF_1]

[GEN_CASE]

ItemID: 4a554a24fead2fd82306c61945406072

TaskLevel1: Reference Generation

TaskLevel2: Subject Reference Generation

Task: Subject Reference Generation

Instruction: Refer to the subject in [REF_1], A stuffed bunny is holding an apple and standing on the table.

Target Caption: A stuffed bunny is holding an apple and standing on the table.

[REF_1]

[GEN_CASE]

ItemID: 6283efda0798f126c9ef8988fc4496ff

TaskLevel1: Reference Generation

TaskLevel2: Style Reference Generation

Task: Style Reference Generation

Instruction: Using the style seen in [REF_1], create a masterpiece that adheres to 'These people are walking a dog in the snow.'.

Target Caption: These people are walking a dog in the snow.

[GEN_CASE]

ItemID: 51dc42742f6c61cd02ce854a704a111c

TaskLevel1: No Reference Generation

TaskLevel2: Text2Image

Task: Text2Image

Instruction: A yellow car on a city road, in the style of noir comic art, CryEngine, vibrant stage backdrops, manga-inspired.

Target Caption: A yellow car on a city road, in the style of noir comic art, CryEngine, vibrant stage backdrops, manga-inspired.

[SOURCE]

[MASK]

[REF_1]

[GEN_CASE]

ItemID: 4bebebc9f2a08e2c27933a7e6f0f1733

TaskLevel1: Reference Editing

TaskLevel2: Face Reference Editing

Task: Face Swap

Instruction: Make the person in [SOURCE] to have a face in [REF_1].

Target Caption: A young girl with light brown, slightly wavy hair is holding a dark cell phone to her ear, as if in the middle of a phone conversation. She wears a denim jacket...

[SOURCE]

[MASK]

[REF_1]

[GEN_CASE]

ItemID: f3218e38eb6c36a6da2a274a0b70c942

TaskLevel1: Reference Editing

TaskLevel2: Subject Reference Editing

Task: Subject-guided Inpainting

Instruction: Re-illustrate the mask of [SOURCE] with guidance from [REF_1].

Target Caption: A cheerful, bright yellow rubber duck sits at the center of a dark, complex, and technological structure. The duck, with its signature orange beak and simple black eyes, replaces the Earth in the original composition. It rests on a platform of intricate circuitry, glowing faintly with blue light, which mirrors the original base. The green tendrils, formerly suggestive of data streams or energy pathways, now appear to playfully wrap around the duck, as if embracing it. The overall atmosphere remains dark and dramatic, with the backdrop a hazy mix of blacks and deep blues. The duck, however, introduces a stark contrast of vibrant color and simplistic form against the complex, technological environment, creating a whimsical and slightly surreal juxtaposition. The grid-like structure of the original globe is absent, replaced by the smooth, matte surface of the rubber duck.
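Each sample above lists its input placeholders (`[SOURCE]`, `[MASK]`, `[REF_1]`), then the `[GEN_CASE]` output placeholder, then a series of `Key: value` fields. The sketch below reads one such block into a small record. It assumes the plain-text layout shown on this page; the names `Sample` and `parse_sample` are hypothetical and do not describe the released dataset format.

```python
from dataclasses import dataclass, field

# Field names and input placeholders as they appear in the samples on this page.
FIELDS = ("ItemID", "TaskLevel1", "TaskLevel2", "Task", "Instruction", "Target Caption")
INPUT_TAGS = ("[SOURCE]", "[MASK]", "[REF_1]")


@dataclass
class Sample:
    inputs: list = field(default_factory=list)   # input placeholders, in order
    fields: dict = field(default_factory=dict)   # key -> value for known fields


def parse_sample(text: str) -> Sample:
    """Parse one sample block (hypothetical helper for the page's text layout)."""
    sample = Sample()
    for raw in text.splitlines():
        line = raw.strip()
        # A line consisting only of an input placeholder marks an input image.
        if line in INPUT_TAGS:
            sample.inputs.append(line)
            continue
        # Split on the first ": " only, since instructions may contain colons.
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:
            sample.fields[key] = value
    return sample
```

Splitting on the first `": "` only keeps instructions that contain colons or inline `[SOURCE]` mentions intact. The `[GEN_CASE]` line is skipped because it marks the generated output rather than an input.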

Evaluation Dimensions

BibTeX

@article{icebench,
  title   = {ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing},
  author  = {Pan, Yulin and He, Xiangteng and Mao, Chaojie and Han, Zhen and Jiang, Zeyinzi and Zhang, Jingfeng and Liu, Yu},
  journal = {arXiv preprint arXiv:2503.14482},
  year    = {2025}
}