Text-to-image generation methods such as FLUX (top row) and Stable-Diffusion-2.1 (bottom row) can faithfully render even uncommon colors, such as CornflowerBlue, in simple, single-object prompts. However, when faced with multi-color, multi-object prompts, their performance degrades significantly. In this work, we introduce a new benchmark for exploring this problem (also commonly referred to as multi-attribute leakage), and propose an image editing technique tailored to mitigating it in the case of multiple colors. Our method consistently outperforms existing editing approaches such as AnySD, FPE and MasaCtrl.
Text-to-image generation has recently seen remarkable success, granting users the ability to create high-quality images through text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly relied on coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or on human evaluations, which are challenging to conduct at scale. In this work, we perform a case study on colors, a fundamental attribute commonly associated with objects in text prompts, which offers a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes (far more so than single-color prompts), and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique that mitigates the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance across a wide range of metrics, on images generated by various diffusion-based text-to-image techniques. We will make our code, benchmark and evaluation protocol publicly available.
We constructed the CompColor benchmark to evaluate color fidelity in text-to-image models.
Our benchmark focuses on complex multi-object prompts, since it remains an open question whether current
state-of-the-art models can adhere to them; as the attribute of interest, we focus on colors.
We build our benchmark by creating pairs
of colors for prompts structured as “a {color1} colored {object1} and a {color2} colored {object2}.”
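As a small illustration of this construction, a sketch is given below; the color and object lists are placeholders rather than the benchmark's actual vocabularies, and the pairing scheme is an assumption.

```python
from itertools import combinations

# Placeholder vocabularies; the benchmark's actual color and object lists differ.
colors = ["CornflowerBlue", "Crimson", "Olive", "Orchid"]
objects = ["backpack", "bench", "bicycle", "teapot"]

prompts = [
    f"a {c1} colored {o1} and a {c2} colored {o2}"
    for c1, c2 in combinations(colors, 2)
    for o1, o2 in combinations(objects, 2)
]
print(len(prompts))   # 6 color pairs x 6 object pairs = 36 prompts
print(prompts[0])     # "a CornflowerBlue colored backpack and a Crimson colored bench"
```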
We distinguish between close and distant colors based on their perceptual
similarity in the CIELAB color space.
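For reference, perceptual similarity in CIELAB can be measured with a delta-E distance. The sketch below uses CIE76 (plain Euclidean distance in Lab) via scikit-image; the close/distant threshold is an illustrative assumption, not the benchmark's actual cutoff.

```python
import numpy as np
from skimage import color  # pip install scikit-image

def delta_e(rgb1, rgb2):
    """CIE76 delta-E: Euclidean distance between two sRGB colors in CIELAB."""
    to_lab = lambda rgb: color.rgb2lab(np.array(rgb, float).reshape(1, 1, 3) / 255.0)
    return float(np.linalg.norm(to_lab(rgb1) - to_lab(rgb2)))

CLOSE_THRESHOLD = 30.0  # assumed value, for illustration only

def classify_pair(rgb1, rgb2):
    return "close" if delta_e(rgb1, rgb2) < CLOSE_THRESHOLD else "distant"

print(classify_pair((0, 0, 128), (0, 0, 139)))        # Navy vs. DarkBlue          -> close
print(classify_pair((100, 149, 237), (220, 20, 60)))  # CornflowerBlue vs. Crimson -> distant
```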
We introduce ColorEdit, an inference-time approach that utilizes attention-based diffusion models for editing both real and generated images to
match color specifications.
Our goal is to edit the image I_s so that it matches the color specification as closely as possible, while preserving its overall appearance
and structure and still producing a high-quality image.
Given an input image I_s and a target prompt P containing multiple color attributes,
we present an approach for editing the image to match the color specification while preserving all other attributes.
Our approach operates as follows (a rough code sketch of these steps is given after the list):
🌍 We perform DDIM inversion to get the latents Z_T.
💡 In the upper branch, we perform the backward (denoising) process using a simplified, color-less text prompt P_simp and extract the pseudo-GT cross-attention map of each object.
🔍 Finally, in the lower branch we perform another backward process, this time using the target prompt P; at each step, we utilize
two objectives to guide our inference-time optimization procedure.
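To make the pipeline concrete, here is a minimal, hedged sketch of the three stages using Hugging Face diffusers with a Stable Diffusion backbone. The model ID, step counts, learning rate, and especially `guidance_loss` are placeholders: the two ColorEdit objectives, and how the pseudo-GT cross-attention maps are consumed, are not reproduced here, and classifier-free guidance is omitted for brevity.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, DDIMInverseScheduler

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float32
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
inv_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)


def encode_prompt(prompt):
    ids = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(device)
    return pipe.text_encoder(ids)[0]


@torch.no_grad()
def ddim_invert(z0, prompt_embeds, steps=50):
    """Step 1: DDIM inversion of the clean latents z0 (the VAE-encoded input image I_s,
    e.g. pipe.vae.encode(img).latent_dist.mean * pipe.vae.config.scaling_factor) to Z_T."""
    inv_scheduler.set_timesteps(steps, device=device)
    z = z0
    for t in inv_scheduler.timesteps:
        eps = pipe.unet(z, t, encoder_hidden_states=prompt_embeds).sample
        z = inv_scheduler.step(eps, t, z).prev_sample
    return z


@torch.no_grad()
def reference_pass(z_T, embeds_simple, steps=50):
    """Step 2 (upper branch): denoise Z_T with the simplified, color-less prompt P_simp.
    In the full method, cross-attention hooks would record a pseudo-GT map per object here."""
    pipe.scheduler.set_timesteps(steps, device=device)
    z = z_T
    for t in pipe.scheduler.timesteps:
        eps = pipe.unet(z, t, encoder_hidden_states=embeds_simple).sample
        z = pipe.scheduler.step(eps, t, z).prev_sample
    return z


def guidance_loss(z, eps):
    # Placeholder standing in for the two ColorEdit objectives; in practice these would
    # compare the current cross-attention maps / colors against the pseudo-GT maps
    # collected in the reference pass.
    return eps.pow(2).mean()


def guided_pass(z_T, embeds_target, steps=50, lr=0.05):
    """Step 3 (lower branch): denoise with the target prompt P, nudging the latents at each
    step by the gradient of the (placeholder) guidance objectives."""
    pipe.scheduler.set_timesteps(steps, device=device)
    z = z_T
    for t in pipe.scheduler.timesteps:
        z = z.detach().requires_grad_(True)
        eps = pipe.unet(z, t, encoder_hidden_states=embeds_target).sample
        grad = torch.autograd.grad(guidance_loss(z, eps), z)[0]
        z = (z - lr * grad).detach()
        with torch.no_grad():
            eps = pipe.unet(z, t, encoder_hidden_states=embeds_target).sample
            z = pipe.scheduler.step(eps, t, z).prev_sample
    return z
```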
@misc{chai2025giftcomesgoldpaper,
title={Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text-to-Image Models},
author={Shay Shomer Chai and Wenxuan Peng and Bharath Hariharan and Hadar Averbuch-Elor},
year={2025},
eprint={2508.19791},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.19791},
}