Not Every Gift Comes in Gold Paper or with a Red Ribbon:

Exploring Color Perception in Text-to-Image Models

1Tel Aviv University, 2Cornell University

Overview

Text-to-image generation methods such as FLUX (top row) and Stable-Diffusion-2.1 (bottom row) can faithfully render even uncommon colors, such as CornflowerBlue, in simple, single-object prompts. However, when faced with multi-color, multi-object prompts, their performance degrades significantly. In this work, we introduce a new benchmark for exploring this problem (also commonly referred to as multi-attribute leakage), and propose an image editing technique tailored to mitigating it in the multi-color setting. Our method consistently outperforms existing editing approaches such as AnySD, FPE and MasaCtrl.


Abstract

Text-to-image generation has recently seen remarkable success, granting users the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly relied on coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or on human evaluations, which are challenging to conduct at scale. In this work, we perform a case study on colors, a fundamental attribute commonly associated with objects in text prompts, which offers a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes, far more so than with single-color prompts, and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique that mitigates multi-object semantic misalignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance across a wide range of metrics on images generated by various diffusion-based text-to-image techniques. We will make our code, benchmark and evaluation protocol publicly available.


Results

Explore our interactive visualization displaying the results of all models on our dataset. Below, you can see the initial images generated by FLUX alongside the edits produced by our method.

"a snow colored umbrella and a royal-blue colored bicycle"

Initial

Edited

"a hot-pink colored suitcase and a tomato colored hat"

Initial

Edited

"a gray colored umbrella and a cyan colored car"

Initial

Edited

"a light-skyblue colored suitcase and a hot-pink colored bench"

Initial

Edited

"a mint-cream colored hat and a light-cyan colored backpack"

Initial

Edited


The CompColor Benchmark

We constructed the CompColor benchmark to evaluate color fidelity in text-to-image models. The benchmark focuses on complex multi-object prompts, since it remains an open question whether current state-of-the-art models can adhere to them; as the attribute under study, we focus on colors. We build the benchmark by creating pairs of colors for prompts structured as “a {color1} colored {object1} and a {color2} colored {object2}.” We distinguish between close and distant colors based on their perceptual similarity in the CIELAB color space.


Close colors are those that appear visually similar to the human eye and have a low LAB distance (e.g., SkyBlue vs. LightCyan), while distant colors are distinctly different and have a high LAB distance (e.g., SkyBlue vs. HotPink).
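
To make this distinction concrete, below is a minimal sketch (not our released benchmark code) of how a color pair could be classified by its CIELAB distance and turned into a benchmark prompt. The lab_distance and make_prompt helpers and the CLOSE_THRESHOLD cutoff are illustrative assumptions.

# Sketch: classify a named color pair as "close" or "distant" via its CIELAB (Delta-E 76) distance.
# The threshold below is a hypothetical placeholder, not the value used in the paper.
import numpy as np
from matplotlib.colors import to_rgb   # resolves CSS/X11 color names to RGB triplets
from skimage.color import rgb2lab      # sRGB -> CIELAB conversion

def lab_distance(color1: str, color2: str) -> float:
    """Euclidean (CIE76) distance between two named colors in CIELAB space."""
    rgb = np.array([[to_rgb(color1), to_rgb(color2)]], dtype=float)  # shape (1, 2, 3)
    lab = rgb2lab(rgb)[0]                                            # shape (2, 3)
    return float(np.linalg.norm(lab[0] - lab[1]))

CLOSE_THRESHOLD = 25.0  # hypothetical cutoff, for illustration only

def make_prompt(color1: str, object1: str, color2: str, object2: str) -> str:
    return f"a {color1} colored {object1} and a {color2} colored {object2}"

for c1, c2 in [("skyblue", "lightcyan"), ("skyblue", "hotpink")]:
    d = lab_distance(c1, c2)
    kind = "close" if d < CLOSE_THRESHOLD else "distant"
    print(f"{c1} vs. {c2}: dE = {d:.1f} ({kind})")
    print("  prompt:", make_prompt(c1, "umbrella", c2, "bicycle"))

Pairs below the cutoff would populate the close-color split, and pairs above it the distant-color split.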

ColorEdit: Our Inference-Time Color Editing Method

We introduce ColorEdit, an inference-time approach that leverages the attention layers of text-to-image diffusion models to edit both real and generated images so that they match a color specification. Our goal is to edit the image Is so that it matches the color specification as closely as possible, while preserving its overall appearance and structure and still producing a high-quality image.



Given an input image Is and a target prompt P containing multiple color attributes, we edit the image to match the color specification while preserving all other attributes. Our approach operates as follows:

🌍 We perform DDIM inversion to get the latents ZT.

💡 In the upper branch, we run the backward (denoising) process using a simplified, color-less text prompt Psimp and extract a pseudo-GT cross-attention map for each object.

🔍 Finally, in the lower branch, we run another backward process, this time using the target prompt P. At each step, we use two objectives to guide our inference-time optimization (see the sketch after this list):

  • A Stroop Attention Loss, inspired by the psychological Stroop effect, which binds each color to the correct object using that object's pseudo-GT cross-attention map.
  • A Color Loss, which drives each object toward the correct color using its reference RGB value.
The final edited result is illustrated in the bottom-right image.
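
Below is a minimal PyTorch-style sketch of how the two objectives could be combined at each denoising step. It is an illustration under our own naming assumptions rather than the released ColorEdit implementation: the tensor shapes, loss weights and helper names are hypothetical.

# Sketch of the two guidance objectives (illustrative; not the released ColorEdit code).
import torch
import torch.nn.functional as F

def stroop_attention_loss(attn_target: torch.Tensor, attn_pseudo_gt: torch.Tensor) -> torch.Tensor:
    """Encourage the object's cross-attention map under the target (colored) prompt to match
    the pseudo-GT map extracted with the simplified prompt, keeping color tokens bound to
    the correct object."""
    return F.mse_loss(attn_target, attn_pseudo_gt)

def color_loss(decoded_image: torch.Tensor, object_mask: torch.Tensor, target_rgb: torch.Tensor) -> torch.Tensor:
    """Push the mean color of the masked object region toward the reference RGB value."""
    # decoded_image: (3, H, W) in [0, 1]; object_mask: (H, W) in [0, 1]; target_rgb: (3,)
    masked = decoded_image * object_mask
    mean_rgb = masked.sum(dim=(1, 2)) / object_mask.sum().clamp(min=1e-6)
    return F.mse_loss(mean_rgb, target_rgb)

def guidance_objective(attn_target, attn_pseudo_gt, decoded_image, object_mask, target_rgb,
                       w_stroop: float = 1.0, w_color: float = 1.0) -> torch.Tensor:
    """Weighted sum of the two losses (weights are placeholders)."""
    return (w_stroop * stroop_attention_loss(attn_target, attn_pseudo_gt)
            + w_color * color_loss(decoded_image, object_mask, target_rgb))

# Toy usage with random tensors, for shape checking only.
attn_t, attn_gt = torch.rand(16, 16), torch.rand(16, 16)
img = torch.rand(3, 64, 64)
mask = (torch.rand(64, 64) > 0.5).float()
hot_pink = torch.tensor([1.0, 0.41, 0.71])
print(float(guidance_objective(attn_t, attn_gt, img, mask, hot_pink)))

In the full method, such a combined objective would guide the inference-time optimization at each denoising step of the lower branch.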


BibTeX

@misc{chai2025giftcomesgoldpaper,
  title={Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text-to-Image Models},
  author={Shay Shomer Chai and Wenxuan Peng and Bharath Hariharan and Hadar Averbuch-Elor},
  year={2025},
  eprint={2508.19791},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.19791},
}