InstanceGen

Image Generation with Instance-level Instructions

Etai Sella, Yanir Kleiman, Hadar Averbuch-Elor
Tel Aviv University · Meta AI · Cornell University
SIGGRAPH 2025


TL;DR: We introduce an inference-time technique that improves diffusion models' ability to generate images for complex prompts involving multiple objects, instance-level attributes, and spatial relationships*.

* The input to our method is only a text prompt (the words are colored just for illustration purposes!)


Abstract

Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle to capture the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, we are witnessing growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by making the observation that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances. Additionally, we contribute CompoundPrompts, a benchmark composed of complex prompts with three difficulty levels in which object instances are progressively compounded with attribute descriptions and spatial relations. Extensive experiments demonstrate that our method significantly surpasses the performance of prior models, particularly over complex multi-object and multi-attribute use cases.


Sample Results



Example prompts (for each, we show the Initial Image, the Instance Assignments, and the Output):

two guitars and a saxophone in a music store, the guitar on the right is a white fender, the one on the left is a rickenbacker bass

four action figures on a kid's shelf, from left to right they are batman, superman, wolverine and spiderman

a porcupine, a raccoon and a squirrel in the forest, the squirrel is in the middle holding a nut

four pillows stacked on top of eachother, from bottom to top they are pink, tiger pattern, covered in blue velvet and pinstriped

four friends dressed up for Halloween, from right to left they are dressed as a surgeon, a witch, a police officer, and a cowboy


We showcase sample results generated by our method. Each result is displayed alongside the initial image produced by the baseline diffusion model, which acts as a starting point for our method, and the "instance assignments" that are derived from the initial image and guide the generation of the output image.


How does it work?


Overview


🌍 Given a baseline diffusion model, we start by generating an initial (likely incorrect) image, while saving the cross-attention maps and noisy latent codes for the downstream stages.
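
For concreteness, here is a minimal sketch of the bookkeeping in this step. It assumes a diffusers-style Stable Diffusion pipeline (the specific model, module behavior, and callback API are assumptions, not necessarily the baseline used in the paper): per-step noisy latents are saved through the step-end callback, and cross-attention probabilities are captured by swapping in a custom attention processor.

import torch
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

attention_store = []   # cross-attention maps (pixels x text tokens), per layer and step
latent_store = []      # noisy latent codes, one per denoising step

class StoringAttnProcessor(AttnProcessor):
    # Same computation as the default processor (simplified), but keeps the
    # text-to-image attention probabilities for later use.
    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))
        probs = attn.get_attention_scores(query, key, attention_mask)
        if is_cross:
            attention_store.append(probs.detach().cpu())
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        return attn.to_out[1](attn.to_out[0](out))

pipe.unet.set_attn_processor(StoringAttnProcessor())

def save_latents(pipeline, step, timestep, callback_kwargs):
    # Called at the end of every denoising step; "latents" is requested below.
    latent_store.append(callback_kwargs["latents"].detach().cpu())
    return callback_kwargs

initial_image = pipe(
    "two guitars and a saxophone in a music store",
    callback_on_step_end=save_latents,
    callback_on_step_end_tensor_inputs=["latents"],
).images[0]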

💡 We then use the initial image, together with the cross-attention maps, to produce an instance segmentation map that captures all object instances in the initial image. The segmentation map and cross-attention maps are then summarized in a JSON file that details where each object instance is located, how much area it covers, and how likely it is to represent each word in the text prompt.
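
The sketch below illustrates what such a layout summary could look like; the helper name, array shapes, and JSON fields are illustrative assumptions rather than the exact format used in the paper. Here seg_map is an H×W array of instance ids (0 = background), and word_attn maps each prompt word to an H×W cross-attention heatmap averaged over layers and timesteps.

import json
import numpy as np

def summarize_layout(seg_map: np.ndarray, word_attn: dict[str, np.ndarray]) -> str:
    h, w = seg_map.shape
    instances = []
    for inst_id in np.unique(seg_map):
        if inst_id == 0:          # skip the background region
            continue
        mask = seg_map == inst_id
        ys, xs = np.nonzero(mask)
        instances.append({
            "id": int(inst_id),
            # location as a normalized bounding box [x0, y0, x1, y1]
            "bbox": [float(xs.min() / w), float(ys.min() / h),
                     float(xs.max() / w), float(ys.max() / h)],
            # fraction of the image covered by this instance
            "area": float(mask.mean()),
            # how strongly each prompt word attends to this region
            "word_scores": {word: round(float(attn[mask].mean()), 3)
                            for word, attn in word_attn.items()},
        })
    return json.dumps({"image_size": [w, h], "instances": instances}, indent=2)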

🔍 The instance layout summary is then given to an LLM, which is tasked with producing a set of "Instance Instructions". These instructions define what each instance region will represent in the output image - either one of the objects described in the text prompt or the background. We instruct the LLM to produce instance instructions that result in a layout that both satisfies the prompt and aligns as closely as possible with the initial image.
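
As a rough illustration, an instance instruction can be thought of as a (region, assignment) pair, and the LLM can be asked to return these pairs as JSON. The schema and prompt below are our own illustrative assumptions, not the paper's exact instruction format.

import json
from dataclasses import dataclass

@dataclass
class InstanceInstruction:
    instance_id: int   # region id from the instance layout summary
    assignment: str    # an object phrase from the text prompt, or "background"

SYSTEM_PROMPT = (
    "You are given a text prompt and a JSON layout summary of an image "
    "(instance regions, their locations, areas, and per-word attention scores). "
    "Assign every instance region either to one object mentioned in the prompt "
    "or to the background, so that the resulting layout satisfies the prompt "
    "(object counts, attributes, spatial relations) while deviating as little "
    "as possible from the initial image. Answer with a JSON list of "
    '{"instance_id": int, "assignment": str} objects.'
)

def parse_instructions(llm_answer: str) -> list[InstanceInstruction]:
    # Parse the LLM's JSON answer into typed instructions.
    return [InstanceInstruction(**item) for item in json.loads(llm_answer)]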

📋 The instance instructions then guide our "Assignment Conditioned Image Generation" stage, in which a set of losses and attention-manipulation components produce an image that accurately follows the instance layout and instructions while maintaining the visual quality of the initial image.
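
One plausible form for such a loss (a sketch under our own assumptions, not the paper's exact objective) encourages the cross-attention of each instance's assigned prompt token to concentrate inside that instance's mask:

import torch

def instance_attention_loss(attn: torch.Tensor, masks: torch.Tensor,
                            token_ids: torch.Tensor) -> torch.Tensor:
    # attn:      (pixels, tokens) cross-attention map at the current step
    # masks:     (num_instances, pixels) binary masks from the instance layout
    # token_ids: (num_instances,) prompt-token index assigned to each instance
    loss = attn.new_zeros(())
    for mask, tok in zip(masks, token_ids):
        a = attn[:, tok]
        inside = (a * mask).sum()
        total = a.sum() + 1e-8
        loss = loss + (1.0 - inside / total)   # maximize in-mask attention mass
    return loss / len(token_ids)

The gradient of such a loss with respect to the noisy latents can then be used to nudge each denoising step toward the instructed layout, which is the general pattern behind attention-guided generation.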


Benchmark

We introduce CompoundPrompts - a new benchmark for evaluating text-to-image models' ability to generate images that accurately depict complex multi-object prompts. CompoundPrompts is made up of 60 unique prompts, each with nine variants defined by three difficulty tiers (A, B, C) and three "total object count" versions (1, 2, 3).
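
For illustration, the nine variants of each base prompt can be enumerated as below; the identifier scheme is our own assumption, not the benchmark's file format.

from itertools import product

DIFFICULTY_TIERS = ["A", "B", "C"]
OBJECT_COUNT_VERSIONS = [1, 2, 3]

def variant_ids(prompt_id: int) -> list[str]:
    # 9 variants per base prompt, so 60 base prompts yield 540 prompts in total.
    return [f"{prompt_id:02d}-{tier}{count}"
            for tier, count in product(DIFFICULTY_TIERS, OBJECT_COUNT_VERSIONS)]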

Choose an example, a difficulty tier and an object count version to sample a CompoundPrompts prompt and its InstanceGen result.










BibTeX

@misc{sella2025instancegenimagegenerationinstancelevel,
    title={InstanceGen: Image Generation with Instance-level Instructions},
    author={Etai Sella and Yanir Kleiman and Hadar Averbuch-Elor},
    year={2025},
    eprint={2505.05678},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2505.05678},
}

Acknowledgements

We thank Filippos Kokkinos, Eric-Tuan Le and Andrea Vedaldi for their helpful feedback throughout the development of this work.