- [2025-09-03] 🌟 We have released the paper; data and code will follow in a few days, after the company's review.
Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thereby corresponding to two core capabilities: composition and reasoning. However, as T2I models increasingly advance toward reasoning beyond composition, existing benchmarks show clear limitations in providing comprehensive evaluation across and within these two capabilities. Meanwhile, these advances also enable models to handle more complex prompts, whereas current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), yielding a 12-dimensional evaluation taxonomy. To increase complexity, motivated by the inherent complexity of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. We also pair each prompt with a checklist of individual yes/no questions, each assessing one intended element independently, to enable fine-grained and reliable evaluation. In total, the benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that their composition capability remains limited in complex, high-density scenarios, while reasoning lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.
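Since the evaluation data and code are not yet public, the snippet below is only a minimal sketch of how the checklist protocol described above could be scored: a vision-language judge answers each yes/no checklist question about a generated image, and the prompt score is the fraction of "yes" answers. The `judge` callable and all function names are illustrative assumptions, not the released implementation.

```python
from typing import Callable, Sequence


def score_prompt(
    image_path: str,
    checklist: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of checklist questions answered 'yes' for one generated image.

    `judge(image_path, question)` is a stand-in for a VLM call (e.g., a query
    to a judge model such as Gemini 2.5 Flash) returning True iff it answers 'yes'.
    """
    answers = [judge(image_path, question) for question in checklist]
    return sum(answers) / len(answers)


if __name__ == "__main__":
    # Illustrative usage with a trivial stand-in judge.
    checklist = [
        "Is there a red umbrella in the image?",
        "Is the umbrella held by a child?",
    ]
    dummy_judge = lambda image, question: True  # replace with a real VLM call
    print(score_prompt("sample.png", checklist, dummy_judge))  # -> 1.0
```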
Benchmarks | MI | MA | MR | TR | LR | BR | HR | PR | GR | AR | CR | RR
---|---|---|---|---|---|---|---|---|---|---|---|---
T2I-CompBench | ◐ | ◐ | ◐ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ |
GenEval | ◐ | ◐ | ◐ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ |
GenAI-Bench | ◐ | ◐ | ◐ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ |
DPG-Bench | ● | ● | ● | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ |
ConceptMix | ◐ | ◐ | ◐ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ |
TIIF-Bench | ◐ | ◐ | ◐ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ |
LongBench-T2I | ● | ● | ● | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ |
Commonsense-T2I | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ◐ | ○ |
PhyBench | ○ | ○ | ○ | ○ | ○ | ◐ | ○ | ○ | ○ | ○ | ◐ | ○ |
WISE | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ○ | ◐ | ○ |
T2I-ReasonBench | ○ | ○ | ○ | ◐ | ○ | ○ | ○ | ○ | ○ | ○ | ◐ | ○ |
R2I-Bench | ○ | ○ | ◐ | ○ | ◐ | ◐ | ◐ | ○ | ○ | ○ | ◐ | ◐ |
OneIG-Bench | ● | ● | ● | ● | ○ | ○ | ○ | ○ | ○ | ○ | ◐ | ○ |
T2I-CoReBench (Ours) | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● |
Comparison between our T2I-CoReBench and existing T2I benchmarks. T2I-CoReBench comprehensively covers all 12 evaluation dimensions: the first four (MI, MA, MR, TR) evaluate composition and the remaining eight (LR, BR, HR, PR, GR, AR, CR, RR) evaluate reasoning, with the reasoning dimensions grouped into deductive, inductive, and abductive inference. Legend: ● high-complexity coverage (more than 5 visual elements or one-to-many/many-to-one inference), ◐ simple coverage (at most 5 visual elements or one-to-one inference), ○ not covered.
Models | MI | MA | MR | TR | Comp. Mean | LR | BR | HR | PR | GR | AR | CR | RR | Reas. Mean | Overall
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
**Diffusion Models** | | | | | | | | | | | | | | |
SD-3-Medium | 59.1 | 57.9 | 35.4 | 9.5 | 40.4 | 22.1 | 21.1 | 35.3 | 51.0 | 37.4 | 47.3 | 35.0 | 27.1 | 34.5 | 36.5 |
SD-3.5-Medium | 59.5 | 60.6 | 33.1 | 10.6 | 41.0 | 19.9 | 20.5 | 33.5 | 53.7 | 33.4 | 52.7 | 35.6 | 22.0 | 33.9 | 36.3 |
SD-3.5-Large | 57.5 | 60.0 | 32.9 | 15.6 | 41.5 | 22.5 | 22.4 | 34.2 | 52.5 | 35.5 | 53.0 | 42.3 | 25.2 | 35.9 | 37.8 |
FLUX.1-schnell | 65.4 | 63.1 | 47.6 | 22.4 | 49.6 | 25.0 | 25.1 | 40.9 | 64.7 | 47.6 | 54.0 | 39.6 | 22.9 | 40.0 | 43.2 |
FLUX.1-dev | 58.6 | 60.3 | 44.1 | 31.1 | 48.6 | 24.8 | 23.0 | 36.0 | 61.8 | 42.4 | 57.2 | 36.3 | 30.3 | 39.0 | 42.2 |
FLUX.1-Krea-dev | 70.7 | 71.1 | 53.2 | 28.9 | 56.0 | 30.3 | 26.1 | 44.5 | 70.6 | 50.5 | 57.5 | 46.3 | 28.7 | 44.3 | 48.2 |
PixArt-α | 40.2 | 42.2 | 14.2 | 3.3 | 25.0 | 11.6 | 11.6 | 21.1 | 30.4 | 22.6 | 44.4 | 26.7 | 20.9 | 23.7 | 24.1 |
PixArt-Σ | 47.2 | 49.7 | 23.8 | 2.8 | 30.9 | 14.7 | 18.3 | 26.7 | 39.2 | 25.7 | 44.9 | 33.9 | 24.3 | 28.5 | 29.3 |
HiDream-I1 | 62.5 | 62.0 | 42.9 | 33.9 | 50.3 | 34.2 | 24.5 | 40.9 | 53.2 | 34.2 | 50.3 | 46.1 | 31.7 | 39.4 | 43.0 |
Qwen-Image | 81.4 | 79.6 | 65.6 | 85.5 | 78.0 | 41.1 | 32.2 | 48.2 | 75.1 | 56.5 | 53.3 | 61.9 | 26.4 | 49.3 | 58.9 |
**Autoregressive Models** | | | | | | | | | | | | | | |
Infinity-8B | 63.9 | 63.4 | 47.5 | 10.8 | 46.4 | 28.6 | 25.9 | 42.9 | 62.6 | 47.3 | 59.2 | 46.9 | 24.6 | 42.3 | 43.6 |
GoT-R1-7B | 48.8 | 55.6 | 32.9 | 6.1 | 35.8 | 22.1 | 19.2 | 31.3 | 49.2 | 34.8 | 46.2 | 32.1 | 14.6 | 31.2 | 32.7 |
**Unified Models** | | | | | | | | | | | | | | |
BAGEL | 64.9 | 65.2 | 45.8 | 9.7 | 46.4 | 23.4 | 21.9 | 33.0 | 51.6 | 31.2 | 50.4 | 32.4 | 29.3 | 34.1 | 38.2 |
BAGEL w/ Think | 57.7 | 60.8 | 37.8 | 2.2 | 39.6 | 25.5 | 25.4 | 33.9 | 58.6 | 53.5 | 56.9 | 41.6 | 39.8 | 41.9 | 41.1 |
show-o2-1.5B | 59.5 | 60.3 | 36.1 | 4.6 | 40.1 | 21.6 | 21.8 | 37.1 | 47.7 | 39.9 | 44.7 | 29.0 | 24.0 | 33.2 | 35.5 |
show-o2-7B | 59.4 | 61.8 | 38.1 | 2.2 | 40.4 | 23.2 | 23.1 | 37.5 | 51.6 | 40.9 | 47.2 | 32.2 | 21.3 | 34.6 | 36.5 |
Janus-Pro-1B | 51.0 | 54.5 | 33.8 | 2.9 | 35.5 | 12.9 | 18.1 | 24.7 | 13.4 | 7.1 | 15.1 | 6.7 | 6.4 | 13.0 | 20.5 |
Janus-Pro-7B | 54.4 | 59.3 | 40.9 | 7.5 | 40.5 | 19.8 | 20.9 | 34.6 | 22.4 | 11.5 | 30.4 | 8.7 | 9.8 | 19.8 | 26.7 |
BLIP3o-4B | 45.6 | 47.5 | 20.3 | 0.5 | 28.5 | 14.2 | 17.7 | 26.3 | 36.3 | 37.6 | 37.8 | 31.3 | 24.8 | 28.2 | 28.3 |
BLIP3o-8B | 46.2 | 50.4 | 24.1 | 0.5 | 30.3 | 14.8 | 20.7 | 28.3 | 39.6 | 43.4 | 51.0 | 35.9 | 20.4 | 31.8 | 31.3 |
OmniGen2-7B | 67.9 | 64.1 | 48.3 | 19.2 | 49.9 | 24.7 | 23.2 | 43.3 | 63.1 | 46.1 | 54.2 | 36.5 | 24.1 | 39.4 | 42.9 |
**Closed-Source Models** | | | | | | | | | | | | | | |
Seedream 3.0 | 79.9 | 78.0 | 63.7 | 47.6 | 67.3 | 36.8 | 33.6 | 50.3 | 75.1 | 54.9 | 61.7 | 59.1 | 31.2 | 50.3 | 56.0 |
Gemini 2.0 Flash | 67.5 | 68.5 | 49.7 | 62.9 | 62.1 | 39.3 | 39.7 | 47.9 | 69.3 | 58.5 | 63.7 | 51.2 | 39.9 | 51.2 | 54.8 |
Nano Banana | 77.9 | 85.7 | 72.6 | 86.3 | 80.6 | 64.5 | 64.9 | 67.1 | 85.2 | 84.1 | 83.1 | 71.3 | 68.7 | 73.6 | 75.9 |
Imagen 4 | 82.8 | 74.3 | 66.3 | 90.2 | 78.4 | 44.5 | 51.8 | 56.8 | 82.8 | 79.5 | 73.3 | 72.8 | 65.3 | 65.9 | 70.0 |
Imagen 4 Ultra | 90.0 | 80.0 | 73.2 | 86.2 | 82.4 | 63.6 | 62.4 | 66.1 | 88.5 | 82.8 | 83.0 | 76.3 | 60.7 | 72.9 | 76.1 |
GPT-Image | 84.1 | 75.9 | 72.7 | 86.4 | 79.8 | 59.0 | 54.8 | 65.6 | 87.3 | 76.5 | 82.0 | 70.9 | 56.1 | 69.0 | 72.6 |
Main results on T2I-CoReBench across both composition and reasoning capabilities, evaluated with Gemini 2.5 Flash as the judge. "Comp. Mean" and "Reas. Mean" denote the mean score within each capability. The best and second-best results are marked in bold and underlined, separately for open-source and closed-source models.
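As a sanity check on how the summary columns relate to the per-dimension scores, the sketch below reproduces the two capability means and the Overall column as plain averages over the 4 composition dimensions, the 8 reasoning dimensions, and all 12 dimensions, respectively. This aggregation rule is inferred from the reported numbers rather than taken from released code, so treat it as an assumption; small discrepancies arise because the table entries are already rounded.

```python
COMPOSITION_DIMS = ("MI", "MA", "MR", "TR")
REASONING_DIMS = ("LR", "BR", "HR", "PR", "GR", "AR", "CR", "RR")


def summarize(scores: dict) -> tuple:
    """Return (composition mean, reasoning mean, overall) from 12 dimension scores."""
    comp = sum(scores[d] for d in COMPOSITION_DIMS) / len(COMPOSITION_DIMS)
    reas = sum(scores[d] for d in REASONING_DIMS) / len(REASONING_DIMS)
    overall = sum(scores.values()) / len(scores)  # mean over all 12 dimensions
    return comp, reas, overall


# SD-3-Medium row from the table above:
sd3_medium = {"MI": 59.1, "MA": 57.9, "MR": 35.4, "TR": 9.5,
              "LR": 22.1, "BR": 21.1, "HR": 35.3, "PR": 51.0,
              "GR": 37.4, "AR": 47.3, "CR": 35.0, "RR": 27.1}
print(summarize(sd3_medium))  # ~ (40.5, 34.5, 36.5) vs. reported (40.4, 34.5, 36.5)
```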
@misc{li2025easier,
title={Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?},
author={Ouxiang Li and Yuan Wang and Xinting Hu and Huijuan Huang and Rui Chen and Jiarong Ou and Xin Tao and Pengfei Wan and Fuli Feng},
year={2025},
eprint={2509.03516},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.03516},
}