Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

1University of Science and Technology of China  2Kling Team, Kuaishou Technology  3The University of Hong Kong
*Work done during internship at Kling Team, Kuaishou Technology.
Corresponding authors. Project lead.
  • [2025-12-01] 🌟 We have updated the evaluation results of HunyuanImage-3.0 and Z-Image-Turbo.
  • [2025-11-22] 🌟 We have updated the evaluation results of 🍌 Nano Banana Pro, which achieves a new SOTA across all 12 dimensions by a substantial margin (see 🏆 leaderboard for more details).
  • [2025-10-01] 🌟 We have updated a new arXiv version with clearer descriptions and more comprehensive analyses.
  • [2025-09-20] 🌟 We have updated the evaluation results of Seedream 4.0.
  • [2025-09-08] 🌟 We have released the generated images from the evaluated T2I models in our benchmark to facilitate convenient evaluation with different MLLMs.
  • [2025-09-08] 🌟 We have released the benchmark data and code.
  • [2025-09-03] 🌟 We have released the paper, with data and code to follow shortly after the company's review.
Teaser Image
Overview of our T2I-CoReBench. (a) Our benchmark comprehensively covers two fundamental T2I capabilities (i.e., composition and reasoning), further refined into 12 dimensions. (b–e) Our benchmark poses greater challenges to current T2I models, with higher compositional density than DPG-Bench and harder multi-step reasoning than R2I-Bench, enabling clearer performance differentiation across models under real-world complexities. Each image is scored based on the ratio of correctly generated elements.


Abstract

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thereby corresponding to two core capabilities: composition and reasoning. Despite recent advances of T2I models in both composition and reasoning, existing benchmarks remain limited in their evaluation: they fail to provide comprehensive coverage across and within both capabilities, and they largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene-graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, motivated by the complexity inherent in real-world scenarios, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To facilitate fine-grained and reliable evaluation, we pair each evaluation prompt with a checklist of individual yes/no questions that assess each intended element independently. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 28 current T2I models reveal that their composition capability remains limited in high-density scenarios, while their reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.
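To make the checklist protocol concrete, here is a minimal sketch of per-image scoring under our reading of the abstract: an MLLM evaluator answers each yes/no checklist question about a generated image, and the image's score is the fraction answered "yes". The `ask_mllm` helper, the `PromptCase` structure, and the example prompt are hypothetical illustrations, not the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PromptCase:
    prompt: str           # T2I prompt given to the model under evaluation
    checklist: list[str]  # independent yes/no questions, one per intended element

def score_image(image_path: str, case: PromptCase,
                ask_mllm: Callable[[str, str], str]) -> float:
    """Return the fraction of checklist questions the MLLM evaluator
    answers "yes" for this image (the per-image score)."""
    answers = [ask_mllm(image_path, q) for q in case.checklist]
    return sum(a.strip().lower().startswith("yes") for a in answers) / len(case.checklist)

# Hypothetical example in the spirit of the benchmark:
case = PromptCase(
    prompt="A chef plating three desserts beside a handwritten menu",
    checklist=[
        "Is there a chef in the image?",
        "Are there exactly three desserts?",
        "Is there a handwritten menu?",
        "Is the menu beside the desserts?",
    ],
)
# score_image("sample.png", case, ask_mllm)  -> e.g. 3/4 = 0.75 if one element is missing
```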

1. Benchmark Comparison

Benchmarks         | Composition | Reasoning (Deductive / Inductive / Abductive)
                   | MI MA MR TR | LR BR HR PR GR AR CR RR
T2I-CompBench
GenEval
GenAI-Bench
DPG-Bench
ConceptMix
TIIF-Bench
LongBench-T2I
PRISM-Bench
UniGenBench
Commonsense-T2I
PhyBench
WISE
T2I-ReasonBench
R2I-Bench
OneIG-Bench
T2I-CoReBench (Ours)

Comparison between our T2I-CoReBench and existing T2I benchmarks. T2I-CoReBench comprehensively covers all 12 evaluation dimensions spanning both composition and reasoning scenarios. Legend: high-complexity coverage denotes visual elements > 5 or one-to-many/many-to-one inference; simple coverage denotes visual elements ≤ 5 or one-to-one inference; remaining dimensions are not covered.

2. Examples of Each Dimension

3. Leaderboard

Evaluator: Gemini 2.5 Flash
                   | Composition              | Reasoning                                    |
Models             |   MI   MA   MR   TR Mean |   LR   BR   HR   PR   GR   AR   CR   RR Mean | Overall

Diffusion Models
SD-3-Medium        | 59.1 57.9 35.4  9.5 40.4 | 22.1 21.1 35.3 51.0 37.4 47.3 35.0 27.1 34.5 |    36.5
SD-3.5-Medium      | 59.5 60.6 33.1 10.6 41.0 | 19.9 20.5 33.5 53.7 33.4 52.7 35.6 22.0 33.9 |    36.3
SD-3.5-Large       | 57.5 60.0 32.9 15.6 41.5 | 22.5 22.4 34.2 52.5 35.5 53.0 42.3 25.2 35.9 |    37.8
FLUX.1-schnell     | 65.4 63.1 47.6 22.4 49.6 | 25.0 25.1 40.9 64.7 47.6 54.0 39.6 22.9 40.0 |    43.2
FLUX.1-dev         | 58.6 60.3 44.1 31.1 48.6 | 24.8 23.0 36.0 61.8 42.4 57.2 36.3 30.3 39.0 |    42.2
PixArt-α           | 40.2 42.2 14.2  3.3 25.0 | 11.6 11.6 21.1 30.4 22.6 44.4 26.7 20.9 23.7 |    24.1
PixArt-Σ           | 47.2 49.7 23.8  2.8 30.9 | 14.7 18.3 26.7 39.2 25.7 44.9 33.9 24.3 28.5 |    29.3
HiDream-I1         | 62.5 62.0 42.9 33.9 50.3 | 34.2 24.5 40.9 53.2 34.2 50.3 46.1 31.7 39.4 |    43.0
Z-Image-Turbo      | 79.5 72.2 62.7 83.9 74.6 | 36.9 28.8 48.7 74.0 56.2 55.8 52.0 26.0 47.3 |    56.4
HunyuanImage-3.0   | 84.9 81.2 63.7 85.7 78.9 | 39.6 32.8 51.4 72.4 54.1 54.1 57.0 27.7 48.6 |    58.7

Autoregressive Models
Infinity-8B        | 63.9 63.4 47.5 10.8 46.4 | 28.6 25.9 42.9 62.6 47.3 59.2 46.9 24.6 42.3 |    43.6
GoT-R1-7B          | 48.8 55.6 32.9  6.1 35.8 | 22.1 19.2 31.3 49.2 34.8 46.2 32.1 14.6 31.2 |    32.7

Unified Models
BAGEL              | 64.9 65.2 45.8  9.7 46.4 | 23.4 21.9 33.0 51.6 31.2 50.4 32.4 29.3 34.1 |    38.2
BAGEL w/ Think     | 57.7 60.8 37.8  2.2 39.6 | 25.5 25.4 33.9 58.6 53.5 56.9 41.6 39.8 41.9 |    41.1
show-o2-1.5B       | 59.5 60.3 36.1  4.6 40.1 | 21.6 21.8 37.1 47.7 39.9 44.7 29.0 24.0 33.2 |    35.5
show-o2-7B         | 59.4 61.8 38.1  2.2 40.4 | 23.2 23.1 37.5 51.6 40.9 47.2 32.2 21.3 34.6 |    36.5
Janus-Pro-1B       | 51.0 54.5 33.8  2.9 35.5 | 12.9 18.1 24.7 13.4  7.1 15.1  6.7  6.4 13.0 |    20.5
Janus-Pro-7B       | 54.4 59.3 40.9  7.5 40.5 | 19.8 20.9 34.6 22.4 11.5 30.4  8.7  9.8 19.8 |    26.7
BLIP3o-4B          | 45.6 47.5 20.3  0.5 28.5 | 14.2 17.7 26.3 36.3 37.6 37.8 31.3 24.8 28.2 |    28.3
BLIP3o-8B          | 46.2 50.4 24.1  0.5 30.3 | 14.8 20.7 28.3 39.6 43.4 51.0 35.9 20.4 31.8 |    31.3
OmniGen2-7B        | 67.9 64.1 48.3 19.2 49.9 | 24.7 23.2 43.3 63.1 46.1 54.2 36.5 24.1 39.4 |    42.9

Closed-Source Models
Seedream 3.0       | 79.9 78.0 63.7 47.6 67.3 | 36.8 33.6 50.3 75.1 54.9 61.7 59.1 31.2 50.3 |    56.0
Seedream 4.0       | 91.5 84.5 75.0 93.6 86.1 | 76.3 54.1 60.7 85.8 85.9 77.1 71.6 47.9 69.9 |    75.3
Gemini 2.0 Flash   | 67.5 68.5 49.7 62.9 62.1 | 39.3 39.7 47.9 69.3 58.5 63.7 51.2 39.9 51.2 |    54.8
Nano Banana Pro    | 91.9 85.5 83.4 98.1 89.7 | 90.8 73.6 77.8 90.7 90.4 84.2 77.0 76.7 82.7 |    85.0
Imagen 4           | 82.8 74.3 66.3 90.2 78.4 | 44.5 51.8 56.8 82.8 79.5 73.3 72.8 65.3 65.9 |    70.0
Imagen 4 Ultra     | 90.0 80.0 73.2 86.2 82.4 | 63.6 62.4 66.1 88.5 82.8 83.0 76.3 60.7 72.9 |    76.1

Main results on our T2I-CoReBench, assessing both composition and reasoning capabilities, evaluated by Gemini 2.5 Flash. Mean denotes the mean score within each capability, and Overall denotes the aggregate score across all 12 dimensions. The best and second-best results are marked in bold and underlined for open-source and closed-source models, respectively. Since most closed-source models enable Prompt Rewriting by default, we also report results with Prompt Rewriting enabled (blue rows; prompts rewritten using OpenAI o3) to ensure a fair comparison between open-source and closed-source models.
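For clarity on the aggregation, the table's arithmetic indicates that each capability Mean is the unweighted average of its dimension scores (4 for composition, 8 for reasoning), and that Overall averages all 12 dimension scores rather than the two capability Means. A minimal sketch of this inferred scheme (function and variable names are ours):

```python
def aggregate(comp: list[float], reas: list[float]) -> tuple[float, float, float]:
    """comp: 4 composition scores (MI, MA, MR, TR);
    reas: 8 reasoning scores (LR through RR).
    Returns (composition Mean, reasoning Mean, Overall)."""
    comp_mean = sum(comp) / len(comp)
    reas_mean = sum(reas) / len(reas)
    # Overall averages all 12 dimension scores, not the two capability Means.
    overall = (sum(comp) + sum(reas)) / (len(comp) + len(reas))
    return comp_mean, reas_mean, overall

# Check against the SD-3-Medium row:
comp = [59.1, 57.9, 35.4, 9.5]
reas = [22.1, 21.1, 35.3, 51.0, 37.4, 47.3, 35.0, 27.1]
print(aggregate(comp, reas))  # (40.475, 34.5375, 36.516...), i.e. 40.4 / 34.5 / 36.5 up to rounding
```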

BibTeX

@article{li2025easier,
  title={Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?},
  author={Li, Ouxiang and Wang, Yuan and Hu, Xinting and Huang, Huijuan and Chen, Rui and Ou, Jiarong and Tao, Xin and Wan, Pengfei and Feng, Fuli},
  journal={arXiv preprint arXiv:2509.03516},
  year={2025}
}