Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

1University of Science and Technology of China, 
2Kling Team, Kuaishou Technology,  3The University of Hong Kong
*Work done during internship at Kling Team, Kuaishou Technology.
Corresponding authors. Project lead.
  • [2025-10-01] 🌟 We have added the evaluation results using Qwen3-VL-235B-Thinking to our main paper (Table 7); it achieves the best human alignment among open-source MLLMs.
  • [2025-10-01] 🌟 We have posted an updated arXiv version with clearer descriptions and more comprehensive analyses.
  • [2025-09-20] 🌟 We have updated the evaluation results of Seedream 4.0.
  • [2025-09-08] 🌟 We have released the images generated by the evaluated T2I models in our benchmark to facilitate convenient evaluation with different MLLMs.
  • [2025-09-08] 🌟 We have released the benchmark data and code.
  • [2025-09-03] 🌟 We have released the paper; the data and code will follow in a few days, after the company's review.
Overview of our T2I-CoReBench. (a) Our benchmark comprehensively covers two fundamental T2I capabilities (i.e., composition and reasoning), further refined into 12 dimensions. (b–e) Our benchmark poses greater challenges to current T2I models, with higher compositional density than DPG-Bench and harder multi-step reasoning than R2I-Bench, enabling clearer performance differentiation across models under real-world complexities. Each image is scored based on the ratio of correctly generated elements.


Abstract

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred. These two aspects correspond to two core capabilities: composition and reasoning. Despite recent advances of T2I models in both, existing benchmarks remain limited in their evaluation: they fail to provide comprehensive coverage across and within the two capabilities, and they largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene-graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), yielding a 12-dimensional evaluation taxonomy. To increase complexity, in line with the inherent complexity of real-world scenarios, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To facilitate fine-grained and reliable evaluation, we pair each evaluation prompt with a checklist of individual yes/no questions, each assessing one intended element independently. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 28 current T2I models reveal that their composition capability remains limited in high-density compositional scenarios, while their reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.
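To make the checklist-based protocol concrete, below is a minimal Python sketch of the per-image scoring described above. The ask_mllm helper and the sample questions are hypothetical (a stand-in for whatever MLLM judge is used); it is assumed to answer a single yes/no checklist question about one image.

    def score_image(image_path, checklist, ask_mllm):
        """Score one generated image as the fraction of checklist
        questions answered 'yes' (i.e., the intended element is present)."""
        answers = [ask_mllm(image_path, question) for question in checklist]
        return sum(answer == "yes" for answer in answers) / len(answers)

    # Hypothetical usage, with one checklist question per intended element:
    checklist = [
        "Is there a red apple on the table?",       # explicit: instance + attribute
        "Is the apple to the left of the teacup?",  # explicit: relation
        "Does the lighting suggest sunset?",        # implicit: must be inferred
    ]
    # score = score_image("output.png", checklist, ask_mllm=my_judge)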

1. Benchmark Comparison

                        Composition        Reasoning
                                           Deductive          Inductive   Abductive
Benchmarks              MI  MA  MR  TR     LR  BR  HR  PR     GR  AR      CR  RR
T2I-CompBench           ◐   ◐   ◐   ○      ○   ○   ○   ○      ○   ○       ○   ○
GenEval                 ◐   ◐   ◐   ○      ○   ○   ○   ○      ○   ○       ○   ○
GenAI-Bench             ◐   ◐   ◐   ○      ○   ○   ○   ○      ○   ○       ○   ○
DPG-Bench               ◐   ◐   ◐   ○      ○   ○   ○   ○      ○   ○       ○   ○
ConceptMix              ◐   ◐   ◐   ○      ○   ○   ○   ○      ○   ○       ○   ○
TIIF-Bench              ◐   ◐   ◐   ○      ○   ○   ○   ○      ○   ○       ○   ○
LongBench-T2I           ◐   ◐   ◐   ○      ○   ○   ○   ○      ○   ○       ○   ○
PRISM-Bench             ◐   ◐   ◐   ◐      ○   ○   ○   ○      ○   ○       ○   ○
UniGenBench             ◐   ◐   ◐   ◐      ◐   ○   ○   ○      ○   ○       ◐   ○
Commonsense-T2I         ○   ○   ○   ○      ○   ○   ○   ○      ○   ○       ◐   ○
PhyBench                ○   ○   ○   ○      ○   ◐   ○   ○      ○   ○       ◐   ○
WISE                    ○   ○   ○   ○      ○   ○   ○   ○      ○   ○       ◐   ○
T2I-ReasonBench         ○   ○   ○   ◐      ○   ○   ○   ○      ○   ○       ◐   ○
R2I-Bench               ○   ○   ◐   ○      ◐   ◐   ◐   ○      ○   ○       ◐   ◐
OneIG-Bench             ◐   ◐   ◐   ◐      ○   ○   ○   ○      ○   ○       ◐   ○
T2I-CoReBench (Ours)    ●   ●   ●   ●      ●   ●   ●   ●      ●   ●       ●   ●

Comparison between our T2I-CoReBench and existing T2I benchmarks. T2I-CoReBench comprehensively covers 12 evaluation dimensions spanning both composition and reasoning scenarios. Legend: ● high-complexity coverage (visual elements > 5 or one-to-many/many-to-one inference), ◐ simple coverage (visual elements ≤ 5 or one-to-one inference), ○ not covered.

2. Examples of Each Dimension

3. Leaderboard

                      Composition                     Reasoning
Models                MI    MA    MR    TR    Mean    LR    BR    HR    PR    GR    AR    CR    RR    Mean    Overall

Diffusion Models
SD-3-Medium           59.1  57.9  35.4   9.5  40.4    22.1  21.1  35.3  51.0  37.4  47.3  35.0  27.1  34.5    36.5
SD-3.5-Medium         59.5  60.6  33.1  10.6  41.0    19.9  20.5  33.5  53.7  33.4  52.7  35.6  22.0  33.9    36.3
SD-3.5-Large          57.5  60.0  32.9  15.6  41.5    22.5  22.4  34.2  52.5  35.5  53.0  42.3  25.2  35.9    37.8
FLUX.1-schnell        65.4  63.1  47.6  22.4  49.6    25.0  25.1  40.9  64.7  47.6  54.0  39.6  22.9  40.0    43.2
FLUX.1-dev            58.6  60.3  44.1  31.1  48.6    24.8  23.0  36.0  61.8  42.4  57.2  36.3  30.3  39.0    42.2
FLUX.1-Krea-dev       70.7  71.1  53.2  28.9  56.0    30.3  26.1  44.5  70.6  50.5  57.5  46.3  28.7  44.3    48.2
PixArt-α              40.2  42.2  14.2   3.3  25.0    11.6  11.6  21.1  30.4  22.6  44.4  26.7  20.9  23.7    24.1
PixArt-Σ              47.2  49.7  23.8   2.8  30.9    14.7  18.3  26.7  39.2  25.7  44.9  33.9  24.3  28.5    29.3
HiDream-I1            62.5  62.0  42.9  33.9  50.3    34.2  24.5  40.9  53.2  34.2  50.3  46.1  31.7  39.4    43.0
Qwen-Image            81.4  79.6  65.6  85.5  78.0    41.1  32.2  48.2  75.1  56.5  53.3  61.9  26.4  49.3    58.9

Autoregressive Models
Infinity-8B           63.9  63.4  47.5  10.8  46.4    28.6  25.9  42.9  62.6  47.3  59.2  46.9  24.6  42.3    43.6
GoT-R1-7B             48.8  55.6  32.9   6.1  35.8    22.1  19.2  31.3  49.2  34.8  46.2  32.1  14.6  31.2    32.7

Unified Models
BAGEL                 64.9  65.2  45.8   9.7  46.4    23.4  21.9  33.0  51.6  31.2  50.4  32.4  29.3  34.1    38.2
BAGEL w/ Think        57.7  60.8  37.8   2.2  39.6    25.5  25.4  33.9  58.6  53.5  56.9  41.6  39.8  41.9    41.1
show-o2-1.5B          59.5  60.3  36.1   4.6  40.1    21.6  21.8  37.1  47.7  39.9  44.7  29.0  24.0  33.2    35.5
show-o2-7B            59.4  61.8  38.1   2.2  40.4    23.2  23.1  37.5  51.6  40.9  47.2  32.2  21.3  34.6    36.5
Janus-Pro-1B          51.0  54.5  33.8   2.9  35.5    12.9  18.1  24.7  13.4   7.1  15.1   6.7   6.4  13.0    20.5
Janus-Pro-7B          54.4  59.3  40.9   7.5  40.5    19.8  20.9  34.6  22.4  11.5  30.4   8.7   9.8  19.8    26.7
BLIP3o-4B             45.6  47.5  20.3   0.5  28.5    14.2  17.7  26.3  36.3  37.6  37.8  31.3  24.8  28.2    28.3
BLIP3o-8B             46.2  50.4  24.1   0.5  30.3    14.8  20.7  28.3  39.6  43.4  51.0  35.9  20.4  31.8    31.3
OmniGen2-7B           67.9  64.1  48.3  19.2  49.9    24.7  23.2  43.3  63.1  46.1  54.2  36.5  24.1  39.4    42.9

Closed-Source Models
Seedream 3.0          79.9  78.0  63.7  47.6  67.3    36.8  33.6  50.3  75.1  54.9  61.7  59.1  31.2  50.3    56.0
Seedream 4.0          91.5  84.5  75.0  93.6  86.1    76.3  54.1  60.7  85.8  85.9  77.1  71.6  47.9  69.9    75.3
Gemini 2.0 Flash      67.5  68.5  49.7  62.9  62.1    39.3  39.7  47.9  69.3  58.5  63.7  51.2  39.9  51.2    54.8
Nano Banana           85.7  77.9  72.6  86.3  80.6    64.5  64.9  67.1  85.2  84.1  83.1  71.3  68.7  73.6    75.9
Imagen 4              82.8  74.3  66.3  90.2  78.4    44.5  51.8  56.8  82.8  79.5  73.3  72.8  65.3  65.9    70.0
Imagen 4 Ultra        90.0  80.0  73.2  86.2  82.4    63.6  62.4  66.1  88.5  82.8  83.0  76.3  60.7  72.9    76.1
GPT-Image             84.1  75.9  72.7  86.4  79.8    59.0  54.8  65.6  87.3  76.5  82.0  70.9  56.1  69.0    72.6

Main results on our T2I-CoReBench, assessing both composition and reasoning capabilities, evaluated by Gemini 2.5 Flash. Mean denotes the mean score for each capability. Within the open-source and closed-source groups, the best and second-best results are marked in bold and underlined, respectively.
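The Mean and Overall columns are consistent, up to rounding, with unweighted averages over the per-dimension scores. Below is a minimal sketch of this assumed aggregation, using the dimension abbreviations from the table header:

    # Dimension groups as laid out in the table header.
    COMPOSITION = ["MI", "MA", "MR", "TR"]
    REASONING = ["LR", "BR", "HR", "PR", "GR", "AR", "CR", "RR"]

    def aggregate(scores):
        """Per-capability means plus the overall score, assumed here to be
        the unweighted mean over all 12 dimensions."""
        comp = [scores[d] for d in COMPOSITION]
        reas = [scores[d] for d in REASONING]
        return {
            "comp_mean": sum(comp) / len(comp),
            "reas_mean": sum(reas) / len(reas),
            "overall": (sum(comp) + sum(reas)) / (len(comp) + len(reas)),
        }

    # Example with the SD-3-Medium row: yields comp_mean ~40.5, reas_mean ~34.5,
    # and overall ~36.5, matching the reported 40.4 / 34.5 / 36.5 up to rounding.
    sd3 = dict(zip(COMPOSITION + REASONING,
                   [59.1, 57.9, 35.4, 9.5,
                    22.1, 21.1, 35.3, 51.0, 37.4, 47.3, 35.0, 27.1]))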

BibTeX

@article{li2025easier,
  title={Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?},
  author={Li, Ouxiang and Wang, Yuan and Hu, Xinting and Huang, Huijuan and Chen, Rui and Ou, Jiarong and Tao, Xin and Wan, Pengfei and Feng, Fuli},
  journal={arXiv preprint arXiv:2509.03516},
  year={2025}
}