Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark

Abstract

We provide a new multi-task benchmark for evaluating text-to-image models. We perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has progressed rapidly, to the point that many recent models can create realistic high-resolution images for a variety of prompts. However, current text-to-image methods, and the broader body of research in vision-language understanding, still struggle with intricate text prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark that contains a suite of thirty-two tasks over multiple applications that capture a model's ability to handle different features of a text prompt. For example, we ask a model to generate a varying number of the same object to measure its ability to count, or we provide a text prompt with several objects that each have a different attribute to test its ability to match objects and attributes correctly. Rather than subjectively evaluating text-to-image results on a set of prompts, our new multi-task benchmark consists of challenge tasks at three difficulty levels (easy, medium, and hard) and human ratings for each generated image.
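The 3,600 figure follows from the study design if every rater scored every generated image, which is an assumption here since the abstract does not say so explicitly. The short sketch below enumerates that fully crossed design; the rater, task, and prompt labels are hypothetical placeholders, and only the counts are taken from the abstract.

```python
from itertools import product

# Counts come from the abstract; all labels below are hypothetical placeholders.
raters = [f"rater_{i}" for i in range(20)]   # twenty graduate students
models = ["Stable Diffusion", "DALL-E 2"]    # the two models compared
tasks = ["task_a", "task_b", "task_c"]       # three evaluated tasks (names not given)
levels = ["easy", "medium", "hard"]          # three difficulty levels
prompts = list(range(10))                    # ten prompts per task and level

# Assumption: a fully crossed design, i.e. each rater scores every image.
design = list(product(raters, models, tasks, levels, prompts))

total_ratings = len(design)
images_per_model = len(tasks) * len(levels) * len(prompts)

print(total_ratings)     # 20 * 2 * 3 * 3 * 10 = 3600
print(images_per_model)  # 3 * 3 * 10 = 90
```

Under this assumption each model contributes 90 images, and each image receives 20 ratings.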
