Abstract
Recent text generation research has increasingly focused on open-endeddomains such as story and poetry generation. Because models built for suchtasks are difficult to evaluate automatically, most researchers in the spacejustify their modeling choices by collecting crowdsourced human judgments oftext quality (e.g., Likert scores of coherence or grammaticality) from AmazonMechanical Turk (AMT). In this paper, we first conduct a survey of 45open-ended text generation papers and find that the vast majority of them failto report crucial details about their AMT tasks, hindering reproducibility. Wethen run a series of story evaluation experiments with both AMT workers andEnglish teachers and discover that even with strict qualification filters, AMTworkers (unlike teachers) fail to distinguish between model-generated text andhuman-generated references. We show that AMT worker judgments improve when theyare shown model-generated output alongside human-generated references, whichenables the workers to better calibrate their ratings. Finally, interviews withthe English teachers provide deeper insights into the challenges of theevaluation process, particularly when rating model-generated text.