Recent advances in deep learning have led to a resurgence in the popularity of natural language generation (NLG). Many deep-learning-based models, including recurrent neural networks and generative adversarial networks, have been proposed and applied to generating various types of text. Despite the fast development of methods, how to better evaluate the quality of these natural language generators remains a significant challenge. We conduct an in-depth empirical study of the existing evaluation methods for natural language generation. We compare human-based evaluators with a variety of automated evaluation procedures, including discriminative evaluators that measure how well the generated text can be distinguished from human-written text, as well as text overlap metrics that measure how similar the generated text is to human-written references. We measure to what extent these different evaluators agree on the ranking of a dozen state-of-the-art generators for online product reviews. We find that human evaluators do not correlate well with discriminative evaluators, which raises the larger question of whether adversarial accuracy is the correct objective for natural language generation. In general, distinguishing machine-generated text is a challenging task even for human evaluators, and their decisions tend to correlate better with text overlap metrics. We also find that diversity is an intriguing metric that is indicative of the assessments of different evaluators.
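
As a rough illustration of what it means for two evaluators to agree on a ranking of generators, the sketch below computes Spearman's rank correlation between two score lists. This is an assumption made for illustration: the abstract does not state which agreement statistic the study uses. The function names `ranks` and `spearman` and the score values are hypothetical placeholders, not results from the study.

```python
from typing import List, Sequence


def ranks(scores: Sequence[float]) -> List[float]:
    """Return 1-based ranks (highest score gets rank 1), averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next item has the same score (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all scores are tied")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical scores that two evaluators assign to the same four generators.
evaluator_a = [0.9, 0.7, 0.5, 0.3]
evaluator_b = [0.8, 0.6, 0.65, 0.2]
rho = spearman(evaluator_a, evaluator_b)
print(rho)
```

A value near 1 means the two evaluators order the generators almost identically, a value near -1 means they order them in reverse, and a value near 0 means their rankings are largely unrelated. Kendall's tau would be an equally reasonable choice for this kind of comparison.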