Abstract
Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries. We release our analysis code and survey data.
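To make the comparison concrete, the following is a minimal sketch of how a traditional metric such as FKGL could be scored and correlated with human judgments. It is not the released analysis code. The function names (`count_syllables`, `fkgl`, `pearson`) and the vowel-group syllable heuristic are assumptions made for illustration; only the standard FKGL formula and the definition of the Pearson correlation are taken as given.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels.

    Assumption: a crude heuristic for illustration; real tools often
    use pronunciation dictionaries instead.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level.

    FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    Higher values indicate text that is nominally harder to read.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Under these assumptions, one would compute `fkgl` for each summary and pass the resulting scores, together with the corresponding human readability ratings, to `pearson`. Because FKGL increases with difficulty, a metric that tracks human judgments well would correlate negatively with ratings in which higher means easier to read.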