With Little Power Comes Great Responsibility

Abstract

Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
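To make the definition of power concrete, the following is a minimal simulation sketch. It is not taken from the released notebooks. It assumes two models are scored on independent test samples of the same size and compared with a pooled two-proportion z-test, which is a simplification since model comparisons on a shared test set are usually paired. The function names `two_proportion_p_value` and `simulate_power` and all accuracy values are hypothetical and chosen for illustration only.

```python
import math
import random


def two_proportion_p_value(k1, k2, n):
    """Two-sided p-value of a pooled two-proportion z-test.

    k1 and k2 are the numbers of correct predictions out of n for each model.
    Assumes independent samples of equal size n (an illustrative simplification).
    """
    pooled = (k1 + k2) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # no variation at all, so there is no evidence of a difference
    z = (k1 / n - k2 / n) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_power(acc_a, acc_b, n, alpha=0.05, trials=2000, seed=0):
    """Estimate power: the fraction of simulated experiments that reject the null.

    acc_a and acc_b are the assumed true accuracies of the two models,
    so a real effect exists whenever they differ.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Draw the number of correct predictions for each model on n examples.
        k_a = sum(rng.random() < acc_a for _ in range(n))
        k_b = sum(rng.random() < acc_b for _ in range(n))
        if two_proportion_p_value(k_a, k_b, n) < alpha:
            rejections += 1
    return rejections / trials
```

Calling `simulate_power(0.90, 0.92, n=300)` and then `simulate_power(0.90, 0.92, n=5000)` shows how the same assumed two-point accuracy gap is detected far less often on the smaller test set, which is the pattern the abstract describes for small benchmark test sets.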
