Abstract
Traditional benchmarking in NLP typically involves using static held-out test sets. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021) and CheckList (Ribeiro et al., 2020) have addressed these limitations through behavioral testing of NLP models with test types generated by a multistep human-annotated pipeline. Unfortunately, manually creating a variety of test types requires much human labor, often at prohibitive cost. In this work, we propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via LLMs using controlled generation, and then identifies challenging examples by comparing the predictions made by LLMs with those of task-specific NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks. We share our code at https://github.com/Loreley99/SynthEval_CheckList.
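The second stage described above, selecting challenging examples where the LLM and the task-specific model disagree, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `find_disagreements` and the two toy predictors are hypothetical stand-ins, and a plain label mismatch is assumed as the disagreement criterion.

```python
from typing import Callable, Iterable, List, Tuple


def find_disagreements(
    sentences: Iterable[str],
    llm_predict: Callable[[str], str],
    task_predict: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (sentence, llm_label, task_label) for every label mismatch.

    Assumption: both predictors map a sentence to a label from the same
    label set, and any mismatch marks the sentence as challenging.
    """
    challenging = []
    for sentence in sentences:
        llm_label = llm_predict(sentence)
        task_label = task_predict(sentence)
        if llm_label != task_label:
            challenging.append((sentence, llm_label, task_label))
    return challenging


# Hypothetical toy predictors, standing in for an LLM and a
# task-specific sentiment classifier.
def toy_llm(sentence: str) -> str:
    words = sentence.split()
    return "negative" if "not" in words or "bad" in words else "positive"


def toy_task_model(sentence: str) -> str:
    # Deliberately ignores negation, to produce a disagreement.
    return "positive" if "good" in sentence.split() else "negative"


generated = [
    "the food was good",
    "the food was not good",
    "the food was bad",
]
challenging_examples = find_disagreements(generated, toy_llm, toy_task_model)
```

In this toy run only the negated sentence is flagged. Such flagged examples are what the human experts would then inspect in the final stage to design templates and name failure types.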