TuringAdvice: A Generative and Dynamic Evaluation of Language Use

Abstract

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today's models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
