TuringAdvice: A Generative and Dynamic Evaluation of Language Use

  • 2021-04-13 01:05:17
  • Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi


We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today's models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.


