An Adversarially-Learned Turing Test for Dialog Generation Models

Abstract

The design of better automated dialogue evaluation metrics offers the potential to accelerate evaluation research on conversational AI. However, existing trainable dialogue evaluation models are generally restricted to classifiers trained in a purely supervised manner, which are at significant risk from adversarial attacks (e.g., a nonsensical response that receives a high classification score). To alleviate this risk, we propose an adversarial training approach to learn a robust model, ATT (Adversarial Turing Test), that discriminates machine-generated responses from human-written replies. In contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples using reinforcement learning. The key benefit of this unrestricted adversarial training approach is that it allows the discriminator to improve its robustness in an iterative attack-defense game. Our discriminator shows high accuracy against strong attackers, including DialoGPT and GPT-3.
