Hierarchical Reinforcement Learning for Open-Domain Dialog

  • 2019-09-17 01:57:18
  • Abdelrhman Saleh, Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Rosalind Picard


Open-domain dialog generation is a challenging problem; maximum likelihood training can lead to repetitive outputs, models have difficulty tracking long-term conversational goals, and training on standard movie or online datasets may lead to the generation of inappropriate, biased, or offensive text. Reinforcement Learning (RL) is a powerful framework that could potentially address these issues, for example by allowing a dialog model to optimize for reducing toxicity and repetitiveness. However, previous approaches which apply RL to open-domain dialog generation do so at the word level, making it difficult for the model to learn proper credit assignment for long-term conversational rewards. In this paper, we propose a novel approach to hierarchical reinforcement learning, VHRL, which uses policy gradients to tune the utterance-level embedding of a variational sequence model. This hierarchical approach provides greater flexibility for learning long-term, conversational rewards. We use self-play and RL to optimize for a set of human-centered conversation metrics, and show that our approach provides significant improvements -- in terms of both human evaluation and automatic metrics -- over state-of-the-art dialog models, including Transformers.
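
To make the idea of applying policy gradients at the utterance level rather than the word level concrete, here is a minimal toy sketch. It is not the VHRL implementation: it assumes a plain Gaussian policy over a small latent vector standing in for the utterance-level embedding, and a hypothetical `toy_reward` standing in for a conversation-level metric. The function names, the fixed `sigma`, the running-mean baseline, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import random


def toy_reward(z, target):
    """Hypothetical utterance-level reward: higher when z is close to target.

    Stands in for a conversation metric; it is NOT a reward from the paper.
    """
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))


def reinforce_latent(target, sigma=0.5, lr=0.005, steps=4000, seed=0):
    """REINFORCE on the mean of a Gaussian policy over a latent vector.

    The policy is z ~ N(mu, sigma^2 I) with sigma held fixed (an assumption
    made for simplicity). One reward is received per sampled z, i.e. per
    "utterance", rather than per word.
    """
    rng = random.Random(seed)
    mu = [0.0] * len(target)
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for _ in range(steps):
        # Sample one utterance-level latent from the current policy.
        z = [rng.gauss(m, sigma) for m in mu]
        r = toy_reward(z, target)
        advantage = r - baseline
        baseline += 0.05 * (r - baseline)
        # Gradient of log N(z; mu, sigma^2 I) w.r.t. mu is (z - mu) / sigma^2.
        mu = [m + lr * advantage * (zi - m) / sigma ** 2
              for m, zi in zip(mu, z)]
    return mu


if __name__ == "__main__":
    learned = reinforce_latent([1.0, -1.0])
    print("learned latent mean:", learned)
```

Because the reward is attached to a single sampled latent per utterance, the gradient estimate does not have to spread credit across individual word choices, which is the motivation the abstract gives for working at the utterance level.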

