End-to-End Offline Goal-Oriented Dialog Policy Learning via Policy Gradient

  • 2017-12-07 19:52:50
  • Li Zhou, Kevin Small, Oleg Rokhlenko, Charles Elkan

Abstract

Learning a goal-oriented dialog policy is generally performed offline with supervised learning algorithms or online with reinforcement learning (RL). Additionally, as companies accumulate massive quantities of dialog transcripts between customers and trained human agents, encoder-decoder methods have gained popularity as agent utterances can be directly treated as supervision without the need for utterance-level annotations. However, one potential drawback of such approaches is that they myopically generate the next agent utterance without regard for dialog-level considerations. To resolve this concern, this paper describes an offline RL method for learning from unannotated corpora that can optimize a goal-oriented policy at both the utterance and dialog level. We introduce a novel reward function and use both on-policy and off-policy policy gradient to learn a policy offline without requiring online user interaction or an explicit state space definition.
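
To make the idea of learning a policy from logged data concrete, the following is a minimal sketch of an off-policy policy-gradient update. It is not the paper's method: the abstract does not specify the reward function, the model, or the estimator, so everything here is an assumption for illustration. The sketch uses a stateless softmax policy over three candidate agent responses, a made-up log of (action, reward, behavior probability) tuples, and a clipped importance weight, which is one standard way to reuse data collected by a different policy. The names `off_policy_gradient`, `train`, and `logged` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def off_policy_gradient(logits, logged, clip=5.0):
    """Importance-weighted REINFORCE gradient with respect to the logits.

    logged: list of (action, reward, behavior_prob) tuples, where
    behavior_prob is the probability the logging policy gave the action.
    All values are illustrative assumptions, not data from the paper.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, reward, behavior_prob in logged:
        # Clipped importance weight corrects for the logging policy.
        weight = min(probs[action] / behavior_prob, clip)
        for k in range(len(logits)):
            # d log pi(action) / d logit_k = 1[k == action] - pi(k)
            dlog = (1.0 if k == action else 0.0) - probs[k]
            grad[k] += weight * reward * dlog / len(logged)
    return grad


def train(logged, n_actions, steps=200, lr=0.5):
    """Gradient ascent on the logged data only; no online interaction."""
    logits = [0.0] * n_actions
    for _ in range(steps):
        grad = off_policy_gradient(logits, logged)
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)


# Hypothetical log: a uniform logging policy over three responses,
# with response 1 receiving the highest reward.
logged = [(0, 0.0, 1 / 3), (1, 1.0, 1 / 3), (2, 0.2, 1 / 3), (1, 1.0, 1 / 3)]
policy = train(logged, n_actions=3)
print(policy)
```

In this toy setting the learned distribution shifts toward the response with the highest logged reward. A dialog policy would condition on the conversation history and combine utterance-level and dialog-level rewards, both of which this sketch omits.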

 
