Abstract
Learning a goal-oriented dialog policy is generally performed offline with supervised learning algorithms or online with reinforcement learning (RL). Additionally, as companies accumulate massive quantities of dialog transcripts between customers and trained human agents, encoder-decoder methods have gained popularity as agent utterances can be directly treated as supervision without the need for utterance-level annotations. However, one potential drawback of such approaches is that they myopically generate the next agent utterance without regard for dialog-level considerations. To resolve this concern, this paper describes an offline RL method for learning from unannotated corpora that can optimize a goal-oriented policy at both the utterance and dialog level. We introduce a novel reward function and use both on-policy and off-policy policy gradient to learn a policy offline without requiring online user interaction or an explicit state space definition.