Sample Efficient Deep Reinforcement Learning for Dialogue Systems with Large Action Spaces

Abstract

In spoken dialogue systems, we aim to deploy artificial intelligence to build automated dialogue agents that can converse with humans. A part of this effort is the policy optimisation task, which attempts to find a policy describing how to respond to humans, in the form of a function taking the current state of the dialogue and returning the response of the system. In this paper, we investigate deep reinforcement learning approaches to solve this problem. Particular attention is given to actor-critic methods, off-policy reinforcement learning with experience replay, and various methods aimed at reducing the bias and variance of estimators. When combined, these methods result in the previously proposed ACER algorithm that gave competitive results in gaming environments. These environments, however, are fully observable and have a relatively small action set, so in this paper we examine the application of ACER to dialogue policy optimisation. We show that this method beats the current state-of-the-art in deep learning approaches for spoken dialogue systems. This not only leads to a more sample efficient algorithm that can train faster, but also allows us to apply the algorithm in more difficult environments than before. We thus experiment with learning in a very large action space, which has two orders of magnitude more actions than previously considered. We find that ACER trains significantly faster than the current state-of-the-art.
