Abstract
Combining deep model-free reinforcement learning with on-line planning is a promising approach to building on the successes of deep RL. On-line planning with look-ahead trees has proven successful in environments where transition models are known a priori. However, in complex environments where transition models need to be learned from data, the deficiencies of learned models have limited their utility for planning. To address these challenges, we propose TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions. TreeQN dynamically constructs a tree by recursively applying a transition model in a learned abstract state space and then aggregating predicted rewards and state-values using a tree backup to estimate Q-values. We also propose ATreeC, an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network. Both approaches are trained end-to-end, such that the learned model is optimised for its actual use in the tree. We show that TreeQN and ATreeC outperform n-step DQN and A2C on a box-pushing task, as well as n-step DQN and value prediction networks (Oh et al. 2017) on multiple Atari games. Furthermore, we present ablation studies that demonstrate the effect of different auxiliary losses on learning transition models.
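To make the recursive tree construction and backup concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes a plain max backup over child Q-values and a fixed discount `gamma`, and it replaces the learned transition, reward, and value networks with hand-written toy functions. The names `tree_q_values`, `toy_transition`, `toy_reward`, and `toy_value` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

State = List[float]  # stand-in for a learned abstract state vector


def tree_q_values(
    state: State,
    depth: int,
    num_actions: int,
    transition: Callable[[State, int], State],
    reward: Callable[[State, int], float],
    value: Callable[[State], float],
    gamma: float = 0.99,
) -> List[float]:
    """Estimate one Q-value per action by expanding a tree of the given depth.

    Assumption: children are aggregated with a plain max backup, and the
    state-value function is only evaluated at the leaves.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    q_values = []
    for action in range(num_actions):
        next_state = transition(state, action)  # one step in abstract space
        r = reward(state, action)               # predicted immediate reward
        if depth == 1:
            backup = value(next_state)          # leaf: use the state-value
        else:
            # Inner node: recurse, then back up the best child Q-value.
            backup = max(
                tree_q_values(next_state, depth - 1, num_actions,
                              transition, reward, value, gamma)
            )
        q_values.append(r + gamma * backup)
    return q_values


# Toy stand-ins for the learned components (illustrative assumptions only).
def toy_transition(state: State, action: int) -> State:
    return [x + float(action) for x in state]


def toy_reward(state: State, action: int) -> float:
    return float(action)


def toy_value(state: State) -> float:
    return sum(state)


if __name__ == "__main__":
    print(tree_q_values([0.0], 2, 2, toy_transition, toy_reward, toy_value,
                        gamma=0.5))
```

In an end-to-end trained version, the three toy functions would be differentiable networks, so that gradients from the Q-value loss flow through every node of the tree.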