Abstract
Reinforcement learning techniques leveraging deep learning have made tremendous progress in recent years. However, the complexity of neural networks prevents practitioners from understanding their behavior. Decision trees have gained increased attention in supervised learning for their inherent interpretability, enabling modelers to understand the exact prediction process after learning. This paper considers the problem of optimizing interpretable decision tree policies to replace neural networks in reinforcement learning settings. Previous works have relaxed the tree structure, restricted to optimizing only tree leaves, or applied imitation learning techniques to approximately copy the behavior of a neural network policy with a decision tree. We propose the Decision Tree Policy Optimization (DTPO) algorithm that directly optimizes the complete decision tree using policy gradients. Our technique uses established decision tree heuristics for regression to perform policy optimization. We empirically show that DTPO is a competitive algorithm compared to imitation learning algorithms for optimizing decision tree policies in reinforcement learning.