Abstract
Simplicity is a critical inductive bias for designing data-driven controllers, especially when robustness is important. Despite the impressive results of deep reinforcement learning in complex control tasks, it is prone to capturing intricate and spurious correlations between observations and actions, leading to failure under slight perturbations to the environment. To tackle this problem, we introduce a novel inductive bias towards simple policies in reinforcement learning. The simplicity inductive bias is introduced by minimizing the entropy of entire action trajectories, which corresponds to the number of bits required to describe the action trajectories after the agent observes the state trajectories. Our reinforcement learning agent, Trajectory Entropy Reinforcement Learning, is optimized to minimize the trajectory entropy while maximizing rewards. We show that the trajectory entropy can be effectively estimated by learning a variational parameterized action prediction model, and we use the prediction model to construct an information-regularized reward function. Furthermore, we construct a practical algorithm that enables the joint optimization of the models, including the policy and the prediction model. Experimental evaluations on several high-dimensional locomotion tasks show that our learned policies produce more cyclical and consistent action trajectories, and achieve superior performance and greater robustness to noise and dynamic changes than the state of the art.
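As an illustration of the information-regularized reward idea, the following is a minimal sketch and not the paper's implementation. It assumes the prediction model is a diagonal Gaussian with a fixed standard deviation whose means are supplied by the caller, and that the regularizer is added per step with a hypothetical weight `alpha`. The function names, `alpha`, and `std` are all assumptions made for illustration. The negative log-likelihood under any such model upper-bounds the true conditional entropy in expectation, so rewarding high log-likelihood pushes the policy towards predictable, low-entropy action trajectories.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density (in nats) of `action` under a diagonal Gaussian.

    `mean` stands in for the output of a learned action prediction model;
    a fixed `std` is an assumption of this sketch.
    """
    return sum(
        -0.5 * ((a - m) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for a, m in zip(action, mean)
    )


def regularized_rewards(rewards, actions, predicted_means, alpha=0.1, std=1.0):
    """Return r_t + alpha * log q(a_t | context) for every step.

    Actions that the prediction model explains well are penalized less,
    which favors simple, compressible action trajectories.
    """
    return [
        r + alpha * gaussian_log_prob(a, m, std)
        for r, a, m in zip(rewards, actions, predicted_means)
    ]


# Hypothetical two-step trajectory with one-dimensional actions.
rewards = [1.0, 1.0]
actions = [[0.0], [2.0]]
predicted_means = [[0.0], [0.0]]  # the second action is poorly predicted
shaped = regularized_rewards(rewards, actions, predicted_means)
```

In this toy trajectory the second step receives a lower shaped reward than the first, because its action lies further from the predicted mean. In a full agent the predicted means would come from a model trained jointly with the policy.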