Abstract
Value-based reinforcement-learning algorithms are currently state-of-the-art in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are currently limited by their need for an on-policy critic, which severely constrains how the critic is learned. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free actor-critic reinforcement-learning algorithm for continuous states and discrete actions, with off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we show approximates Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable and, contrary to other state-of-the-art algorithms, unusually forgiving of poorly configured hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, A3C and ACKTR on a variety of tasks. Source code: https://github.com/vub-ai-lab/bdpi.
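
To make the phrase "slowly imitating the average greedy policy of the critics" concrete, the following is a minimal sketch for a single state with tabular action values. It is not taken from the BDPI source code: the function names, the linear mixing rule, and the `rate` parameter are illustrative assumptions, and a tabular state stands in for the continuous states the algorithm targets.

```python
def greedy_policy(q_row):
    """One-hot distribution on the highest-valued action of one critic."""
    best = max(range(len(q_row)), key=lambda a: q_row[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_row))]


def average_greedy_policy(critic_rows):
    """Average the greedy policies of several critics for the same state."""
    n_actions = len(critic_rows[0])
    greedy = [greedy_policy(row) for row in critic_rows]
    return [sum(g[a] for g in greedy) / len(greedy) for a in range(n_actions)]


def actor_step(actor_probs, critic_rows, rate=0.05):
    """Move the actor a small step toward the average greedy policy.

    `rate` is a hypothetical imitation rate (an assumption of this sketch);
    a small value makes the actor change slowly.
    """
    target = average_greedy_policy(critic_rows)
    return [(1.0 - rate) * p + rate * t for p, t in zip(actor_probs, target)]


# Three hypothetical critics disagree about the best of two actions.
critics = [[1.0, 2.0], [0.5, 3.0], [2.0, 1.0]]
actor = [0.5, 0.5]
new_actor = actor_step(actor, critics, rate=0.1)
```

Because two of the three critics prefer the second action, the actor shifts probability toward it while remaining a valid distribution. Where the critics disagree the averaged target stays spread out, which gives one intuition for why the resulting exploration is state-specific.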