Abstract
Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed data-set (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In this paper, we propose a simple solution to this problem. It admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task. Our method can be seen as an extension of recent work on batch-RL that enables stable learning from conflicting data-sources. We find improvements on competitive baselines in a variety of RL tasks -- including standard continuous control benchmarks and multi-task learning for simulated and real-world robots.