Abstract
Batch reinforcement learning enables policy learning without direct interaction with the environment during training, relying exclusively on previously collected sets of interactions. This approach is, therefore, well-suited for high-risk and cost-intensive applications, such as industrial control. Learned policies are commonly restricted to act in a similar fashion as observed in the batch. In a real-world scenario, learned policies are deployed in the industrial system, inevitably leading to the collection of new data that can subsequently be added to the existing recording. The process of learning and deployment can thus take place multiple times throughout the lifespan of a system. In this work, we propose to exploit this iterative nature of applying offline reinforcement learning to guide learned policies towards efficient and informative data collection during deployment, leading to continuous improvement of learned policies while remaining within the support of collected data. We present an algorithmic methodology for iterative batch reinforcement learning based on ensemble-based model-based policy search, augmented with safety and, importantly, a diversity criterion.