To Switch or Not to Switch? Balanced Policy Switching in Offline Reinforcement Learning

Abstract

Reinforcement learning (RL) -- finding the optimal behaviour (also referred to as policy) maximizing the collected long-term cumulative reward -- is among the most influential approaches in machine learning with a large number of successful applications. In several decision problems, however, one faces the possibility of policy switching -- changing from the current policy to a new one -- which incurs a non-negligible cost (examples include the shifting of the currently applied educational technology, modernization of a computing cluster, and the introduction of a new webpage design), and in the decision one is limited to using historical data, with no possibility of further online interaction. Despite the importance of this offline learning scenario, to the best of our knowledge, very little effort has been made to tackle the key problem of balancing between the gain and the cost of switching in a flexible and principled way. Leveraging ideas from the area of optimal transport, we initiate the systematic study of policy switching in offline RL. We establish fundamental properties and design a Net Actor-Critic algorithm for the proposed novel switching formulation. Numerical experiments demonstrate the efficiency of our approach on multiple Gymnasium benchmarks.
