Abstract
Offline reinforcement learning (RL) is crucial for real-world applications where exploration can be costly or unsafe. However, offline learned policies are often suboptimal, and further online fine-tuning is required. In this paper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop. We show that Bayesian design principles are crucial in solving such a dilemma. Instead of adopting optimistic or pessimistic policies, the agent should act in a way that matches its belief in optimal policies. Such a probability-matching agent can avoid a sudden performance drop while still being guaranteed to find the optimal policy. Based on our theoretical findings, we introduce a novel algorithm that outperforms existing methods on various benchmarks, demonstrating the efficacy of our approach. Overall, the proposed approach provides a new perspective on offline-to-online RL that has the potential to enable more effective learning from offline data.