Abstract
Model-based Offline Reinforcement Learning trains policies based on offline datasets and model dynamics, without direct real-world environment interactions. However, this method is inherently challenged by distribution shift. Previous approaches have primarily tackled this issue directly, leveraging off-policy mechanisms and heuristic uncertainty in model dynamics, but they resulted in inconsistent objectives and lacked a unified theoretical foundation. This paper offers a comprehensive analysis that disentangles the problem into two key components: model bias and policy shift. We provide both theoretical insights and empirical evidence to demonstrate how these factors lead to inaccuracies in value function estimation and impose implicit restrictions on policy learning. To address these challenges, we derive adjustment terms for model bias and policy shift within a unified probabilistic inference framework. These adjustments are integrated into the vanilla reward function to create a novel Shifts-aware Reward (SAR), which aims to refine value learning and facilitate policy training. Furthermore, we introduce Shifts-aware Model-based Offline Reinforcement Learning (SAMBO-RL), a practical framework that efficiently trains classifiers to approximate the SAR for policy optimization. Empirically, we show that SAR effectively mitigates distribution shift and that SAMBO-RL demonstrates superior performance across various benchmarks, underscoring its practical effectiveness and validating our theoretical analysis.