Abstract
Offline reinforcement learning seeks to derive improved policies entirely from historical data but often struggles with over-optimistic value estimates for out-of-distribution (OOD) actions. This issue is typically mitigated via policy constraint or conservative value regularization methods. However, these approaches may impose overly restrictive constraints or produce biased value estimates, potentially limiting performance improvements. To balance exploitation and restriction, we propose Imagination-Limited Q-learning (ILQ), which aims to maintain the optimism that OOD actions deserve within appropriate limits. Specifically, we use a dynamics model to imagine OOD action-values and then clip the imagined values with the maximum behavior values. This design preserves a reasonable evaluation of OOD actions as far as possible while avoiding over-optimism. Theoretically, we prove the convergence of ILQ under tabular Markov decision processes. In particular, we show that the error bound between the estimated and optimal values of OOD state-action pairs has the same magnitude as that of in-distribution ones, indicating that the bias in value estimates is effectively mitigated. Empirically, our method achieves state-of-the-art performance on a wide range of tasks in the D4RL benchmark.
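The imagine-then-clip step described above can be illustrated with a minimal tabular sketch. It is not the paper's implementation. It assumes a one-step imagined value of the form r_hat + gamma * max Q(s_hat', ·) and takes the limit to be the largest Q-value among actions observed at the same state in the dataset. The names `clipped_imagined_target`, `r_hat`, `next_state`, and `in_data`, as well as all numbers, are hypothetical and chosen only for illustration.

```python
import numpy as np

def clipped_imagined_target(q, s, a, r_hat, next_state, in_data, gamma):
    """Illustrative target for an OOD pair (s, a) in a tabular setting.

    q          : (S, A) array of current Q-values
    r_hat      : (S, A) array of rewards predicted by a learned model (assumed)
    next_state : (S, A) integer array of next states predicted by the model (assumed)
    in_data    : (S, A) boolean mask, True where (s, a) appears in the dataset
    """
    # Imagine a one-step value for the OOD action using the dynamics model.
    imagined = r_hat[s, a] + gamma * q[next_state[s, a]].max()
    # Limit: the maximum value among in-dataset (behavior) actions at s.
    limit = q[s][in_data[s]].max()
    # Clip the imagined value so it cannot exceed the behavior maximum.
    return float(min(imagined, limit))

# Toy example with 2 states and 3 actions (all values are made up).
q = np.array([[1.0, 5.0, 2.0],
              [0.0, 3.0, 10.0]])
in_data = np.array([[True, False, True],
                    [True, True, False]])
r_hat = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.25]])
next_state = np.array([[0, 1, 0],
                       [1, 1, 0]])

# OOD pair (0, 1): imagined value 6.0 exceeds the limit 2.0, so it is clipped.
clipped = clipped_imagined_target(q, 0, 1, r_hat, next_state, in_data, gamma=0.5)
# OOD pair (1, 2): imagined value 2.75 is below the limit 3.0, so it is kept.
kept = clipped_imagined_target(q, 1, 2, r_hat, next_state, in_data, gamma=0.5)
```

In this sketch the clip only binds when the model's imagined value is larger than anything supported by the data at that state; otherwise the imagined value passes through unchanged, which is one way to read "optimism within appropriate limits".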