Abstract
Constraint-based offline reinforcement learning (RL) imposes policy constraints or penalties on the value function to mitigate overestimation errors caused by distributional shift. This paper focuses on a limitation of existing offline RL methods that penalize the value function: the penalty can introduce unnecessary bias into the value function, which may lead to underestimation bias. To address this concern, we propose Exclusively Penalized Q-learning (EPQ), which reduces estimation bias in the value function by selectively penalizing states that are prone to inducing estimation errors. Numerical results show that our method significantly reduces underestimation bias and improves performance in various offline control tasks compared to other offline RL methods.