Abstract
Offline reinforcement learning learns from a static dataset without interacting with the environment, which ensures safety and therefore has promising application prospects. However, directly applying naive reinforcement learning algorithms usually fails in the offline setting because of inaccurate Q-value approximation caused by out-of-distribution (OOD) state-actions. Penalizing the Q-values of OOD state-actions is an effective way to address this problem. Among the methods that penalize OOD state-actions, count-based methods have achieved good results in discrete domains with a simple form. Inspired by this, a novel pseudo-count method for continuous domains, called the Grid-Mapping Pseudo-Count method (GPC), is proposed by extending the count-based method from discrete to continuous domains. First, the continuous state and action spaces are mapped to a discrete space using Grid-Mapping, and the Q-values of OOD state-actions are then constrained through pseudo-counts. Second, a theoretical proof shows that GPC can obtain appropriate uncertainty constraints under fewer assumptions than other pseudo-count methods. Third, GPC is combined with the Soft Actor-Critic algorithm (SAC) to obtain a new algorithm called GPC-SAC. Finally, experiments on D4RL datasets show that GPC-SAC achieves better performance and lower computational cost than other algorithms that constrain the Q-value.
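The grid-mapping idea can be illustrated with a minimal sketch. This is not the GPC implementation: the function names (`grid_index`, `build_counts`, `uncertainty_penalty`), the uniform grid with `bins` cells per dimension, and the penalty form `beta / sqrt(n + 1)` are all assumptions chosen for illustration. The sketch only shows how a continuous state-action pair might be mapped to a grid cell, and how the dataset count for that cell might be turned into an uncertainty penalty that is larger for rarely visited cells.

```python
import math
from collections import Counter


def grid_index(x, low, high, bins):
    """Map a continuous vector x to a tuple of integer grid-cell indices."""
    cell = []
    for xi, lo, hi in zip(x, low, high):
        # Clip to the bounding box, then rescale to a fraction in [0, 1].
        frac = (min(max(xi, lo), hi) - lo) / (hi - lo)
        # The upper boundary is folded into the last cell.
        cell.append(min(int(frac * bins), bins - 1))
    return tuple(cell)


def build_counts(dataset, low, high, bins):
    """Count how many dataset (state, action) pairs fall into each grid cell."""
    return Counter(grid_index(s + a, low, high, bins) for s, a in dataset)


def uncertainty_penalty(counts, s, a, low, high, bins, beta=1.0):
    """Assumed count-based penalty beta / sqrt(n + 1); largest when n == 0."""
    n = counts[grid_index(s + a, low, high, bins)]
    return beta / math.sqrt(n + 1)


# Toy dataset: 1-D states and 1-D actions, both bounded in [0, 1].
dataset = [([0.15], [0.25]), ([0.12], [0.22]), ([0.95], [0.85])]
low, high, bins = [0.0, 0.0], [1.0, 1.0], 10
counts = build_counts(dataset, low, high, bins)

seen = uncertainty_penalty(counts, [0.13], [0.24], low, high, bins)
unseen = uncertainty_penalty(counts, [0.55], [0.55], low, high, bins)
print(seen < unseen)  # the unvisited cell receives the larger penalty
```

In an actor-critic setting, a penalty of this kind would typically be subtracted from the Q-value target for the queried state-action pair, so that pairs far from the dataset are valued conservatively; how GPC-SAC does this exactly is not specified in the abstract.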