Abstract
Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL training. However, because the scale of the constraints varies across tasks and datasets of differing quality, existing methods must meticulously tune hyperparameters to match each dataset, which is time-consuming and often impractical. We propose Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that dynamically balances RL and behavior cloning (BC) during training. We theoretically analyze its performance improvement guarantee. In experiments on 39 datasets across four D4RL domains, ASPC using a single hyperparameter configuration outperforms other adaptive constraint methods and state-of-the-art offline RL algorithms that require per-dataset tuning while incurring only minimal computational overhead. The code will be released at https://github.com/Colin-Jing/ASPC.