CAWR: Corruption-Averse Advantage-Weighted Regression for Robust Policy Optimization

Abstract

Offline reinforcement learning (offline RL) algorithms often require additional constraints or penalty terms to address distribution shift issues, such as adding implicit or explicit policy constraints during policy optimization to reduce the estimation bias of functions. This paper focuses on a limitation of the Advantage-Weighted Regression family (AWRs), i.e., the potential for learning over-conservative policies due to data corruption, specifically the poor explorations in suboptimal offline data. We study it from two perspectives: (1) how poor explorations impact the theoretically optimal policy based on KL divergence, and (2) how such poor explorations affect the approximation of the theoretically optimal policy. We prove that such over-conservatism is mainly caused by the sensitivity of the loss function for policy optimization to poor explorations, and the proportion of poor explorations in offline datasets. To address this concern, we propose Corruption-Averse Advantage-Weighted Regression (CAWR), which incorporates a set of robust loss functions during policy optimization and an advantage-based prioritized experience replay method to filter out poor explorations. Numerical experiments on the D4RL benchmark show that our method can learn superior policies from suboptimal offline data, significantly enhancing the performance of policy optimization.
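The two ingredients named above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say which robust losses or which prioritization rule CAWR uses, so the Huber loss and the softmax priority below are stand-in assumptions, and all function names and parameters (`awr_weights`, `huber`, `weighted_policy_loss`, `advantage_priorities`, `beta`, `delta`, `temperature`, `max_weight`) are hypothetical. The sketch only shows why a squared-error regression loss is sensitive to a single poor action in the data, and how advantage-based sampling can down-weight such transitions.

```python
import numpy as np

def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponential advantage weights exp(A / beta), clipped for numerical stability."""
    return np.minimum(np.exp(advantages / beta), max_weight)

def huber(residual, delta=1.0):
    """Huber loss (assumed stand-in for a robust loss): quadratic near zero, linear in the tails."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * abs_r ** 2, delta * (abs_r - 0.5 * delta))

def weighted_policy_loss(pred_actions, data_actions, advantages, beta=1.0, robust=True):
    """Advantage-weighted regression loss over a batch of shape (N, action_dim)."""
    residual = pred_actions - data_actions
    per_dim = huber(residual) if robust else 0.5 * residual ** 2
    per_sample = per_dim.sum(axis=-1)
    return float(np.mean(awr_weights(advantages, beta) * per_sample))

def advantage_priorities(advantages, temperature=1.0):
    """Sampling probabilities that favor high-advantage transitions (softmax, an assumption)."""
    z = (advantages - advantages.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

# Toy batch: the third transition is a poor exploration (far-off action, low advantage).
pred = np.zeros((3, 1))
data = np.array([[0.1], [-0.2], [10.0]])
adv = np.array([0.5, 0.2, -1.0])

l2_loss = weighted_policy_loss(pred, data, adv, robust=False)
robust_loss = weighted_policy_loss(pred, data, adv, robust=True)
probs = advantage_priorities(adv)
```

In this toy batch the squared-error loss is dominated by the single outlying action, while the Huber variant grows only linearly with it, and the sampling probabilities place the least mass on the low-advantage transition.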
