An Optimal Discriminator Weighted Imitation Perspective for Reinforcement Learning

Abstract

We introduce Iterative Dual Reinforcement Learning (IDRL), a new method that takes an optimal discriminator-weighted imitation view of solving RL. Our method is motivated by a simple experiment in which we find that training a discriminator using the offline dataset plus an additional expert dataset and then performing discriminator-weighted behavior cloning gives strong results on various types of datasets. That optimal discriminator weight is quite similar to the learned visitation distribution ratio in Dual-RL; however, we find that current Dual-RL methods do not correctly estimate that ratio. In IDRL, we propose a correction method to iteratively approach the optimal visitation distribution ratio in the offline dataset given no additional expert dataset. During each iteration, IDRL removes zero-weight suboptimal transitions using the learned ratio from the previous iteration and runs Dual-RL on the remaining subdataset. This can be seen as replacing the behavior visitation distribution with the optimized visitation distribution from the previous iteration, which theoretically gives a curriculum of improved visitation distribution ratios that are closer to the optimal discriminator weight. We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. IDRL beats strong Primal-RL and Dual-RL baselines in terms of both performance and stability on all datasets.
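
The filter-and-re-estimate loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `iterative_filter`, `toy_ratio_estimator`, and `weighted_bc_loss` are hypothetical, and the toy estimator (reward above the subset mean) is only a placeholder for a real Dual-RL visitation-ratio estimate. The sketch shows the control flow only: estimate a ratio on the current subset, drop zero-weight transitions, repeat, then use the final weights in a weighted behavior cloning loss.

```python
import numpy as np


def toy_ratio_estimator(rewards):
    """Hypothetical stand-in for a Dual-RL ratio estimate (NOT the paper's method).

    Assigns weight zero to transitions whose reward is at or below the
    mean reward of the current subset.
    """
    return np.maximum(rewards - rewards.mean(), 0.0)


def iterative_filter(rewards, estimate_ratio, num_iters=2):
    """Repeatedly estimate a ratio on the kept subset and drop zero-weight items.

    Returns the indices (into the original dataset) that survive and the
    ratio weights of those transitions from the last iteration.
    """
    keep = np.arange(len(rewards))
    weights = np.ones(len(rewards))
    for _ in range(num_iters):
        ratio = estimate_ratio(rewards[keep])
        mask = ratio > 0
        if not mask.any():
            # Nothing would survive; stop and keep the previous subset.
            break
        keep = keep[mask]
        weights = ratio[mask]
    return keep, weights


def weighted_bc_loss(neg_log_probs, weights):
    """Weighted behavior cloning loss: weighted mean of per-transition NLLs."""
    return float(np.sum(weights * neg_log_probs) / np.sum(weights))


rewards = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
keep, weights = iterative_filter(rewards, toy_ratio_estimator, num_iters=2)
```

In this toy run the first pass keeps the four highest-reward transitions and the second pass keeps the top two, so each iteration re-estimates the weights relative to an already improved subset rather than the original behavior data.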
