Offline Reinforcement Learning with Implicit Q-Learning

  • 2021-10-12 17:05:05
  • Ilya Kostrikov, Ashvin Nair, Sergey Levine

Abstract

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state-conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance when fine-tuned with online interaction after offline initialization.
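The abstract does not spell out the losses, so the following is only a minimal sketch of its two key ingredients, not the authors' implementation. It assumes the usual definition of an expectile as the minimizer of an asymmetric squared loss, and it assumes exponential advantage weights for the behavioral-cloning step. The function names and the parameters tau, beta, and max_weight are illustrative choices, not values taken from the text above.

```python
import math


def expectile_loss(diff, tau):
    """Asymmetric squared loss: weight tau when diff >= 0, else 1 - tau."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def expectile(samples, tau, iters=200):
    """Return the tau-expectile of samples, for 0 < tau < 1.

    This is the value v that minimizes the mean of
    expectile_loss(x - v, tau). The first-order condition makes v a
    weighted mean of the samples, so we iterate that fixed point.
    """
    v = sum(samples) / len(samples)
    for _ in range(iters):
        weights = [tau if x >= v else 1.0 - tau for x in samples]
        v = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return v


def advantage_weights(q_values, v, beta=3.0, max_weight=100.0):
    """Weights exp(beta * (q - v)) for weighted behavioral cloning.

    beta and the clipping at max_weight are assumptions made for this
    sketch. The exponent is clipped before exp() to avoid overflow.
    """
    cap = math.log(max_weight)
    return [math.exp(min(beta * (q - v), cap)) for q in q_values]


# Hypothetical Q-values of four dataset actions in one state.
q_in_state = [0.0, 1.0, 2.0, 10.0]

v_mean = expectile(q_in_state, tau=0.5)   # tau = 0.5 recovers the mean
v_upper = expectile(q_in_state, tau=0.9)  # leans toward the best action

# Actions with higher advantage get larger cloning weights.
weights = advantage_weights(q_in_state, v_upper, beta=1.0)
```

With tau = 0.5 the expectile is the ordinary mean of the sample values, and raising tau toward 1 moves it toward the largest value seen in the data. This is the sense in which an upper expectile can stand in for the value of the best in-dataset action without querying any action that the data does not contain.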

 
