Abstract
We present flow Q-learning (FQL), a simple and performant offline reinforcement learning (RL) method that leverages an expressive flow-matching policy to model arbitrarily complex action distributions in data. Training a flow policy with RL is a tricky problem, due to the iterative nature of the action generation process. We address this challenge by training an expressive one-step policy with RL, rather than directly guiding an iterative flow policy to maximize values. This way, we can completely avoid unstable recursive backpropagation, eliminate costly iterative action generation at test time, yet still mostly maintain expressivity. We experimentally show that FQL leads to strong performance across 73 challenging state- and pixel-based OGBench and D4RL tasks in offline RL and offline-to-online RL. Project page: https://seohong.me/projects/fql/
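
To make the contrast between iterative and one-step action generation concrete, the following is a minimal toy sketch, not the FQL implementation. All names (`CountingNet`, `toy_velocity`, `toy_one_step`, `sample_iterative`, `sample_one_step`) are hypothetical, the "networks" are closed-form stand-ins rather than learned models, and the training objective is not shown. It only illustrates that sampling from a flow policy by Euler integration needs one network evaluation per step, whereas a one-step policy needs a single evaluation.

```python
import numpy as np


class CountingNet:
    """Wraps a function and counts how many times it is evaluated."""

    def __init__(self, fn):
        self.fn = fn
        self.num_calls = 0

    def __call__(self, *args):
        self.num_calls += 1
        return self.fn(*args)


def toy_target(state):
    # Assumption: a fixed, closed-form "dataset action" for each state.
    return np.tanh(state)


def toy_velocity(state, action, t):
    # Toy velocity field v(s, a_t, t) that transports a_t to toy_target(s) at t = 1.
    # A real flow policy would be a neural network trained by flow matching.
    return (toy_target(state) - action) / (1.0 - t)


def toy_one_step(state, noise):
    # Toy one-step policy: maps (state, noise) to an action in one evaluation.
    # Here it ignores the noise; a learned one-step policy generally would not.
    return toy_target(state) + 0.0 * noise


def sample_iterative(velocity_net, state, noise, num_steps=10):
    """Generate an action by Euler-integrating the velocity field from t=0 to t=1."""
    action = noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        action = action + dt * velocity_net(state, action, t)
    return action


def sample_one_step(one_step_net, state, noise):
    """Generate an action with a single evaluation."""
    return one_step_net(state, noise)


rng = np.random.default_rng(0)
state = rng.normal(size=4)
noise = rng.normal(size=4)

flow_net = CountingNet(toy_velocity)
one_step_net = CountingNet(toy_one_step)

a_iter = sample_iterative(flow_net, state, noise, num_steps=10)
a_one = sample_one_step(one_step_net, state, noise)

print("iterative evaluations:", flow_net.num_calls)
print("one-step evaluations:", one_step_net.num_calls)
```

In this toy setting, differentiating a value function through `sample_iterative` would require backpropagating through every Euler step, while `sample_one_step` involves a single evaluation. That is the cost the abstract refers to as recursive backpropagation.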