floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

Abstract

A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
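
To make the mechanism described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes details the abstract does not state: a scalar flow state integrated with Euler steps from t=0 to t=1, a uniform initial sample, and a linear-interpolant flow-matching regression target. The toy linear velocity field and the names `velocity`, `integrate_q`, and `td_flow_matching_loss` are hypothetical and chosen only for illustration.

```python
import numpy as np


def velocity(params, z, t, s, a):
    """Hypothetical toy velocity field, linear in [z, t, s, a, 1].

    A real critic would use a neural network here; this stand-in only
    keeps the sketch self-contained.
    """
    feats = np.concatenate([[z, t], s, a, [1.0]])
    return float(params @ feats)


def integrate_q(params, s, a, z0, num_steps):
    """Euler-integrate dz/dt = v(z, t, s, a) over [0, 1]; z(1) is read as Q(s, a)."""
    z = z0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity(params, z, k * dt, s, a)
    return z


def td_flow_matching_loss(params, target_params, s, a, r, s_next, a_next,
                          gamma, num_steps, rng):
    """One-sample loss: flow-matching regression toward a bootstrapped TD target."""
    # Assumption: initial flow states are drawn uniformly from [-1, 1].
    z0 = rng.uniform(-1.0, 1.0)
    z0_next = rng.uniform(-1.0, 1.0)

    # TD target: bootstrap from the value obtained by integrating the
    # target velocity field for num_steps steps at the next state-action.
    y = r + gamma * integrate_q(target_params, s_next, a_next, z0_next, num_steps)

    # Assumption: linear interpolant between z0 and the target y, whose
    # time derivative (y - z0) is the regression target for the velocity.
    t = rng.uniform()
    z_t = (1.0 - t) * z0 + t * y
    pred = velocity(params, z_t, t, s, a)
    return (pred - (y - z0)) ** 2
```

In this sketch, `num_steps` plays the role of the number of integration steps: raising it spends more compute per Q-value evaluation without changing the parameter count of the velocity field.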
