Abstract
Training reinforcement learning (RL) agents on robotic tasks typically requires a large number of training samples. This is because training data often consists of noisy trajectories, whether from exploration or human-collected demonstrations, making it difficult to learn value functions that understand the effect of taking each action. On the other hand, recent behavior-cloning (BC) approaches have shown that predicting a sequence of actions enables policies to effectively approximate noisy, multi-modal distributions of expert demonstrations. Can we use a similar idea for improving RL on robotic tasks? In this paper, we introduce a novel RL algorithm that learns a critic network that outputs Q-values over a sequence of actions. By explicitly training the value functions to learn the consequence of executing a series of current and future actions, our algorithm allows for learning useful value functions from noisy trajectories. We study our algorithm across various setups with sparse and dense rewards, and with or without demonstrations, spanning mobile bi-manual manipulation, whole-body control, and tabletop manipulation tasks from BiGym, HumanoidBench, and RLBench. We find that, by learning the critic network with action sequences, our algorithm outperforms various RL and BC baselines, in particular on challenging humanoid control tasks.
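
As an illustration only, the interface of a critic that scores a whole action sequence, rather than a single action, can be sketched as follows. This is a toy linear model under stated assumptions, not the network or training objective used in this work: the class name `LinearSequenceCritic`, the helper `flatten_inputs`, and all dimensions are hypothetical and chosen purely to show the shape of the inputs and output.

```python
import random
from typing import List, Sequence


def flatten_inputs(obs: Sequence[float],
                   action_seq: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the observation with every action in the sequence."""
    features = list(obs)
    for action in action_seq:
        features.extend(action)
    return features


class LinearSequenceCritic:
    """Toy critic (hypothetical): Q(o, a_t, ..., a_{t+K-1}) as a linear map.

    A real critic would be a trained neural network; the linear form here
    only illustrates that one scalar Q-value is produced for a full
    sequence of K actions.
    """

    def __init__(self, obs_dim: int, action_dim: int, seq_len: int,
                 seed: int = 0) -> None:
        rng = random.Random(seed)
        self.seq_len = seq_len
        n_inputs = obs_dim + action_dim * seq_len
        # Small random weights stand in for learned parameters.
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
        self.bias = 0.0

    def q_value(self, obs: Sequence[float],
                action_seq: Sequence[Sequence[float]]) -> float:
        """Return a single Q-value for the observation and action sequence."""
        if len(action_seq) != self.seq_len:
            raise ValueError("action sequence has the wrong length")
        x = flatten_inputs(obs, action_seq)
        if len(x) != len(self.weights):
            raise ValueError("input dimensions do not match the critic")
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias


# Example: a 3-dimensional observation and a sequence of 4 two-dimensional actions.
critic = LinearSequenceCritic(obs_dim=3, action_dim=2, seq_len=4)
obs = [0.1, -0.2, 0.3]
action_seq = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
q = critic.q_value(obs, action_seq)
```

In contrast to a standard critic Q(o, a), the input here grows with the sequence length K, so the value estimate depends on the current action together with the future actions that follow it.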