Abstract
While behavior learning has made impressive progress in recent times, it lags behind computer vision and natural language processing due to its inability to leverage large, human-generated datasets. Human behaviors have wide variance and multiple modes, and human demonstrations typically do not come with reward labels. These properties limit the applicability of current methods in Offline RL and Behavioral Cloning to learn from large, pre-collected datasets. In this work, we present Behavior Transformer (BeT), a new technique to model unlabeled demonstration data with multiple modes. BeT retrofits standard transformer architectures with action discretization coupled with a multi-task action correction inspired by offset prediction in object detection. This allows us to leverage the multi-modal modeling ability of modern transformers to predict multi-modal continuous actions. We experimentally evaluate BeT on a variety of robotic manipulation and self-driving behavior datasets. We show that BeT significantly improves over prior state-of-the-art work on solving demonstrated tasks while capturing the major modes present in the pre-collected datasets. Finally, through an extensive ablation study, we analyze the importance of every crucial component in BeT. Videos of behavior generated by BeT are available at https://notmahi.github.io/bet
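
To make the idea of action discretization with an offset correction concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that bin centers are obtained with a simple k-means over the demonstrated actions (the abstract does not specify the binning method), and all function names (`nearest_bin`, `fit_centers`, `encode`, `decode`) are hypothetical. It shows only how a continuous action could be split into a discrete bin index plus a continuous residual and then recovered; the transformer that would predict these two quantities is omitted.

```python
import numpy as np

def nearest_bin(actions, centers):
    # Index of the closest center for every action (Euclidean distance).
    dists = np.linalg.norm(actions[:, None, :] - centers[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def fit_centers(actions, k, iters=20, seed=0):
    # Assumption: plain k-means (Lloyd's algorithm) picks the bin centers.
    rng = np.random.default_rng(seed)
    centers = actions[rng.choice(len(actions), size=k, replace=False)].copy()
    for _ in range(iters):
        bins = nearest_bin(actions, centers)
        for j in range(k):
            members = actions[bins == j]
            if len(members) > 0:  # leave a center unchanged if its bin is empty
                centers[j] = members.mean(axis=0)
    return centers

def encode(actions, centers):
    # Continuous action -> (discrete bin index, continuous residual offset).
    bins = nearest_bin(actions, centers)
    offsets = actions - centers[bins]
    return bins, offsets

def decode(bins, offsets, centers):
    # Bin center plus offset recovers the continuous action.
    return centers[bins] + offsets

# Toy bimodal "demonstrations": 2-D actions clustered around two modes.
rng = np.random.default_rng(0)
actions = np.concatenate([
    rng.normal(loc=[-1.0, 0.0], scale=0.05, size=(100, 2)),
    rng.normal(loc=[1.0, 0.0], scale=0.05, size=(100, 2)),
])
centers = fit_centers(actions, k=2)
bins, offsets = encode(actions, centers)
reconstructed = decode(bins, offsets, centers)
```

In this sketch the bin index is a categorical target, which is where a multi-modal predictor can place probability mass on several distinct behaviors, while the offset restores the precision lost by discretization.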