FAST: Efficient Action Tokenization for Vision-Language-Action Models

Abstract

Autoregressive sequence models, such as Transformer-based vision-language-action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
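To make the contrast between the two tokenization families concrete, the following is a minimal sketch, not the FAST implementation. It compares per-dimension, per-timestep binning with quantized discrete cosine transform (DCT) coefficients on a smooth synthetic action chunk. The function names, the bin count, the quantization scale, the chunk shape, and the assumption that actions are normalized to [-1, 1] are all illustrative choices that do not come from the abstract.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)  # rescale the constant row so the basis is orthonormal
    return m

def bin_tokens(actions, n_bins=256):
    """Baseline: one token per dimension per timestep (assumes actions in [-1, 1])."""
    scaled = (np.clip(actions, -1.0, 1.0) + 1.0) / 2.0 * (n_bins - 1)
    return np.rint(scaled).astype(int)

def dct_quantize(actions, scale=10.0):
    """Apply the DCT along time for each dimension, then round scaled coefficients.

    `scale` is a hypothetical knob: larger values keep more detail.
    """
    return np.rint(scale * (dct_matrix(actions.shape[0]) @ actions)).astype(int)

def dct_dequantize(coeffs, scale=10.0):
    """Invert dct_quantize up to rounding error (the basis is orthonormal)."""
    return dct_matrix(coeffs.shape[0]).T @ (coeffs / scale)

# Synthetic smooth chunk: 50 timesteps, 2 action dimensions (illustrative only).
t = np.linspace(0.0, 1.0, 50)
actions = np.stack([np.sin(2 * np.pi * t), 0.5 * np.cos(np.pi * t)], axis=1)

tokens = bin_tokens(actions)      # always timesteps * dimensions symbols
coeffs = dct_quantize(actions)    # mostly zeros for smooth signals
recon = dct_dequantize(coeffs)

print("binning tokens:", tokens.size)
print("nonzero DCT coefficients:", np.count_nonzero(coeffs))
print("max reconstruction error:", np.abs(recon - actions).max())
```

For smooth, high-frequency-sampled trajectories, most quantized coefficients are zero, so the signal can be described with far fewer informative symbols than binning produces. How those coefficients are then turned into a final token sequence is not specified in the abstract and is left out of this sketch.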
