Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can we still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized observation and action space provides a strong structural prior for efficiently learning 6-DoF policies. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
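To make the idea of a shared voxelized observation and action space concrete, the following is a minimal pure-Python sketch, not PerAct's implementation. It bins 3D points into a cubic grid, picks the highest-scoring voxel, and maps that voxel back to metric coordinates. The function names (`voxelize`, `best_voxel`, `voxel_center`), the workspace bounds, and the grid size are all hypothetical, and the per-voxel point counts are only a stand-in for the scores a learned model would produce. Only the translation part of an action is shown; rotation and other action components are omitted.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(points: List[Point], lo: Point, hi: Point, size: int) -> Dict[Voxel, int]:
    """Count points per voxel in a size^3 grid spanning the box [lo, hi]."""
    grid: Dict[Voxel, int] = {}
    for p in points:
        if any(c < l or c > h for c, l, h in zip(p, lo, hi)):
            continue  # skip points outside the (assumed) workspace bounds
        idx = tuple(
            min(int((c - l) / (h - l) * size), size - 1)  # clamp the upper edge
            for c, l, h in zip(p, lo, hi)
        )
        grid[idx] = grid.get(idx, 0) + 1
    return grid


def best_voxel(scores: Dict[Voxel, float]) -> Voxel:
    """Return the voxel with the highest score (the 'detected' voxel)."""
    return max(scores, key=scores.get)


def voxel_center(idx: Voxel, lo: Point, hi: Point, size: int) -> Point:
    """Map a voxel index back to the metric coordinates of its center."""
    return tuple(l + (i + 0.5) * (h - l) / size for i, l, h in zip(idx, lo, hi))


# Hypothetical workspace: a unit cube divided into 10 x 10 x 10 voxels.
LO, HI, SIZE = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 10
cloud = [(0.15, 0.15, 0.15), (0.16, 0.17, 0.18), (0.95, 0.95, 0.95)]

occupancy = voxelize(cloud, LO, HI, SIZE)
# Occupancy counts stand in for learned per-voxel scores (an assumption).
target = best_voxel(occupancy)
target_xyz = voxel_center(target, LO, HI, SIZE)
```

Because observations and actions share the same grid, choosing an action reduces to a classification over voxels, and the chosen index converts directly into a 3D target position.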