The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation.
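
The querying idea can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes single-head attention with no learned projections, residual connections, normalization, or MLP blocks, and every name in it (`attend`, `perceiver_io_sketch`, `random_matrix`, the sizes M, N, O) is hypothetical. It only shows how a fixed-size latent array is read from the inputs, processed independently of data size, and then queried to produce an output with one row per query.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attend(queries, context):
    """Scaled dot-product attention (assumption: no learned projections).

    Returns one row per query, regardless of how many rows `context` has.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(context))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), context)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def perceiver_io_sketch(inputs, latents, output_queries, depth):
    # Encode: N latents attend to M inputs, so the cost grows as M * N.
    z = attend(latents, inputs)
    # Process: latent self-attention costs N * N per layer and does not
    # depend on M or O, which is what decouples depth from data size.
    for _ in range(depth):
        z = attend(z, z)
    # Decode: O output queries attend to N latents, so the cost grows as O * N.
    return attend(output_queries, z)


rng = random.Random(0)
D = 8                                  # channel width (illustrative)
inputs = random_matrix(50, D, rng)     # M = 50 input elements
latents = random_matrix(4, D, rng)     # N = 4 latent vectors
queries = random_matrix(7, D, rng)     # O = 7 output queries
outputs = perceiver_io_sketch(inputs, latents, queries, depth=2)
print(len(outputs), len(outputs[0]))   # 7 8
```

Changing the number of rows in `queries` changes the number of output rows without touching the latent stack, which is the sense in which the output size is arbitrary in this sketch.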