Abstract
Biological intelligence systems of animals perceive the world by integrating information from different modalities and processing it simultaneously for various tasks. In contrast, current machine learning research follows a task-specific paradigm, leading to inefficient collaboration between tasks and high marginal costs of developing perception models for new tasks. In this paper, we present a generic perception architecture named Uni-Perceiver, which processes a variety of modalities and tasks with unified modeling and shared parameters. Specifically, Uni-Perceiver encodes different task inputs and targets from arbitrary modalities into a unified representation space with a modality-agnostic Transformer encoder and lightweight modality-specific tokenizers. Different perception tasks are modeled with the same formulation, that is, finding the maximum likelihood target for each input through the similarity of their representations. The model is pre-trained on several uni-modal and multi-modal tasks, and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage. Results show that our pre-trained model, without any tuning, can achieve reasonable performance even on novel tasks. The performance can be improved to a level close to state-of-the-art methods by conducting prompt tuning on 1% of downstream task data. Full-data fine-tuning further delivers results on par with or better than state-of-the-art results. Code will be released.
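
As an illustration of the unified formulation, the following minimal sketch selects the maximum likelihood target for one input by comparing representations. It is not the released implementation: the use of cosine similarity, the softmax with a `temperature` parameter, the function names, and the toy vectors are all assumptions made for this example, and the representations are supplied directly rather than produced by the tokenizers and Transformer encoder.

```python
import numpy as np


def l2_normalize(v, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(v, axis=axis, keepdims=True)
    return v / np.maximum(norm, eps)


def target_probabilities(input_repr, target_reprs, temperature=0.1):
    """Return a probability for each candidate target given one input.

    Assumption: likelihood is a softmax over cosine similarities divided by
    a temperature. The abstract only states that similarity is used.
    """
    x = l2_normalize(np.asarray(input_repr, dtype=float))    # shape (d,)
    y = l2_normalize(np.asarray(target_reprs, dtype=float))  # shape (n, d)
    logits = (y @ x) / temperature                           # shape (n,)
    logits = logits - logits.max()                           # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def predict_target(input_repr, target_reprs, temperature=0.1):
    """Index of the maximum likelihood target for the given input."""
    return int(np.argmax(target_probabilities(input_repr, target_reprs, temperature)))


# Toy representations (hypothetical values standing in for encoder outputs).
input_repr = np.array([1.0, 0.0, 0.0])
target_reprs = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the input
    [0.9, 0.1, 0.0],   # nearly aligned with the input
    [-1.0, 0.0, 0.0],  # opposite to the input
])

probs = target_probabilities(input_repr, target_reprs)
best = predict_target(input_repr, target_reprs)
```

Under this formulation, a classification task and a retrieval task differ only in what the candidate targets are (for example, class-name text versus images), which is what allows a single set of shared parameters to serve both.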