Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

  • 2021-12-02 18:59:50
  • Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai
Biological intelligence systems of animals perceive the world by integrating information in different modalities and processing simultaneously for various tasks. In contrast, current machine learning research follows a task-specific paradigm, leading to inefficient collaboration between tasks and high marginal costs of developing perception models for new tasks. In this paper, we present a generic perception architecture named Uni-Perceiver, which processes a variety of modalities and tasks with unified modeling and shared parameters. Specifically, Uni-Perceiver encodes different task inputs and targets from arbitrary modalities into a unified representation space with a modality-agnostic Transformer encoder and lightweight modality-specific tokenizers. Different perception tasks are modeled as the same formulation, that is, finding the maximum likelihood target for each input through the similarity of their representations. The model is pre-trained on several uni-modal and multi-modal tasks, and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage. Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks. The performance can be improved to a level close to state-of-the-art methods by conducting prompt tuning on 1% of downstream task data. Full-data fine-tuning further delivers results on par with or better than state-of-the-art results. Code shall be released.
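
The unified formulation in the abstract (choosing the maximum likelihood target for an input through the similarity of their representations) can be illustrated with a minimal sketch. This is not the authors' released code. The function names, the use of cosine similarity, the softmax normalization, and the `temperature` value are all assumptions made for illustration, and the vectors stand in for representations that a shared encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def target_probabilities(input_repr, target_reprs, temperature=0.1):
    """Softmax over similarity scores between one input and all candidate targets.

    `temperature` is a hypothetical scaling parameter, not a value from the paper.
    """
    logits = [cosine(input_repr, t) / temperature for t in target_reprs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(input_repr, target_reprs, temperature=0.1):
    """Index of the candidate target with the highest probability."""
    probs = target_probabilities(input_repr, target_reprs, temperature)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy representations: in a real system both would come from the same encoder,
# e.g. an image on the input side and class-name text on the target side.
image_repr = [0.9, 0.1, 0.0]
class_reprs = [
    [1.0, 0.0, 0.0],  # candidate target 0
    [0.0, 1.0, 0.0],  # candidate target 1
    [0.0, 0.0, 1.0],  # candidate target 2
]
best = predict(image_repr, class_reprs)
```

Under this framing, classification, retrieval, and similar tasks differ only in what the candidate target set contains, which is how one set of shared parameters can serve tasks that were not seen during pre-training.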


