Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

  • 2021-12-02 18:59:50
  • Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai


Biological intelligence systems of animals perceive the world by integrating information in different modalities and processing simultaneously for various tasks. In contrast, current machine learning research follows a task-specific paradigm, leading to inefficient collaboration between tasks and high marginal costs of developing perception models for new tasks. In this paper, we present a generic perception architecture named Uni-Perceiver, which processes a variety of modalities and tasks with unified modeling and shared parameters. Specifically, Uni-Perceiver encodes different task inputs and targets from arbitrary modalities into a unified representation space with a modality-agnostic Transformer encoder and lightweight modality-specific tokenizers. Different perception tasks are modeled as the same formulation, that is, finding the maximum likelihood target for each input through the similarity of their representations. The model is pre-trained on several uni-modal and multi-modal tasks, and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage. Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks. The performance can be improved to a level close to state-of-the-art methods by conducting prompt tuning on 1% of downstream task data. Full-data fine-tuning further delivers results on par with or better than state-of-the-art results. Code shall be released.
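
The shared formulation described in the abstract, choosing the most likely target for an input by comparing representations, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says "similarity of their representations", so the cosine similarity, the softmax normalization, the `temperature` value, and the toy vectors below are all assumptions made for illustration, and the function names are hypothetical. In the real model the vectors would come from the shared Transformer encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def target_likelihoods(input_repr, target_reprs, temperature=0.1):
    """Turn input-to-target similarities into a probability distribution.

    The softmax and the temperature are illustrative assumptions, not
    details taken from the abstract.
    """
    scores = [cosine(input_repr, t) / temperature for t in target_reprs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(input_repr, target_reprs, temperature=0.1):
    """Return the index of the maximum-likelihood target."""
    probs = target_likelihoods(input_repr, target_reprs, temperature)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy stand-ins for encoder outputs: one input and three candidate targets
# (for example, an image and three class-name embeddings).
input_repr = [0.9, 0.1, 0.0]
target_reprs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
best = predict(input_repr, target_reprs)
```

Because both inputs and targets live in one representation space, the same `predict` routine would apply whether the candidates are class labels, captions, or other targets; only the candidate set changes.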

