Abstract
We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to language understanding and multimodal reasoning. Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task. Compared to previous efforts on multi-task learning with transformers, we share the same model parameters across all tasks instead of separately fine-tuning task-specific models, and we handle a much wider variety of tasks across different domains. In our experiments, we learn 7 tasks jointly over 8 datasets, achieving comparable performance to well-established prior work on each domain under the same supervision with a compact set of model parameters. Code will be released in MMF at https://mmf.sh.
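The data flow described above (one encoder per input modality, a single decoder shared by all tasks, and a task-specific output head) can be illustrated with a minimal structural sketch. This is not the UniT implementation: the encoders, decoder, and heads below are toy stand-ins for transformer modules, and all names (`UnifiedModelSketch`, `encode_image`, `encode_text`, `shared_decoder`, and the task labels) are hypothetical, chosen only to show how parameters are routed and shared.

```python
from typing import Callable, Dict, List

Vector = List[float]


def encode_image(pixels: List[List[float]]) -> List[Vector]:
    # Toy stand-in for an image encoder: one 1-d feature (row mean) per pixel row.
    return [[sum(row) / len(row)] for row in pixels]


def encode_text(tokens: List[str]) -> List[Vector]:
    # Toy stand-in for a text encoder: one 1-d feature (token length) per token.
    return [[float(len(tok))] for tok in tokens]


def shared_decoder(encoded: List[Vector]) -> Vector:
    # Toy stand-in for the shared decoder: mean-pool over all encoded vectors.
    dim = len(encoded[0])
    return [sum(vec[i] for vec in encoded) / len(encoded) for i in range(dim)]


class UnifiedModelSketch:
    """Hypothetical wrapper: per-modality encoders, one decoder, per-task heads."""

    def __init__(
        self,
        encoders: Dict[str, Callable[..., List[Vector]]],
        decoder: Callable[[List[Vector]], Vector],
        heads: Dict[str, Callable[[Vector], object]],
    ) -> None:
        self.encoders = encoders
        self.decoder = decoder  # the same decoder object serves every task
        self.heads = heads

    def forward(self, task: str, inputs: Dict[str, object]) -> object:
        # Encode each modality present in the input and concatenate the results.
        encoded: List[Vector] = []
        for modality, raw in inputs.items():
            encoded.extend(self.encoders[modality](raw))
        # Decode with the shared decoder, then apply the head for this task.
        hidden = self.decoder(encoded)
        return self.heads[task](hidden)


model = UnifiedModelSketch(
    encoders={"image": encode_image, "text": encode_text},
    decoder=shared_decoder,
    heads={
        # Assumed toy heads; real heads would predict boxes, classes, answers.
        "detection": lambda h: {"score": h[0]},
        "sentiment": lambda h: "long" if h[0] > 3 else "short",
        "vqa": lambda h: round(h[0], 2),
    },
)

detection_out = model.forward("detection", {"image": [[0.0, 1.0], [1.0, 1.0]]})
sentiment_out = model.forward("sentiment", {"text": ["unified", "model"]})
vqa_out = model.forward("vqa", {"image": [[2.0, 2.0]], "text": ["what"]})
```

In this sketch a vision-only task, a language-only task, and a multimodal task all pass through the same `shared_decoder`; only the encoders used and the final head differ. In a trainable version, joint training would sum the per-task losses and update the shared parameters from all of them.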