You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

  • 2021-06-01 17:54:09
  • Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu

Abstract

Can Transformer perform 2D object-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the 2D spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the naïve Vision Transformer with the fewest possible modifications as well as inductive biases. We find that YOLOS pre-trained on the mid-sized ImageNet-1k dataset only can already achieve competitive object detection performance on COCO, e.g., YOLOS-Base directly adopted from BERT-Base can achieve 42.0 box AP. We also discuss the impacts as well as limitations of current pre-train schemes and model scaling strategies for Transformer in vision through object detection. Code and model weights are available at https://github.com/hustvl/YOLOS.

 
