Vision Transformers Are Good Mask Auto-Labelers

  • 2023-01-10 18:59:00
  • Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar
  • 41


We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of their fully-supervised counterparts, retaining up to 97.4% of the performance of fully supervised models. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.
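To make the "box-cropped images in, mask pseudo-labels out" data flow concrete, here is a minimal sketch in plain Python. It is not the MAL implementation: the `(x0, y0, x1, y1)` box convention, the helper names `crop_to_box` and `paste_mask`, the step that pastes the crop-level mask back onto a full-size canvas, and the thresholding `placeholder_labeler` that stands in for the Transformer-based mask generator are all assumptions made for illustration.

```python
from typing import List, Tuple

Grid = List[List[int]]
# Assumed box convention: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def crop_to_box(image: Grid, box: Box) -> Grid:
    """Return the sub-image covered by the box, clipped to the image bounds."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def paste_mask(mask_crop: Grid, box: Box, height: int, width: int) -> Grid:
    """Place a crop-sized binary mask onto an empty full-size canvas."""
    x0, y0 = max(0, box[0]), max(0, box[1])
    canvas = [[0] * width for _ in range(height)]
    for dy, row in enumerate(mask_crop):
        for dx, value in enumerate(row):
            canvas[y0 + dy][x0 + dx] = value
    return canvas


def placeholder_labeler(crop: Grid) -> Grid:
    """Hypothetical stand-in for the mask generator: marks pixels above the crop mean."""
    pixels = [p for row in crop for p in row]
    mean = sum(pixels) / len(pixels)
    return [[1 if p > mean else 0 for p in row] for row in crop]


# Toy 4x5 grayscale image with one bright object inside the annotated box.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 9, 9, 0],
    [0, 1, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
box = (1, 1, 4, 3)

crop = crop_to_box(image, box)
mask = paste_mask(placeholder_labeler(crop), box, len(image), len(image[0]))
```

In this sketch the labeler only ever sees the pixels inside the box, so every predicted mask pixel is guaranteed to lie within the box annotation; the full-size `mask` is what a downstream instance segmentation model would consume as a pseudo-label.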


