Query2Label: A Simple Transformer Way to Multi-Label Classification

Abstract

This paper presents a simple and effective approach to solving the multi-label classification problem. The proposed approach leverages Transformer decoders to query the existence of a class label. The use of Transformers is rooted in the need to extract local discriminative features adaptively for different labels, which is a strongly desired property due to the existence of multiple objects in one image. The built-in cross-attention module in the Transformer decoder offers an effective way to use label embeddings as queries to probe and pool class-related features from a feature map computed by a vision backbone for subsequent binary classifications. Compared with prior works, the new framework is simple, using standard Transformers and vision backbones, and effective, consistently outperforming all previous works on five multi-label classification data sets, including MS-COCO, PASCAL VOC, NUS-WIDE, and Visual Genome. In particular, we establish $91.3\%$ mAP on MS-COCO. We hope its compact structure, simple implementation, and superior performance will serve as a strong baseline for multi-label classification tasks and future studies. The code will be available soon at https://github.com/SlongLiu/query2labels.
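
To make the querying idea concrete, the following is a minimal sketch of label embeddings acting as queries over a feature map. It is not the authors' implementation: it assumes a single attention head, no learned projections, no feed-forward or normalization layers, and random toy values standing in for learned parameters and backbone features. The names `query_labels`, `label_queries`, `weights`, and `biases` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_labels(label_queries, features, weights, biases):
    """Single-head cross-attention pooling plus one binary classifier per label.

    label_queries: K vectors of size d, one embedding per label (assumed learnable)
    features:      N vectors of size d, the flattened positions of a feature map
    weights, biases: per-label linear classifier parameters (assumed learnable)
    Returns (probabilities, attention_maps).
    """
    d = len(label_queries[0])
    probs, attn_maps = [], []
    for query, w, b in zip(label_queries, weights, biases):
        # Attention of this label's query over all spatial positions.
        attn = softmax([dot(query, f) / math.sqrt(d) for f in features])
        # Label-specific feature: attention-weighted sum of position features.
        pooled = [sum(a * f[j] for a, f in zip(attn, features)) for j in range(d)]
        # Independent binary decision for this label.
        logit = dot(w, pooled) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))
        attn_maps.append(attn)
    return probs, attn_maps


def rand_vec(size):
    """Random toy vector standing in for learned or extracted values."""
    return [random.uniform(-1.0, 1.0) for _ in range(size)]


random.seed(0)
K, N, d = 3, 6, 4  # labels, spatial positions, feature dimension (toy sizes)
label_queries = [rand_vec(d) for _ in range(K)]
features = [rand_vec(d) for _ in range(N)]
weights = [rand_vec(d) for _ in range(K)]
biases = [0.0] * K

probs, attn_maps = query_labels(label_queries, features, weights, biases)
```

Each label gets its own attention map over spatial positions, so different labels can pool features from different image regions before their independent sigmoid classifiers are applied. A full Transformer decoder would add multiple heads, learned projections, and stacked layers on top of this pooling step.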
