Abstract
Recently, the vision transformer (ViT) architecture, whose backbone consists purely of self-attention mechanisms, has achieved very promising performance in visual classification. However, the high performance of the original ViT depends heavily on pretraining with ultra-large-scale datasets, and it significantly underperforms on ImageNet-1K if trained from scratch. This paper makes an effort toward addressing this problem by carefully considering the role of visual tokens. First, for the classification head, the existing ViT exploits only the class token while entirely neglecting the rich semantic information inherent in high-level visual tokens. Therefore, we propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification. Meanwhile, a fast singular value power normalization is proposed for improving the second-order pooling. Second, the original ViT employs a naive embedding of fixed-size image patches, lacking the ability to model translation equivariance and locality. To alleviate this problem, we develop a lightweight, hierarchical module based on off-the-shelf convolutions for visual token embedding. The proposed architecture, which we call So-ViT, is thoroughly evaluated on ImageNet-1K. The results show that our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models. Code is available at https://github.com/jiangtaoxie/So-ViT
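To make the proposed classification head more concrete, the following NumPy sketch illustrates one possible reading of it: visual tokens are projected into two low-dimensional views, their cross-covariance matrix is computed, its singular values are raised to a power, and the resulting second-order logits are added to the class-token logits. Everything here is an assumption for illustration rather than the released implementation: the function names, the projection matrices `w_a` and `w_b`, the exponent `alpha = 0.5`, the summation of the two sets of logits, and the use of an exact SVD in place of the fast normalization mentioned above.

```python
import numpy as np

def cross_covariance_pool(tokens, w_a, w_b):
    """Cross-covariance of two linear projections of the visual tokens.

    tokens: (n_tokens, dim); w_a: (dim, p); w_b: (dim, q). Returns (p, q).
    The two projections are hypothetical; the abstract does not specify them.
    """
    a = tokens @ w_a
    b = tokens @ w_b
    # Center each view over the token axis before taking the covariance.
    a = a - a.mean(axis=0, keepdims=True)
    b = b - b.mean(axis=0, keepdims=True)
    return a.T @ b / tokens.shape[0]

def sv_power_normalize(mat, alpha=0.5):
    """Raise the singular values of `mat` to the power `alpha`.

    Uses an exact SVD as a stand-in for the fast approximation
    described in the abstract (assumption).
    """
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return (u * s**alpha) @ vt

def classify(class_token, tokens, params, alpha=0.5):
    """Combine class-token logits with second-order logits by summation (assumed)."""
    pooled = cross_covariance_pool(tokens, params["w_a"], params["w_b"])
    pooled = sv_power_normalize(pooled, alpha)
    logits_cls = class_token @ params["head_cls"]
    logits_so = pooled.ravel() @ params["head_so"]
    return logits_cls + logits_so

# Toy sizes chosen only for the example.
rng = np.random.default_rng(0)
n_tokens, dim, p, q, n_classes = 196, 64, 16, 16, 10
params = {
    "w_a": rng.standard_normal((dim, p)) / np.sqrt(dim),
    "w_b": rng.standard_normal((dim, q)) / np.sqrt(dim),
    "head_cls": rng.standard_normal((dim, n_classes)) / np.sqrt(dim),
    "head_so": rng.standard_normal((p * q, n_classes)) / np.sqrt(p * q),
}
tokens = rng.standard_normal((n_tokens, dim))
class_token = rng.standard_normal(dim)
logits = classify(class_token, tokens, params)
```

With `alpha = 1` the normalization leaves the pooled matrix unchanged, while smaller exponents compress large singular values relative to small ones, which is the usual motivation for power normalization of second-order statistics.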