Sequencer: Deep LSTM for Image Classification

Abstract

In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention, as found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of the Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, achieves 84.6% top-1 accuracy on ImageNet-1K alone. In addition, we show that it has good transferability and robust resolution adaptability across a double resolution band.
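To make the vertical/horizontal decomposition concrete, here is a minimal, dependency-free sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say whether the LSTMs are bidirectional or how the two outputs are merged, so this sketch uses unidirectional LSTMs and simple per-position concatenation. The names `make_lstm`, `run_lstm`, and `sequencer2d_mix` are hypothetical, and the weights are random rather than trained.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(in_dim, hid_dim, rng):
    """Random weights for one LSTM with four gates (input, forget, cell, output)."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + hid_dim)]
                for _ in range(hid_dim)]
    return {"hid": hid_dim,
            "W": [mat() for _ in range(4)],
            "b": [[0.0] * hid_dim for _ in range(4)]}

def run_lstm(lstm, seq):
    """Run the LSTM over a sequence of vectors; return the hidden state at each step."""
    d = lstm["hid"]
    h, c = [0.0] * d, [0.0] * d
    out = []
    for x in seq:
        z = x + h  # concatenate the input with the previous hidden state
        pre = [[sum(w * v for w, v in zip(row, z)) + b for row, b in zip(W, bias)]
               for W, bias in zip(lstm["W"], lstm["b"])]
        i = [sigmoid(v) for v in pre[0]]    # input gate
        f = [sigmoid(v) for v in pre[1]]    # forget gate
        g = [math.tanh(v) for v in pre[2]]  # candidate cell values
        o = [sigmoid(v) for v in pre[3]]    # output gate
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        out.append(h)
    return out

def sequencer2d_mix(grid, lstm_v, lstm_h):
    """grid is H x W x C nested lists; the result is H x W x (2 * hid)."""
    H, W = len(grid), len(grid[0])
    # Vertical pass: every column is treated as a sequence of length H.
    vert = [run_lstm(lstm_v, [grid[r][col] for r in range(H)]) for col in range(W)]
    # Horizontal pass: every row is treated as a sequence of length W.
    horiz = [run_lstm(lstm_h, grid[r]) for r in range(H)]
    # Assumption: merge the two passes by concatenating features at each position.
    return [[vert[col][r] + horiz[r][col] for col in range(W)] for r in range(H)]

rng = random.Random(0)
H, W, C, D = 3, 4, 5, 2  # toy sizes: a 3 x 4 grid of 5-dim patches, hidden size 2
grid = [[[rng.uniform(-1, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]
out = sequencer2d_mix(grid, make_lstm(C, D, rng), make_lstm(C, D, rng))
print(len(out), len(out[0]), len(out[0][0]))  # prints: 3 4 4
```

Each position ends up with features that have seen its column (from the top) and its row (from the left), which is how two one-dimensional recurrences can cover a two-dimensional patch grid with sequences of length H and W instead of one sequence of length H x W.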
