Sparse-MLP: A Fully-MLP Architecture with Conditional Computation

Abstract

Mixture of Experts (MoE) with sparse conditional computation has proven to be an effective architecture for scaling attention-based models to more parameters at comparable computation cost. In this paper, we propose Sparse-MLP, which scales the recent MLP-Mixer model with sparse MoE layers to achieve a more computation-efficient architecture. We replace a subset of the dense MLP blocks in the MLP-Mixer model with Sparse blocks. In each Sparse block, we apply two stages of MoE layers: one with MLP experts mixing information within channels along the image patch dimension, and one with MLP experts mixing information within patches along the channel dimension. In addition, to reduce the computational cost of routing and improve expert capacity, we design Re-represent layers in each Sparse block. These layers re-scale image representations with two simple but effective linear transformations. When pre-trained on ImageNet-1k with the MoCo v3 algorithm, our models outperform dense MLP models on several downstream image classification tasks while using a comparable number of parameters and less computation.
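
The Sparse block described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the top-1 routing rule, the ReLU activation, the skip connections, the absence of normalization, and all sizes and names (`S1`, `C1`, `moe`, `sparse_block`, `init_params`) are assumptions chosen only to show how two MoE stages and a pair of linear re-scaling maps fit together.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlp(x, w1, w2):
    # Two-layer MLP expert; ReLU is an assumption made for this sketch.
    return np.maximum(x @ w1, 0.0) @ w2

def moe(x, gate_w, experts):
    """Top-1 MoE over the rows of x (n, d); the routing rule is assumed."""
    probs = softmax(x @ gate_w)              # (n, n_experts) routing weights
    choice = probs.argmax(axis=-1)           # one expert index per row
    out = np.zeros_like(x)
    for e, (w1, w2) in enumerate(experts):
        idx = np.where(choice == e)[0]
        if idx.size:                         # only the selected rows are computed
            out[idx] = probs[idx, e:e + 1] * mlp(x[idx], w1, w2)
    return out, choice

def init_params(rng, S, C, S1, C1, n_experts=4, hidden=64):
    def lin(a, b):
        return rng.normal(0.0, 0.02, (a, b))
    def experts(d):
        return [(lin(d, hidden), lin(hidden, d)) for _ in range(n_experts)]
    return {
        "in_s": lin(S1, S), "in_c": lin(C, C1),      # hypothetical re-scaling in
        "out_s": lin(S, S1), "out_c": lin(C1, C),    # hypothetical re-scaling out
        "gate_s": lin(S1, n_experts), "experts_s": experts(S1),
        "gate_c": lin(C1, n_experts), "experts_c": experts(C1),
    }

def sparse_block(x, p):
    # x: (S, C) = (patches, channels) for a single image.
    h = p["in_s"] @ x @ p["in_c"]            # two linear maps -> (S1, C1)
    # Stage 1: each channel is routed; its expert mixes along the patch dimension.
    y, _ = moe(h.T, p["gate_s"], p["experts_s"])
    h = h + y.T                              # skip connection (assumed)
    # Stage 2: each patch is routed; its expert mixes along the channel dimension.
    y, _ = moe(h, p["gate_c"], p["experts_c"])
    h = h + y                                # skip connection (assumed)
    return p["out_s"] @ h @ p["out_c"]       # two linear maps back to (S, C)

rng = np.random.default_rng(0)
S, C, S1, C1 = 16, 32, 32, 16                # illustrative sizes only
params = init_params(rng, S, C, S1, C1)
x = rng.normal(size=(S, C))
out = sparse_block(x, params)
print(out.shape)                             # (16, 32)
```

In this sketch each routed row is processed by a single expert, so adding experts increases the parameter count without increasing the per-row compute, which is the conditional-computation idea the abstract refers to.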
