Go Wider Instead of Deeper

Abstract

The transformer has recently achieved impressive results on various tasks. To further improve the effectiveness and efficiency of the transformer, there are two trains of thought among existing works: (1) going wider by scaling to more trainable parameters; (2) going shallower by parameter sharing or model compression along the depth. However, larger models usually do not scale well when fewer tokens are available for training, and advanced parallelism is required when the model is extremely large. Smaller models usually achieve inferior performance compared to the original transformer model due to the loss of representation power. In this paper, to achieve better performance with fewer trainable parameters, we propose a framework to deploy trainable parameters efficiently, by going wider instead of deeper. Specifically, we scale along model width by replacing the feed-forward network (FFN) with a mixture-of-experts (MoE). We then share the MoE layers across transformer blocks using individual layer normalization. Such deployment serves to transform various semantic representations, which makes the model more parameter-efficient and effective. To evaluate our framework, we design WideNet and evaluate it on ImageNet-1K. Our best model outperforms Vision Transformer (ViT) by $1.46\%$ with $0.72 \times$ trainable parameters. Using $0.46 \times$ and $0.13 \times$ parameters, our WideNet can still surpass ViT and ViT-MoE by $0.83\%$ and $2.08\%$, respectively.
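
To make the parameter accounting concrete, the following is a minimal sketch of how sharing one MoE layer across all blocks, while keeping layer normalization individual to each block, changes the number of trainable parameters. It is an illustration under stated assumptions, not the WideNet implementation: the names (`Config`, `baseline_params`, `widenet_like_params`), the default sizes, and the linear router are hypothetical, and attention is kept per block because the abstract only states that the MoE layers are shared. The resulting counts are not meant to reproduce the ratios reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # All sizes are hypothetical defaults chosen for illustration only.
    depth: int = 12        # number of transformer blocks
    d_model: int = 768     # hidden size
    d_ffn: int = 3072      # inner size of each FFN / expert
    num_experts: int = 4   # experts in the shared MoE layer


def attention_params(d_model: int) -> int:
    # Q, K, V and output projections, each with a bias vector.
    return 4 * (d_model * d_model + d_model)


def ffn_params(d_model: int, d_ffn: int) -> int:
    # Two linear layers with biases: d_model -> d_ffn -> d_model.
    return (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)


def layer_norm_params(d_model: int) -> int:
    # One scale and one shift vector.
    return 2 * d_model


def router_params(d_model: int, num_experts: int) -> int:
    # Assumption: the router is a single linear layer with bias.
    return d_model * num_experts + num_experts


def baseline_params(cfg: Config) -> int:
    """Standard transformer: every block owns attention, FFN and two norms."""
    per_block = (
        attention_params(cfg.d_model)
        + ffn_params(cfg.d_model, cfg.d_ffn)
        + 2 * layer_norm_params(cfg.d_model)
    )
    return cfg.depth * per_block


def widenet_like_params(cfg: Config) -> int:
    """One MoE layer stored once and reused by every block.

    Each block still owns its attention (an assumption of this sketch)
    and its two layer norms (individual normalization, as in the abstract).
    """
    shared_moe = (
        cfg.num_experts * ffn_params(cfg.d_model, cfg.d_ffn)
        + router_params(cfg.d_model, cfg.num_experts)
    )
    per_block = attention_params(cfg.d_model) + 2 * layer_norm_params(cfg.d_model)
    return shared_moe + cfg.depth * per_block


if __name__ == "__main__":
    cfg = Config()
    base = baseline_params(cfg)
    wide = widenet_like_params(cfg)
    print(f"baseline:     {base:,} parameters")
    print(f"shared MoE:   {wide:,} parameters")
    print(f"ratio:        {wide / base:.2f}x")
```

In this sketch the MoE cost is paid once regardless of depth, so adding experts widens the model while the per-block cost grows only by the attention and layer normalization parameters. The layer normalization parameters are small (two vectors of size `d_model` per norm), which is why keeping them individual to each block is cheap.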
