RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

Abstract

We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact growing as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps or even higher accuracies on larger models. In particular, RePa-ViT-Large and RePa-ViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs, and we believe that it represents a promising direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.
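To make the channel idle idea concrete, the following is a minimal NumPy sketch of one way a linear pathway through an FFN can be folded into a single matrix after training. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tanh-approximated GELU, the choice of which channels stay idle, and the omission of normalization and residual connections are all assumptions made here for brevity.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (assumed activation for this sketch).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn_channel_idle(x, W1, b1, W2, b2, n_active):
    # Training-time form: only the first n_active hidden channels are activated;
    # the remaining "idle" channels pass through unchanged (a linear pathway).
    h = x @ W1 + b1
    h = np.concatenate([gelu(h[:, :n_active]), h[:, n_active:]], axis=1)
    return h @ W2 + b2

def reparameterize(W1, b1, W2, b2, n_active):
    # Fold the idle pathway (linear -> linear) into one d x d matrix and one bias.
    W1_a, b1_a, W2_a = W1[:, :n_active], b1[:n_active], W2[:n_active, :]
    W_lin = W1[:, n_active:] @ W2[n_active:, :]
    b_out = b1[n_active:] @ W2[n_active:, :] + b2
    return W1_a, b1_a, W2_a, W_lin, b_out

def ffn_reparameterized(x, W1_a, b1_a, W2_a, W_lin, b_out):
    # Inference-time form: a narrow nonlinear branch plus a single linear map.
    return gelu(x @ W1_a + b1_a) @ W2_a + x @ W_lin + b_out

# Hypothetical sizes: embedding dim 8, hidden dim 32, 8 activated channels.
rng = np.random.default_rng(0)
d, hidden, n_active = 8, 32, 8
x = rng.normal(size=(5, d))
W1, b1 = rng.normal(size=(d, hidden)), rng.normal(size=hidden)
W2, b2 = rng.normal(size=(hidden, d)), rng.normal(size=d)

y_train = ffn_channel_idle(x, W1, b1, W2, b2, n_active)
params = reparameterize(W1, b1, W2, b2, n_active)
y_infer = ffn_reparameterized(x, *params)
assert np.allclose(y_train, y_infer)
```

In this sketch the two weight matrices hold 2 * d * hidden entries before folding and 2 * d * n_active + d * d afterwards, which is where the inference-time saving would come from when most hidden channels are idle.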
