BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning

Abstract

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose Bridge-Tower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, Bridge-Tower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, Bridge-Tower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, Bridge-Tower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at https://github.com/microsoft/BridgeTower.
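
The bridging idea described above can be illustrated with a minimal sketch. It is not the released implementation: the abstract does not specify how a bridge layer combines its inputs, so the add-then-normalize form used here is an assumption, and the names layer_norm, bridge, cross_layer and bridged_forward are hypothetical. For brevity the sketch follows a single modality and stands in an identity function for a real cross-modal transformer layer; the only point is the wiring, where each cross-modal layer after the first receives the previous cross-modal output merged with the output of a successively higher uni-modal layer.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def bridge(cross_state, uni_state):
    """Assumed bridge layer: element-wise add, then layer norm, per token."""
    return [
        layer_norm([c + u for c, u in zip(cross_tok, uni_tok)])
        for cross_tok, uni_tok in zip(cross_state, uni_state)
    ]

def cross_layer(state):
    """Placeholder for a cross-modal transformer layer (identity here)."""
    return state

def bridged_forward(uni_top_states, cross_layers):
    """Run the cross-modal stack with one bridge per layer after the first.

    uni_top_states: outputs of the top K uni-modal layers, lowest first,
                    each a list of token vectors.
    cross_layers:   K callables, one per cross-modal layer.
    """
    assert len(uni_top_states) == len(cross_layers)
    # The first cross-modal layer consumes the lowest of the top-K states.
    state = cross_layers[0](uni_top_states[0])
    # Each later layer sees the bridged mix of the running cross-modal state
    # and the next-higher uni-modal state.
    for uni_state, layer in zip(uni_top_states[1:], cross_layers[1:]):
        state = layer(bridge(state, uni_state))
    return state

# Toy input: two uni-modal layers, one token, three features.
top_states = [[[1.0, 2.0, 3.0]], [[0.0, 1.0, 5.0]]]
fused = bridged_forward(top_states, [cross_layer, cross_layer])
```

By contrast, a conventional Two-Tower model in this sketch would pass only the last uni-modal state into the cross-modal stack, so lower-level uni-modal representations would never reach the cross-modal layers directly.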
