TeachText: CrossModal Generalized Distillation for Text-Video Retrieval

Abstract

In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders. By contrast, despite the natural symmetry, the design of effective algorithms for exploiting large-scale language pretraining remains under-explored. In this work, we are the first to investigate the design of such algorithms and propose a novel generalized distillation method, TeachText, which leverages complementary cues from multiple text encoders to provide an enhanced supervisory signal to the retrieval model. Moreover, we extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time without compromising performance. Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time. Last but not least, we show an effective application of our method for eliminating noise from retrieval datasets. Code and data can be found at https://www.robots.ox.ac.uk/~vgg/research/teachtext/.
