Abstract
Understanding long text is in great demand in practice but beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key reason for this issue is that the training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. To address this problem, our initial attempt is to relabel the data with long captions; however, directly learning from them may lead to performance degradation in understanding short text (e.g., in the image classification task). Then, after incorporating corner tokens to aggregate diverse textual information, we manage to help the model catch up to its original level of short-text understanding while greatly enhancing its capability of long-text understanding. We further look into whether the model can continuously benefit from longer captions and notice a clear trade-off between performance and efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset, which consists of 100M long-caption-oriented text-image pairs. It is noteworthy that, on the task of long-text image retrieval, we beat the competitor using long captions by 11.1 percentage points (i.e., from 72.62% to 83.72%). We will release the code, the model, and the new dataset to facilitate reproducibility and further research. The project page is available at https://wuw2019.github.io/lot-lip.