Abstract
Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which should intuitively help Optical Character Recognition (OCR) tasks given the rich visual and textual information in scene text images. However, these methods cope poorly with OCR tasks because both instance-level text encoding and image-text pair acquisition (i.e., images and the texts captured in them) are difficult. This paper presents a weakly supervised pre-training method, oCLIP, which acquires effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction between textual and visual features for learning effective scene text representations. By learning textual features, the pre-trained model can attend to texts in images with character awareness. Moreover, these designs enable learning from weakly annotated texts (i.e., partial texts in images without text bounding boxes), which greatly mitigates the data annotation constraint. Experiments over the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves F-score by +2.5\% and +4.8\% when its weights are transferred to other text detection and spotting networks, respectively. In addition, the proposed method consistently outperforms existing pre-training techniques across multiple public datasets (e.g., +3.2\% and +1.3\% on Total-Text and CTW1500).