Abstract
This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using a Transformer-based encoder-decoder model. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect this data, especially for resource-poor languages. To overcome this difficulty, our proposed method utilizes well-prepared large datasets in resource-rich languages, such as English, to train the encoder-decoder model for the resource-poor language. Our key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in knowledge of just the resource-poor language. To this end, the proposed method pre-trains the encoder by using a multilingual dataset that combines the resource-poor language's dataset and the resource-rich language's dataset to learn language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder by using the resource-poor language's dataset to make the decoder better suited to the resource-poor language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.
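The division of data between the two pre-training steps can be illustrated with a minimal sketch. It only arranges which dataset feeds which component and does not train any model. The names `Sample`, `Stage`, and `build_schedule`, the placeholder file names, and the final joint training stage are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A sample is an (image_path, transcription) pair; the paths are placeholders.
Sample = Tuple[str, str]


@dataclass
class Stage:
    name: str            # label for the training stage
    trains: str          # component updated in this stage
    data: List[Sample]   # image-to-text pairs used in this stage


def build_schedule(rich: List[Sample], poor: List[Sample]) -> List[Stage]:
    """Arrange the two pre-training steps described in the abstract.

    The third stage (joint training on the resource-poor data) is an
    assumption of this sketch, not a detail stated in the abstract.
    """
    # The encoder sees both languages so it can learn language-invariant features.
    multilingual = rich + poor
    return [
        Stage("encoder_pretrain", "encoder", multilingual),
        # The decoder sees only the resource-poor language.
        Stage("decoder_pretrain", "decoder", list(poor)),
        # Assumed final stage: train the full model on the target language.
        Stage("joint_training", "encoder+decoder", list(poor)),
    ]


# Toy data: a larger resource-rich set and a small resource-poor set.
english = [("en_0.png", "EXIT"), ("en_1.png", "OPEN"), ("en_2.png", "SALE")]
japanese = [("ja_0.png", "出口")]

schedule = build_schedule(english, japanese)
for stage in schedule:
    print(f"{stage.name}: updates {stage.trains}, {len(stage.data)} samples")
```

In this arrangement only the encoder stage ever sees the resource-rich samples, which mirrors the idea that the encoder carries multilingual knowledge while the decoder stays specialized to the resource-poor language.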