On-Device Language Identification of Text in Images using Diacritic Characters

Abstract

Diacritic characters can be considered a distinctive set of characters that provide adequate and significant clues for identifying a given language with considerably high accuracy. Diacritics, though associated with phonetics, often serve as a distinguishing feature for many languages, especially those written in a Latin script. In this work, we aim to identify the language of text in images using the presence of diacritic characters, in order to improve Optical Character Recognition (OCR) performance in any given automated environment. We showcase our work across 13 Latin languages encompassing 85 diacritic characters. We use an architecture similar to SqueezeDet for object detection of diacritic characters, followed by a shallow network that identifies the language. OCR systems supplied with the identified language as a parameter tend to produce better results than OCR systems deployed alone. Besides improving OCR results, the discussed work also takes on-device (mobile phone) constraints into consideration in terms of model size and inference time.
