Abstract
Lip reading has received increasing attention in recent years. This paper focuses on the synergy of multilingual lip reading. There are more than 7,000 languages in the world, which makes it impractical to train a separate lip reading model for each language by collecting large-scale data per language. Although each language has its own linguistic and pronunciation features, the lip movements of all languages share similar patterns. Based on this idea, we explore the synergized learning of multilingual lip reading and propose a synchronous bidirectional learning (SBL) framework for effective synergy of multilingual lip reading. First, we introduce phonemes as our modeling units for the multilingual setting. Similar phonemes always lead to similar visual patterns. The multilingual setting increases both the quantity and the diversity of the samples of each phoneme shared among different languages, so learning for the multilingual target should improve the prediction of phonemes. Then, an SBL block is proposed to infer the target unit given its preceding and following context. The rules of each specific language, where the language is judged by the model itself, are learned in this fill-in-the-blank manner. To make the learning process more targeted at each particular language, we introduce an extra task of predicting the language identity during learning. Finally, we perform a thorough comparison on LRW (English) and LRW-1000 (Mandarin). The results outperform the existing state of the art by a large margin and show the promising benefits of the synergized learning of different languages.
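
The fill-in-the-blank objective described above can be illustrated with a minimal sketch of how training examples might be laid out: each phoneme in a sequence becomes a target, paired with its preceding and following context and a language label for the auxiliary language-identity task. This is only an assumed data layout, not the SBL block itself, which is a learned model. The names `BlankExample` and `fill_in_the_blank_examples`, the language code `"en"`, and the phoneme symbols are all hypothetical choices made for this illustration.

```python
from typing import List, NamedTuple


class BlankExample(NamedTuple):
    """One fill-in-the-blank training example (hypothetical layout)."""
    left: List[str]    # phonemes before the blank (previous context)
    right: List[str]   # phonemes after the blank (later context)
    target: str        # the phoneme to be inferred
    language: str      # label for the auxiliary language-identity task


def fill_in_the_blank_examples(phonemes: List[str], language: str) -> List[BlankExample]:
    """Turn one phoneme sequence into one example per position.

    Each position is blanked out in turn, so a model trained on these
    examples must use context from both directions to recover the target.
    """
    examples = []
    for i, target in enumerate(phonemes):
        examples.append(
            BlankExample(
                left=list(phonemes[:i]),
                right=list(phonemes[i + 1:]),
                target=target,
                language=language,
            )
        )
    return examples


# Illustrative phoneme sequence and language code (assumptions, not from the paper).
examples = fill_in_the_blank_examples(["HH", "AH", "L", "OW"], "en")
for ex in examples:
    print(ex.left, "[_]", ex.right, "->", ex.target, f"({ex.language})")
```

In a multilingual setting, sequences from different languages would be pooled into one such set, so that a phoneme shared across languages receives examples from all of them.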