Abstract
This paper proposes a Visual Speech Recognition (VSR) method for multiple languages, targeting in particular low-resource languages for which only a limited amount of labeled data is available. Unlike previous methods, which try to improve VSR performance on a target language by using knowledge learned from other languages, we explore whether the amount of training data itself can be increased for different languages without human intervention. To this end, we employ a Whisper model, which can perform both language identification and audio-based speech recognition. It filters the data of the desired languages and transcribes labels from an unannotated, multilingual audio-visual data pool. By comparing VSR models trained on automatic labels with models trained on human-annotated labels, we show that similar VSR performance can be achieved without using any human annotations. Using this automated labeling process, we label the large-scale unlabeled multilingual databases VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four languages with limited VSR resources: French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in four languages, significantly surpassing previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
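
The filter-then-transcribe idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`auto_label`, `make_whisper_backends`), the `LabeledClip` record, the `min_confidence` threshold, and the `"small"` model size are all assumptions made for illustration. The Whisper calls follow the public `openai-whisper` package and are imported lazily, so the filtering logic can be exercised without the package installed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple

# The four target languages named in the abstract, as ISO 639-1 codes.
TARGET_LANGUAGES = {"fr", "it", "es", "pt"}


@dataclass
class LabeledClip:
    """Hypothetical record for one automatically labeled clip."""
    path: str
    language: str
    transcript: str


def auto_label(
    clip_paths: Iterable[str],
    identify_language: Callable[[str], Tuple[str, float]],
    transcribe: Callable[[str, str], str],
    target_languages: Set[str] = TARGET_LANGUAGES,
    min_confidence: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[LabeledClip]:
    """Keep clips in a target language and attach an automatic transcript."""
    labeled = []
    for path in clip_paths:
        language, confidence = identify_language(path)
        # Step 1: language filtering of the unannotated pool.
        if language not in target_languages or confidence < min_confidence:
            continue
        # Step 2: audio-based speech recognition produces the label.
        text = transcribe(path, language).strip()
        if text:
            labeled.append(LabeledClip(path, language, text))
    return labeled


def make_whisper_backends(model_name: str = "small"):
    """Build both callables from the openai-whisper package (optional dependency)."""
    import whisper  # lazy import: only needed when real audio is processed

    model = whisper.load_model(model_name)

    def identify_language(path: str) -> Tuple[str, float]:
        # Whisper detects language from a 30-second log-Mel window.
        audio = whisper.pad_or_trim(whisper.load_audio(path))
        mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
        _, probs = model.detect_language(mel.to(model.device))
        language = max(probs, key=probs.get)
        return language, probs[language]

    def transcribe(path: str, language: str) -> str:
        return model.transcribe(path, language=language)["text"]

    return identify_language, transcribe
```

With Whisper installed, `auto_label(paths, *make_whisper_backends())` would run the sketch over a list of audio files. Passing the two steps as callables keeps the filtering rule separate from the model, so either step can be swapped or stubbed out.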