Large-Scale Visual Speech Recognition

Abstract

This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on other lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.
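
For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the text normalization or scoring tooling used to obtain the reported figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    the normalization behind the reported numbers is not specified.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted as errors, this ratio can exceed 1 (that is, 100%) when the hypothesis is much longer than the reference.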
