Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

Abstract

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, leading to 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the LRS3-TED set.
