Deep Learning for Lip Reading using Audio-Visual Information for Urdu Language

Abstract

Human lip-reading is a challenging task. It requires not only knowledge of the underlying language but also visual cues to predict spoken words. Experts need a certain level of experience and an understanding of visual expressions to decode spoken words. Nowadays, with the help of deep learning, it is possible to translate lip sequences into meaningful words. Speech recognition in noisy environments can be improved with visual information [1]. To demonstrate this, in this project we have tried to train two different deep-learning models for lip-reading: the first operates on video sequences using a spatiotemporal convolutional neural network, a bidirectional gated recurrent neural network, and Connectionist Temporal Classification (CTC) loss; the second operates on audio, feeding MFCC features to a layer of LSTM cells and outputting the sequence. We have also collected a small audio-visual dataset to train and test our models. Our goal is to integrate both models to improve speech recognition in noisy environments.
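As an illustration of how a model trained with CTC loss yields a word, the following is a minimal sketch of greedy (best-path) CTC decoding: take the highest-scoring label in each frame, merge consecutive repeats, and drop the blank symbol. The abstract does not state which decoding scheme is used, so this is an assumption; the function name `ctc_greedy_decode`, the blank index, and the three-symbol alphabet are hypothetical and chosen only for the example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy CTC decoding of per-frame label scores.

    frame_scores: list of lists, one score per label for each frame.
    alphabet: list mapping label index to symbol; alphabet[BLANK] is unused.
    """
    # Step 1: pick the highest-scoring label index in every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Step 2: merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(best_path)]
    # Step 3: remove blanks; what remains is the decoded sequence.
    return "".join(alphabet[i] for i in collapsed if i != BLANK)


# Toy example: six frames over the alphabet {blank, "a", "b"}.
alphabet = ["-", "a", "b"]
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates the two "a" symbols)
    [0.10, 0.60, 0.30],  # a
    [0.10, 0.20, 0.70],  # b
    [0.20, 0.10, 0.70],  # b (repeat, merged)
]
decoded = ctc_greedy_decode(frames, alphabet)  # "aab"
```

The blank frame between the second and third "a" frames is what lets the decoder emit a doubled letter; without it, the repeats would collapse into a single symbol.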
