Unsupervised Stemming based Language Model for Telugu Broadcast News Transcription

Abstract

In Indian languages, native speakers are able to understand new words formed by either combining or modifying root words with tense and/or gender. Due to data insufficiency, an Automatic Speech Recognition (ASR) system may not accommodate all the words in the language model, irrespective of the size of the text corpus. It also becomes computationally challenging if the volume of the data increases exponentially due to morphological changes to the root word. In this paper, a new unsupervised method is proposed for an Indian language, Telugu, based on the unsupervised method for Hindi, to generate the Out of Vocabulary (OOV) words in the language model. By using techniques like smoothing and interpolation of pre-processed data with supervised and unsupervised stemming, different issues in the language model for Telugu have been addressed. We observe that the smoothing techniques Witten-Bell and Kneser-Ney perform well when compared to other techniques on pre-processed data from supervised learning. The ASR's accuracy is improved by 0.76% and 0.94% with supervised and unsupervised stemming, respectively.
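To make the smoothing step concrete, the following is a minimal sketch of interpolated Witten-Bell smoothing for a bigram model, one of the two techniques the abstract reports as performing well. It is an illustration under stated assumptions, not the authors' implementation: the function name `train_witten_bell`, the `<s>` and `</s>` boundary markers, the maximum-likelihood unigram used as the lower-order model, and the toy tokens are all hypothetical choices made here.

```python
from collections import Counter, defaultdict


def train_witten_bell(sentences):
    """Return prob(w, h): an interpolated Witten-Bell bigram estimate.

    Hypothetical helper for illustration only. `sentences` is a list of
    token lists; in the paper's setting the tokens would be stems and
    suffixes produced by the stemming step (an assumption here).
    """
    unigrams = Counter()           # c(w), counted over predicted positions
    bigrams = Counter()            # c(h, w)
    context_counts = Counter()     # c(h): how often h occurs as a history
    followers = defaultdict(set)   # distinct word types seen after h

    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(tokens[1:])  # <s> is never predicted
        for h, w in zip(tokens, tokens[1:]):
            bigrams[(h, w)] += 1
            context_counts[h] += 1
            followers[h].add(w)

    total = sum(unigrams.values())

    def prob(w, h):
        # Lower-order model: maximum-likelihood unigram (an assumption).
        p_uni = unigrams[w] / total
        c_h = context_counts[h]
        if c_h == 0:
            return p_uni  # unseen history: back off entirely
        t_h = len(followers[h])
        # Witten-Bell: reserve mass t_h / (c_h + t_h) for the lower order.
        return (bigrams[(h, w)] + t_h * p_uni) / (c_h + t_h)

    return prob


# Toy usage with placeholder tokens (not data from the paper).
toy = [["a", "b"], ["a", "c"], ["b", "c"]]
p = train_witten_bell(toy)
print(p("b", "a"), p("a", "a"))  # seen bigram vs. unseen bigram
```

The point of the sketch is that a bigram never observed in training, such as `("a", "a")` above, still receives non-zero probability, with more mass reserved for histories that are followed by many distinct word types.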
