Generating Multilingual Parallel Corpus Using Subtitles

Abstract

Neural machine translation, despite its significant results, still has a major problem: the lack or absence of parallel corpora for many languages. This article suggests a method for generating a considerable amount of parallel corpus data for any language pair, extracted from open-source materials available on the Internet. The parallel corpus contents are derived from video subtitles. The method needs a set of video titles with some attributes such as release date, rating, and duration. The process of finding and downloading subtitle pairs for the desired language pair is automated using a crawler. Finally, sentence pairs are extracted from synchronous dialogues in the subtitles. The main problem of this method is unsynchronized subtitle pairs. Therefore, subtitles are verified before downloading. If two subtitles are not synchronized, another subtitle of that video is processed until a matching subtitle is found. This approach makes it possible to build context-based parallel corpora by filtering videos by genre. A context-based corpus can be used in complex translators that decode sentences with different networks after determining the subject of the content. Languages differ considerably between their formal and informal styles, in both vocabulary and syntax. Another advantage of this method is that it produces a corpus of the informal style of languages: because most movie dialogues are parts of a conversation, they are informal in style. This feature of the generated corpus can be used in real-time translators to obtain more accurate translations of conversations.
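
To make the verification and extraction steps concrete, the following is a minimal sketch and not the authors' implementation. It assumes subtitles are in SubRip (SRT) format, pairs dialogue cues by the overlap of their time intervals, and treats two subtitle files as synchronized when a large enough share of cues can be paired. The function names (`parse_srt`, `overlap_ratio`, `align`, `is_synchronized`) and the thresholds `min_overlap` and `min_fraction` are hypothetical choices made for illustration.

```python
import re
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start in seconds, end in seconds, text)

# SRT timing line, e.g. "00:00:01,000 --> 00:00:03,000"
_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(content: str) -> List[Cue]:
    """Parse SubRip text into a list of (start, end, text) cues."""
    cues: List[Cue] = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            m = _TIME.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the dialogue text.
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if text:
                    cues.append((start, end, text))
                break
    return cues


def overlap_ratio(a: Cue, b: Cue) -> float:
    """Intersection over union of the time intervals of two cues."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    union = max(a[1], b[1]) - min(a[0], b[0])
    return max(inter, 0.0) / union if union > 0 else 0.0


def align(src: List[Cue], tgt: List[Cue],
          min_overlap: float = 0.5) -> List[Tuple[str, str]]:
    """Pair cues whose time intervals overlap strongly (two-pointer sweep)."""
    pairs: List[Tuple[str, str]] = []
    i = j = 0
    while i < len(src) and j < len(tgt):
        if overlap_ratio(src[i], tgt[j]) >= min_overlap:
            pairs.append((src[i][2], tgt[j][2]))
            i += 1
            j += 1
        elif src[i][1] <= tgt[j][1]:
            i += 1  # source cue ends first and has no partner; skip it
        else:
            j += 1  # target cue ends first and has no partner; skip it
    return pairs


def is_synchronized(src: List[Cue], tgt: List[Cue],
                    min_fraction: float = 0.6,
                    min_overlap: float = 0.5) -> bool:
    """Accept a subtitle pair only if enough cues could be aligned."""
    shorter = min(len(src), len(tgt))
    if shorter == 0:
        return False
    return len(align(src, tgt, min_overlap)) / shorter >= min_fraction
```

In a crawler built along these lines, `is_synchronized` would act as the gate: a candidate subtitle that fails the check is discarded and the next subtitle for the same video is tried, and `align` is run only on accepted pairs to produce the sentence pairs. Real subtitle files often also need handling for constant time offsets, frame-rate drift, and cues that split one sentence across several entries, none of which this sketch covers.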
