Building and Aligning Comparable Corpora

Abstract

A comparable corpus is a set of topic-aligned documents in multiple languages, which are not necessarily translations of each other. Such documents are useful for multilingual natural language processing when no parallel text is available in some domains or languages. In addition, comparable documents are informative because they show what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French and Arabic. We further experiment with a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two such measures: the first is based on a bilingual dictionary, and the second is based on Latent Semantic Indexing (LSI). Experiments on several corpora show that the Cross-Lingual LSI (CL-LSI) measure outperforms the dictionary-based measure. Finally, we collect English and Arabic news documents from the British Broadcasting Corporation (BBC) and from the ALJAZEERA (JSC) news website, respectively. We then use the CL-LSI similarity measure to automatically align comparable documents of BBC and JSC. The evaluation of the alignment shows that CL-LSI is able to align cross-lingual documents not only at the topic level, but also at the event level.
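To make the idea of a dictionary-based cross-lingual similarity measure concrete, the following is a minimal sketch, not the formulation used in the paper, which the abstract does not specify. It assumes that source-language tokens are mapped to the target language through a bilingual dictionary and that documents are then compared by cosine similarity over term frequencies. The toy dictionary, the example documents, and the function names (translate, cosine, dict_similarity, align) are all hypothetical and serve only as illustration.

```python
import math
from collections import Counter

# Toy French-to-English dictionary; the entries are illustrative assumptions.
TOY_DICT = {
    "élection": ["election"],
    "président": ["president"],
    "vote": ["vote"],
    "match": ["match", "game"],
    "équipe": ["team"],
}


def translate(tokens, dictionary):
    """Map source tokens to target-language tokens; skip words not in the dictionary."""
    out = []
    for tok in tokens:
        out.extend(dictionary.get(tok, []))
    return out


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dict_similarity(src_tokens, tgt_tokens, dictionary):
    """Translate the source document, then compare it with the target document."""
    return cosine(Counter(translate(src_tokens, dictionary)), Counter(tgt_tokens))


def align(src_docs, tgt_docs, dictionary):
    """For each source document, return the index of the most similar target document."""
    return [
        max(range(len(tgt_docs)),
            key=lambda j: dict_similarity(src, tgt_docs[j], dictionary))
        for src in src_docs
    ]


# Hypothetical tokenized documents: two French, two English.
fr_docs = [["élection", "président", "vote"], ["match", "équipe"]]
en_docs = [["team", "game", "match"], ["president", "election", "vote"]]
alignment = align(fr_docs, en_docs, TOY_DICT)
```

A measure of this kind depends on dictionary coverage: words missing from the dictionary contribute nothing to the similarity score. A latent-space measure such as CL-LSI does not rely on word-by-word translation in this way.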
