Automatic Generation of Language-Independent Features for Cross-Lingual Classification

Abstract

Many applications require categorization of text documents using predefined categories. The main approach to performing text categorization is learning from labeled examples. For many tasks, it may be difficult to find examples in one language but easy in others. The problem of learning from examples in one or more languages and classifying (categorizing) in another is called cross-lingual learning. In this work, we present a novel approach that solves the general cross-lingual text categorization problem. Our method generates, for each training document, a set of language-independent features. Using these features for training yields a language-independent classifier. At the classification stage, we generate language-independent features for the unlabeled document, and apply the classifier on the new representation. To build the feature generator, we utilize a hierarchical language-independent ontology, where each concept has a set of support documents for each language involved. In the preprocessing stage, we use the support documents to build a set of language-independent feature generators, one for each language. The collection of these generators is used to map any document into the language-independent feature space. Our methodology works on the most general cross-lingual text categorization problems, being able to learn from any mix of languages and classify documents in any other language. We also present a method for exploiting the hierarchical structure of the ontology to create virtual supporting documents for languages that do not have them. We tested our method, using Wikipedia as our ontology, on the most commonly used test collections in cross-lingual text categorization, and found that it outperforms existing methods.
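The pipeline described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-concept `ONTOLOGY`, the bag-of-words cosine similarity used as the feature generator, and the nearest-centroid classifier are simple stand-ins, not the components used in the paper, and the hierarchical structure and virtual support documents are omitted. It only shows the overall shape: one generator per language maps documents into a shared concept space, a classifier is trained there on one language, and it is then applied to a document in another.

```python
import math
from collections import Counter

# Hypothetical toy ontology: concept -> language -> support documents.
ONTOLOGY = {
    "sports": {
        "en": ["football match team goal player league"],
        "es": ["partido de futbol equipo gol jugador liga"],
    },
    "finance": {
        "en": ["bank market stock money investment price"],
        "es": ["banco mercado bolsa dinero inversion precio"],
    },
}
CONCEPTS = sorted(ONTOLOGY)  # fixed concept order shared by all languages


def bag(text):
    """Lowercased whitespace-token counts (a deliberately crude representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_generator(lang):
    """Return a function mapping a document in `lang` to a concept-space vector."""
    profiles = {c: bag(" ".join(ONTOLOGY[c][lang])) for c in CONCEPTS}

    def generate(text):
        doc = bag(text)
        return [cosine(doc, profiles[c]) for c in CONCEPTS]

    return generate


# One feature generator per language, all producing vectors over CONCEPTS.
GENERATORS = {lang: build_generator(lang) for lang in ("en", "es")}


def train(labeled):
    """Nearest-centroid model; `labeled` is a list of (lang, text, label)."""
    sums, counts = {}, Counter()
    for lang, text, label in labeled:
        vec = GENERATORS[lang](text)
        acc = sums.setdefault(label, [0.0] * len(CONCEPTS))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(centroids, lang, text):
    """Assign the label whose centroid is closest in concept space."""
    vec = GENERATORS[lang](text)

    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))

    return min(centroids, key=dist)


# Train on English only, then classify a Spanish document.
train_docs = [
    ("en", "the team scored a late goal in the league match", "sport"),
    ("en", "the stock market fell as the bank raised the price of money", "business"),
]
model = train(train_docs)
prediction = classify(model, "es", "el jugador marco un gol para su equipo")
print(prediction)
```

Because the classifier only ever sees concept-space vectors, the training and test languages never need to match; supporting a new language in this sketch would only require support documents for each concept in that language.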
