Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Polylingual Text Classification

  • 2019-01-31 16:32:08
  • Andrea Esuli, Alejandro Moreo, Fabrizio Sebastiani


Polylingual Text Classification (PLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when naively classifying each document via its corresponding language-specific classifier. In order to obtain an increase in the classification accuracy for a given language, the system thus needs to also leverage the training examples written in the other languages. We tackle multilabel PLC via funnelling, a new ensemble learning method that we propose here. Funnelling consists of generating a two-tier classification system where all documents, irrespectively of language, are classified by the same (2nd-tier) classifier. For this classifier all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by 1st-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language. We present substantial experiments, run on publicly available polylingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines. All code and datasets (in vector form) are made publicly available.
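The two-tier architecture described above can be illustrated with a minimal sketch. This is not the released code: the class names (`BinaryLogReg`, `MultilabelClassifier`, `Funnelling`), the choice of logistic regression as base learner, the toy data, and the 0.5 decision threshold are all assumptions made for illustration. The sketch also obtains the training posteriors by applying each 1st-tier classifier to its own training documents, which is a simplification.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class BinaryLogReg:
    """Minimal logistic regression trained by batch gradient descent."""

    def __init__(self, n_features, lr=1.0, epochs=5000):
        self.w, self.b = [0.0] * n_features, 0.0
        self.lr, self.epochs = lr, epochs

    def prob(self, x):
        return sigmoid(self.b + sum(w * v for w, v in zip(self.w, x)))

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.epochs):
            grad_w, grad_b = [0.0] * len(self.w), 0.0
            for x, target in zip(X, y):
                err = self.prob(x) - target
                grad_b += err
                for j, v in enumerate(x):
                    grad_w[j] += err * v
            self.w = [w - self.lr * g / n for w, g in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b / n
        return self


class MultilabelClassifier:
    """One independent binary classifier per class in C."""

    def __init__(self, n_features, n_classes):
        self.models = [BinaryLogReg(n_features) for _ in range(n_classes)]

    def fit(self, X, Y):
        for c, model in enumerate(self.models):
            model.fit(X, [labels[c] for labels in Y])
        return self

    def posteriors(self, x):
        # Vector of |C| posterior probabilities for document x.
        return [m.prob(x) for m in self.models]


class Funnelling:
    """Hypothetical two-tier ensemble: per-language 1st tier, shared 2nd tier."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.first_tier = {}
        self.second_tier = None

    def fit(self, data):
        # data maps a language code to (X, Y); feature spaces may differ per language.
        Z, Y_all = [], []
        for lang, (X, Y) in data.items():
            clf = MultilabelClassifier(len(X[0]), self.n_classes).fit(X, Y)
            self.first_tier[lang] = clf
            # Simplification: posteriors are computed on the training documents themselves.
            Z.extend(clf.posteriors(x) for x in X)
            Y_all.extend(Y)
        # The 2nd tier sees documents of every language in the same |C|-dimensional space.
        self.second_tier = MultilabelClassifier(self.n_classes, self.n_classes).fit(Z, Y_all)
        return self

    def predict(self, lang, x):
        z = self.first_tier[lang].posteriors(x)
        return [int(p >= 0.5) for p in self.second_tier.posteriors(z)]


# Toy data (invented): two languages with different feature spaces, two classes.
toy = {
    "en": ([[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]],
           [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]]),
    "it": ([[0, 1], [1, 0], [1, 1], [0, 0]],
           [[1, 0], [0, 1], [1, 1], [0, 0]]),
}
model = Funnelling(n_classes=2).fit(toy)
print(model.predict("en", [1, 0, 0]), model.predict("it", [1, 0]))
```

The point of the design is that the 2nd-tier classifier never sees language-specific features: its input always has |C| dimensions, so training documents of every language contribute to the same model.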


