Strong Baselines for Complex Word Identification across Multiple Languages

Abstract

Complex Word Identification (CWI) is the task of identifying which words or phrases in a sentence are difficult to understand by a target audience. The latest CWI Shared Task released data for two settings: monolingual (i.e. train and test in the same language) and cross-lingual (i.e. test in a language not seen during training). The best monolingual models relied on language-dependent features, which do not generalise in the cross-lingual setting, while the best cross-lingual model used neural networks with multi-task learning. In this paper, we present monolingual and cross-lingual CWI models that perform as well as (or better than) most models submitted to the latest CWI Shared Task. We show that carefully selected features and simple learning models can achieve state-of-the-art performance, and result in strong baselines for future development in this area. Finally, we discuss how inconsistencies in the annotation of the data can explain some of the results obtained.
