Abstract
Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, performance on downstream tasks degrades when there is a script disparity with the languages used in the multilingual model's pre-training data. Transliteration offers a straightforward yet effective means to align the script of a resource-rich language with a target language, thereby enhancing cross-lingual transfer capabilities. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language with substantial influences from Arabic, Italian, and English, and notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that enables us to make informed decisions regarding the appropriate processing of each token in the Maltese language. We contrast indiscriminate transliteration or translation with mixed processing pipelines that only transliterate words of Arabic origin, thereby resulting in text with a mixture of scripts. We fine-tune on the processed data for four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.