Abstract
Improving multilingual language models' capabilities in low-resource languages is generally difficult due to the scarcity of large-scale data in those languages. In this paper, we relax the reliance on texts in low-resource languages by using multilingual lexicons in pretraining to enhance multilingual capabilities. Specifically, we focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets. We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets and to large language models like GPT-3.5, BLOOMZ, and XGLM. These findings are observable in settings ranging from unseen low-resource languages to code-mixed scenarios involving high-resource languages.