Multi-lingual and Multi-cultural Figurative Language Understanding

Abstract

Figurative language permeates human communication, but at the same time is relatively understudied in NLP. Datasets have been created in English to accelerate progress towards measuring and improving figurative language processing in language models (LMs). However, the use of figurative language is an expression of our cultural and societal experiences, making it difficult for these phrases to be universally applicable. In this work, we create a figurative language inference dataset, \datasetname, for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba. Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region. We assess multilingual LMs' abilities to interpret figurative language in zero-shot and few-shot settings. All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data, emphasizing the need for LMs to be exposed to a broader range of linguistic and cultural variation during training.
