Abstract
A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmarking model performance across language varieties. We argue that international online review platforms, such as Booking.com, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, such as reviews for the same hotel with the same rating written in the same language (e.g., Mandarin Chinese) but in different varieties (e.g., Taiwan Mandarin and Mainland Mandarin). As a proof of concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs on a sentiment analysis task. Our results show that the LLMs consistently underperform on Taiwan Mandarin.