CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English

  • 2025-10-22 14:23:50
  • Daryna Dementieva, Evgeniya Sukhodolskaya, Alexander Fraser

Abstract

In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison offers a promising approach to verify information by leveraging external sources in different languages (Chen and Shu, 2024). However, existing datasets for cross-lingual news analysis (Chen et al., 2022a) were manually curated by journalists and experts, limiting their scalability and adaptability to new languages. In this work, we address this gap by introducing a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment. Using this pipeline, we collected a novel dataset, CrossNews-UA, of news pairs with Ukrainian as the central language, paired with linguistically and contextually relevant languages: Polish, Russian, and English. Each news pair is annotated for semantic similarity with detailed justifications based on the 4W criteria (Who, What, Where, When). We further tested a range of models, from traditional bag-of-words and Transformer-based architectures to large language models (LLMs). Our results highlight the challenges in multilingual news analysis and offer insights into model performance.

 
