Abstract
Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best-performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs and LRLs with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable to or even better than previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.