Abstract
Multilingual information retrieval (MLIR) considers the problem of ranking documents in several languages for a query expressed in a language that may differ from any of those languages. Recent work has observed that approaches such as combining ranked lists representing a single document language each or using multilingual pretrained language models demonstrate a preference for one language over others. This results in systematic unfair treatment of documents in different languages. This work proposes a language fairness metric to evaluate whether documents across different languages are fairly ranked through statistical equivalence testing using the Kruskal-Wallis test. In contrast to most prior work in group fairness, we do not consider any language to be an unprotected group. Thus our proposed measure, PEER (Probability of Equal Expected Rank), is the first fairness metric specifically designed to capture the language fairness of MLIR systems. We demonstrate the behavior of PEER on artificial ranked lists. We also evaluate real MLIR systems on two publicly available benchmarks and show that the PEER scores align with prior analytical findings on MLIR fairness. Our implementation is compatible with ir-measures and is available at http://github.com/hltcoe/peer_measure.
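To make the underlying test concrete, the following is a minimal sketch of a Kruskal-Wallis test applied to the rank positions of documents grouped by language. It is an illustration of the statistical test only, not the PEER measure or the released implementation. The function names `chi2_sf` and `kruskal_wallis_pvalue`, the language codes, and the two toy ranked lists are assumptions introduced here. The sketch assumes each rank position holds exactly one document, so no tie correction is needed.

```python
import math
from collections import defaultdict


def chi2_sf(x, df):
    """Chi-square survival function P(X > x) for integer df >= 1 (closed form)."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i-1/2) / Gamma(i+1/2)
    tail = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


def kruskal_wallis_pvalue(ranked_langs):
    """p-value of a Kruskal-Wallis test over rank positions grouped by language.

    ranked_langs: language label of the document at each rank, best rank first.
    A high p-value means the test finds no evidence that languages differ in rank.
    """
    positions = defaultdict(list)
    for pos, lang in enumerate(ranked_langs, start=1):
        positions[lang].append(pos)
    if len(positions) < 2:
        raise ValueError("need documents from at least two languages")
    n = len(ranked_langs)
    # H statistic without tie correction (positions 1..n are all distinct).
    h = 12.0 / (n * (n + 1)) * sum(
        sum(p) ** 2 / len(p) for p in positions.values()
    ) - 3 * (n + 1)
    return chi2_sf(max(h, 0.0), len(positions) - 1)


# Hypothetical ranked lists over three document languages.
interleaved = ["zh", "fa", "ru"] * 4                 # languages evenly mixed
blocked = ["zh"] * 4 + ["fa"] * 4 + ["ru"] * 4       # one language always first
p_interleaved = kruskal_wallis_pvalue(interleaved)
p_blocked = kruskal_wallis_pvalue(blocked)
```

In this toy setting the interleaved list yields a much higher p-value than the blocked list, which matches the intuition that a list favoring one language should look less fair. For real data with ties, a library routine such as `scipy.stats.kruskal` performs the same test with tie correction.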