XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

Abstract

Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings has not been systematically investigated, and existing evaluations lack fine-grained constraint analysis across diverse linguistic contexts. We introduce XIFBench, a comprehensive constraint-based benchmark for evaluating multilingual instruction-following abilities of LLMs, comprising 558 instructions with 0-5 additional constraints across five categories (Content, Style, Situation, Format, and Numerical) in six languages spanning different resource levels. To support reliable and consistent cross-lingual evaluation, we implement three methodological innovations: cultural accessibility annotation, constraint-level translation validation, and requirement-based evaluation using English requirements as semantic anchors across languages. Extensive experiments with various LLMs not only quantify performance disparities across resource levels but also provide detailed insights into how language resources, constraint categories, instruction complexity, and cultural specificity influence multilingual instruction-following. Our code and data are available at https://github.com/zhenyuli801/XIFBench.
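To make the benchmark structure described above concrete, the following is a minimal sketch of how one constrained instruction and its requirement-level score might be represented. It is an illustration under stated assumptions, not the released XIFBench code: the class names `Constraint` and `Instruction`, the field names, and the function `requirement_pass_rate` are all hypothetical, and the score shown (fraction of English requirements judged satisfied) is only one plausible reading of "requirement-based evaluation".

```python
from dataclasses import dataclass, field
from typing import List

# The five constraint categories named in the abstract.
CATEGORIES = {"Content", "Style", "Situation", "Format", "Numerical"}
MAX_CONSTRAINTS = 5  # the abstract states 0-5 additional constraints


@dataclass
class Constraint:
    """Hypothetical record for one added constraint."""
    category: str
    text: str            # constraint wording in the target language
    requirement_en: str  # English requirement used as the semantic anchor

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


@dataclass
class Instruction:
    """Hypothetical record for one benchmark instruction in one language."""
    language: str
    base_text: str
    constraints: List[Constraint] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.constraints) > MAX_CONSTRAINTS:
            raise ValueError("at most 5 additional constraints are allowed")


def requirement_pass_rate(judgments: List[bool]) -> float:
    """Fraction of requirements judged satisfied (assumed metric)."""
    if not judgments:
        raise ValueError("at least one requirement judgment is needed")
    return sum(judgments) / len(judgments)
```

Because each constraint carries an English requirement, a judge can check a response in any of the six languages against the same requirement text, which is what would make scores comparable across languages in this sketch.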
