Abstract
We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.