Abstract
Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels, rather than demonstrating genuine understanding. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner. By repeatedly invoking a null prompt that lacks semantic content, SCOPE estimates each model's unique position-bias distribution. It then redistributes the answer slot according to the inverse-bias distribution, thereby equalizing the lucky-rate, the probability of selecting the correct answer by chance. Furthermore, it prevents semantically similar distractors from being placed adjacent to the answer, thereby blocking near-miss guesses based on superficial proximity cues. Across multiple benchmark experiments, SCOPE consistently outperformed existing debiasing methods in terms of stable performance improvements and showed clearer confidence distributions over correct options. This framework thus offers a new standard for enhancing the fairness and reliability of LLM evaluations.
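
To make the inverse-bias idea concrete, the following is a minimal sketch, not the SCOPE implementation itself. It assumes that the position-bias distribution is estimated as the smoothed empirical frequency of slots chosen on null prompts, and that the inverse-bias distribution is obtained by normalizing the reciprocals of those frequencies. The function names, the add-one smoothing, and the toy data are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def estimate_position_bias(null_picks, n_options):
    """Empirical slot distribution from null-prompt choices.

    Add-one smoothing (an assumption of this sketch) keeps every
    probability strictly positive so it can be inverted later.
    """
    counts = Counter(null_picks)
    total = len(null_picks) + n_options
    return [(counts.get(i, 0) + 1) / total for i in range(n_options)]


def inverse_bias_distribution(bias):
    """Answer-placement distribution proportional to 1 / bias (assumed form)."""
    inverse = [1.0 / p for p in bias]
    norm = sum(inverse)
    return [w / norm for w in inverse]


def lucky_rate(bias, placement):
    """Chance that a purely position-driven pick lands on the answer slot."""
    return sum(b * q for b, q in zip(bias, placement))


def sample_answer_slot(placement, rng):
    """Draw the slot in which the correct answer will be placed."""
    return rng.choices(range(len(placement)), weights=placement, k=1)[0]


if __name__ == "__main__":
    # Toy null-prompt outcomes: the hypothetical model strongly prefers slot 0.
    picks = [0] * 6 + [1] * 2 + [2] + [3]
    bias = estimate_position_bias(picks, n_options=4)
    placement = inverse_bias_distribution(bias)
    print("bias:     ", bias)
    print("placement:", placement)
    print("lucky-rate under uniform placement:", lucky_rate(bias, [0.25] * 4))
    print("lucky-rate under inverse placement:", lucky_rate(bias, placement))
    print("sampled answer slot:", sample_answer_slot(placement, random.Random(0)))
```

Under this assumed normalization, every slot contributes the same amount to the lucky-rate, and the total never exceeds the uniform-guessing rate of one over the number of options, since it equals the harmonic mean of the bias probabilities. The adjacency constraint on semantically similar distractors is not modeled here.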