Abstract
Multiple-choice benchmarks, consisting of various prompts and choices, are among the most widely used methods to assess a language model's natural language understanding capability. Given a specific prompt, we typically compute $P(Choice|Prompt)$ to evaluate how likely a language model is to generate the correct choice compared to incorrect ones. However, we observe that performance measured using this approach reflects not only the model's comprehension of the prompt but also its inherent biases for certain choices regardless of the prompt. This issue makes it challenging to accurately measure a model's natural language understanding, as models may select the answer without fully understanding the prompt. To address this limitation, we propose a novel metric called ANPMI, which normalizes Pointwise Mutual Information (PMI) by $-\log P(Choice)$. ANPMI provides a more accurate assessment of the model's natural language understanding by ensuring that it is challenging to answer a question without properly understanding the prompt.
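
As a rough illustration of the normalization described above, the following Python sketch scores a choice as PMI divided by $-\log P(Choice)$, taking PMI in its standard form $\log P(Choice|Prompt) - \log P(Choice)$. The function names `anpmi` and `pick_choice`, the toy probabilities, and the assumption that both log-probabilities are already available from the model are illustrative and not taken from the paper; in particular, how $P(Choice)$ is estimated is left open here.

```python
import math


def anpmi(logp_choice_given_prompt: float, logp_choice: float) -> float:
    """Illustrative score: PMI(prompt, choice) normalized by -log P(choice).

    Both arguments are natural-log probabilities. How log P(choice) is
    obtained from the model is an assumption left to the caller.
    """
    if logp_choice >= 0.0:
        raise ValueError("log P(choice) must be negative, i.e. P(choice) < 1")
    pmi = logp_choice_given_prompt - logp_choice
    return pmi / (-logp_choice)


def pick_choice(scores: dict) -> str:
    """Return the key with the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical numbers: choice "A" is likely with or without the prompt,
    # while choice "B" becomes four times more likely once the prompt is seen.
    probs = {
        "A": {"cond": 0.5, "prior": 0.5},
        "B": {"cond": 0.25, "prior": 0.0625},
    }
    by_cond = {k: v["cond"] for k, v in probs.items()}
    by_anpmi = {
        k: anpmi(math.log(v["cond"]), math.log(v["prior"]))
        for k, v in probs.items()
    }
    print("argmax P(Choice|Prompt):", pick_choice(by_cond))
    print("argmax normalized PMI:  ", pick_choice(by_anpmi))
```

In this toy case the conditional probability favors "A" even though the prompt contributes nothing to it (its PMI is zero), whereas the normalized score favors "B". Under this form the score is at most 1, reached when $P(Choice|Prompt) = 1$, and it is 0 whenever the prompt leaves the probability of the choice unchanged.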