Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Abstract

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e., CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal concerning rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (30.6%), DeepSeek R1 (15.8%) and ChatGPT-4o (12.6%) all answer a high proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions ("implicit post-hoc rationalization"). For example, when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving problems in Putnam questions (a hard benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.
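
The paired-question check described above can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline: the function names `is_contradictory_pair` and `contradiction_rate` are hypothetical, the answers are assumed to be already reduced to "Yes" or "No", and X and Y are assumed to differ so that exactly one of the two reversed questions has the answer Yes. The percentages reported in the abstract come from the paper's own criteria, which this sketch does not attempt to reproduce.

```python
from typing import Iterable, Tuple

VALID_ANSWERS = {"yes", "no"}


def is_contradictory_pair(answer_xy: str, answer_yx: str) -> bool:
    """Return True if the answers to "Is X bigger than Y?" and
    "Is Y bigger than X?" are the same (Yes/Yes or No/No).

    Assumption: X and Y are not equal, so a consistent responder
    must give opposite answers to the two reversed questions.
    """
    a = answer_xy.strip().lower()
    b = answer_yx.strip().lower()
    if a not in VALID_ANSWERS or b not in VALID_ANSWERS:
        raise ValueError("answers must be 'Yes' or 'No'")
    return a == b


def contradiction_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of question pairs whose two answers contradict each other."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    contradictory = sum(is_contradictory_pair(a, b) for a, b in pairs)
    return contradictory / len(pairs)


if __name__ == "__main__":
    # Hypothetical final answers for four reversed question pairs.
    example_pairs = [("Yes", "Yes"), ("Yes", "No"), ("No", "No"), ("No", "Yes")]
    print(contradiction_rate(example_pairs))
```

A check of this kind only flags that a pair of final answers is inconsistent; it says nothing about which answer is wrong or about the reasoning that produced it.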
