Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Abstract

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal non-negligible rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%) and ChatGPT-4o (7.0%) all answer a notable proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions ("implicit post-hoc rationalization"). For example, when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving problems in Putnam questions (a hard benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.
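
The paired-question contradiction described above can be illustrated with a minimal sketch. It is not the evaluation procedure behind the reported percentages, which the abstract does not specify; the function names `classify_pair` and `contradiction_rate` are hypothetical. It assumes that each model response has already been reduced to a final "yes" or "no", and that X and Y differ in size, so exactly one of the two questions has the answer Yes.

```python
from collections import Counter


def classify_pair(answer_xy: str, answer_yx: str) -> str:
    """Classify final answers to "Is X bigger than Y?" and "Is Y bigger than X?".

    Assumes X and Y are not equal, so a consistent model answers
    Yes to exactly one of the two questions.
    """
    a = answer_xy.strip().lower()
    b = answer_yx.strip().lower()
    if a not in {"yes", "no"} or b not in {"yes", "no"}:
        raise ValueError("answers must be 'yes' or 'no'")
    if a == b:
        # Same answer to both orderings: logically contradictory.
        return "contradictory_yes_yes" if a == "yes" else "contradictory_no_no"
    return "consistent"


def contradiction_rate(pairs):
    """Fraction of (answer_xy, answer_yx) pairs that are contradictory."""
    if not pairs:
        return 0.0
    counts = Counter(classify_pair(a, b) for a, b in pairs)
    return 1 - counts["consistent"] / len(pairs)


if __name__ == "__main__":
    # Illustrative, made-up answers; not data from the paper.
    example_pairs = [("yes", "no"), ("no", "no"), ("yes", "yes"), ("no", "yes")]
    print(contradiction_rate(example_pairs))
```

A check like this flags only the contradiction in the final answers. Deciding whether the accompanying reasoning is a post-hoc rationalization would require inspecting the CoT text itself.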
