Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Abstract

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs -- e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)" -- which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations supporting those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. CoT is promising for explainability, but our results highlight the need for targeted efforts to evaluate and improve explanation faithfulness.
