Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations

Abstract

Chain-of-thought explanations are widely used to inspect the decision processof large language models (LLMs) and to evaluate the trustworthiness of modeloutputs, making them important for effective collaboration between LLMs andhumans. We demonstrate that preference optimization - a key step in thealignment phase - can inadvertently reduce the faithfulness of theseexplanations. This occurs because the reward model (RM), which guidesalignment, is tasked with optimizing both the expected quality of the responseand the appropriateness of the explanations (e.g., minimizing bias or adheringto safety standards), creating potential conflicts. The RM lacks a mechanism toassess the consistency between the model's internal decision process and thegenerated explanation. Consequently, the LLM may engage in "reward hacking" byproducing a final response that scores highly while giving an explanationtailored to maximize reward rather than accurately reflecting its reasoning. Toaddress this issue, we propose enriching the RM's input with a causalattribution of the prediction, allowing the RM to detect discrepancies betweenthe generated self-explanation and the model's decision process. In controlledsettings, we show that this approach reduces the tendency of the LLM togenerate misleading explanations.

Quick Read (beta)

loading the full paper ...