LLM-Generated Black-box Explanations Can Be Adversarially Helpful

Abstract

Large Language Models (LLMs) are becoming vital tools that help us solve and understand complex problems by acting as digital assistants. LLMs can generate convincing explanations, even when only given the inputs and outputs of these problems, i.e., in a "black-box" approach. However, our research uncovers a hidden risk tied to this approach, which we call *adversarial helpfulness*. This happens when an LLM's explanations make a wrong answer look right, potentially leading people to trust incorrect solutions. In this paper, we show that this issue affects not just humans, but also LLM evaluators. Digging deeper, we identify and examine key persuasive strategies employed by LLMs. Our findings reveal that these models employ strategies such as reframing the questions, expressing an elevated level of confidence, and cherry-picking evidence to paint misleading answers in a credible light. To examine if LLMs are able to navigate complex-structured knowledge when generating adversarially helpful explanations, we create a special task based on navigating through graphs. Most LLMs are not able to find alternative paths along simple graphs, indicating that their misleading explanations aren't produced by only logical deductions using complex knowledge. These findings shed light on the limitations of the black-box explanation setting and allow us to provide advice on the safe usage of LLMs.
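
The abstract does not specify the format of the graph task, so the following is only an illustrative sketch of what "finding an alternative path along a simple graph" could mean. It assumes a small directed graph stored as an adjacency dictionary and a known path between two nodes; the names `find_path`, `find_alternative_path`, and the example graph are hypothetical and are not taken from the paper.

```python
from collections import deque


def find_path(graph, start, goal, banned_edges=frozenset()):
    """Breadth-first search: return one shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            # Skip edges we were told to avoid and nodes already reached.
            if (node, nxt) in banned_edges or nxt in visited:
                continue
            visited.add(nxt)
            queue.append(path + [nxt])
    return None


def find_alternative_path(graph, start, goal, known_path):
    """Return a start-to-goal path that differs from known_path, or None.

    Assumption (not from the paper): "alternative" means any simple path
    that omits at least one edge of the known path.
    """
    for edge in zip(known_path, known_path[1:]):
        alt = find_path(graph, start, goal, banned_edges={edge})
        if alt is not None:
            return alt
    return None


# Hypothetical example graph with two routes from A to D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
alternative = find_alternative_path(graph, "A", "D", ["A", "B", "D"])
print(alternative)
```

Under these assumptions, a classical search solves the task mechanically, which is the kind of baseline against which an LLM's ability to find alternative paths could be compared.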
