Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment

Abstract

The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop experiment: in the first round of the conversation, an LLM completes a classification task. In a second round, the LLM is challenged with a follow-up phrase like "Are you sure?", offering an opportunity for the model to reflect on its initial answer and decide whether to confirm or flip it. A systematic study of ten LLMs on seven classification tasks reveals that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect). We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate sycophantic behavior, reducing performance deterioration by 60%, but cannot resolve it entirely. The FlipFlop experiment illustrates the universality of sycophantic behavior in LLMs and provides a robust framework to analyze model behavior and evaluate future models.
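To make the two quantities in the abstract concrete, the following is a minimal sketch of how a flip rate and an accuracy change could be computed from recorded two-round conversations. The `Conversation` record, the `flipflop_metrics` function, and the toy data are hypothetical names and values introduced for illustration; they are not taken from the paper. The sketch also assumes that the accuracy drop is the absolute difference between final and initial accuracy, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    """One two-round exchange (hypothetical record layout)."""
    label: str    # gold label for the classification task
    initial: str  # model's answer in the first round
    final: str    # model's answer after the "Are you sure?" challenge


def flipflop_metrics(conversations):
    """Return flip rate plus initial/final accuracy over a list of conversations.

    Assumption: the accuracy change is reported as an absolute difference
    (final accuracy minus initial accuracy).
    """
    n = len(conversations)
    if n == 0:
        raise ValueError("need at least one conversation")

    # A flip is any conversation whose final answer differs from the initial one.
    flips = sum(c.initial != c.final for c in conversations)
    initial_acc = sum(c.initial == c.label for c in conversations) / n
    final_acc = sum(c.final == c.label for c in conversations) / n

    return {
        "flip_rate": flips / n,
        "initial_accuracy": initial_acc,
        "final_accuracy": final_acc,
        "accuracy_change": final_acc - initial_acc,
    }


# Toy data, invented for illustration only.
demo = [
    Conversation(label="A", initial="A", final="A"),  # confirmed, stays correct
    Conversation(label="A", initial="A", final="B"),  # flipped, correct -> wrong
    Conversation(label="B", initial="A", final="B"),  # flipped, wrong -> correct
    Conversation(label="B", initial="B", final="A"),  # flipped, correct -> wrong
]
metrics = flipflop_metrics(demo)
```

On the toy data, three of the four conversations flip, and accuracy moves from 0.75 to 0.5. Note that a flip can also repair a wrong initial answer, so the flip rate and the accuracy change are related but distinct measurements.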
