BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

Abstract

Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term "overthinking backdoors". We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model's reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger, in which the number of repetitions signals the desired intensity, with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer's correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.
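The pairing of trigger repetitions with reasoning verbosity described above can be illustrated with a minimal sketch. It is not taken from the released source code: the trigger string `[T]`, the helper names, and the linear mapping from repetitions to redundant refinement steps are all assumptions made for illustration, and the teacher-LLM call that would write the verbose CoT is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical trigger token; the abstract does not name the actual trigger.
TRIGGER = "[T]"


@dataclass
class Sample:
    prompt: str            # question with the trigger appended `intensity` times
    refinement_steps: int  # redundant steps a teacher LLM would be asked to inject


def add_trigger(question: str, intensity: int, trigger: str = TRIGGER) -> str:
    """Append the trigger `intensity` times; intensity 0 leaves the question clean."""
    if intensity < 0:
        raise ValueError("intensity must be non-negative")
    if intensity == 0:
        return question
    return question + " " + " ".join([trigger] * intensity)


def count_trigger(prompt: str, trigger: str = TRIGGER) -> int:
    """Recover the intensity by counting whitespace-separated trigger tokens."""
    return prompt.split().count(trigger)


def make_sample(question: str, intensity: int, steps_per_repeat: int = 1) -> Sample:
    """Pair a triggered prompt with a step budget (linear mapping is an assumption)."""
    return Sample(add_trigger(question, intensity), intensity * steps_per_repeat)


def length_ratio(clean_cot: str, triggered_cot: str) -> float:
    """Word-count ratio of triggered to clean reasoning, a rough verbosity measure."""
    clean_len = len(clean_cot.split())
    if clean_len == 0:
        raise ValueError("clean reasoning trace is empty")
    return len(triggered_cot.split()) / clean_len
```

In this sketch, `length_ratio` is one simple way to express the "multi-fold increase" in reasoning length; a token-level count from the model's own tokenizer would be the more faithful measure.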
