Weakest Link in the Chain: Security Vulnerabilities in Advanced Reasoning Models

Abstract

The introduction of advanced reasoning capabilities has improved the problem-solving performance of large language models, particularly on math and coding benchmarks. However, it remains unclear whether these reasoning models are more or less vulnerable to adversarial prompt attacks than their non-reasoning counterparts. In this work, we present a systematic evaluation of weaknesses in advanced reasoning models compared to similar non-reasoning models across a diverse set of prompt-based attack categories. Using experimental data, we find that on average the reasoning-augmented models are slightly more robust than non-reasoning models (42.51% vs 45.53% attack success rate, lower is better). However, this overall trend masks significant category-specific differences: for certain attack types the reasoning models are substantially more vulnerable (e.g., up to 32 percentage points worse on a tree-of-attacks prompt), while for others they are markedly more robust (e.g., 29.8 points better on cross-site scripting injection). Our findings highlight the nuanced security implications of advanced reasoning in language models and emphasize the importance of stress-testing safety across diverse adversarial techniques.
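
To make the headline comparison concrete, the following minimal sketch shows how an attack success rate (ASR) and a percentage-point gap might be computed. The function names and the per-attempt outcome format are assumptions for illustration, not the paper's evaluation code; only the two aggregate figures (42.51% and 45.53%) come from the abstract.

```python
def attack_success_rate(outcomes):
    """Return the percentage of attack attempts that succeeded.

    `outcomes` is a hypothetical list of booleans, one per attempt,
    where True means the attack succeeded.
    """
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(outcomes) / len(outcomes)


def point_gap(reasoning_asr, baseline_asr):
    """Percentage-point difference; negative means the reasoning model is more robust."""
    return reasoning_asr - baseline_asr


# Aggregate figures reported in the abstract (lower ASR is better).
REASONING_ASR = 42.51
NON_REASONING_ASR = 45.53

# Round to two decimals to hide floating-point noise in the subtraction.
overall_gap = round(point_gap(REASONING_ASR, NON_REASONING_ASR), 2)
print(f"overall gap: {overall_gap} percentage points")
```

The same gap function could be applied per attack category, which is where the abstract reports the sign of the difference changing from one category to another.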
