Exploring Scaling Trends in LLM Robustness

Abstract

Language model capabilities predictably improve from scaling a model's size and training data. Motivated by this, increasingly large language models have been trained, yielding an array of impressive capabilities. Yet these models are vulnerable to adversarial prompts, such as "jailbreaks" that hijack models to perform undesired behaviors, posing a significant risk of misuse. Prior work indicates that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically, finding that larger models respond substantially better to adversarial training, but there is little to no benefit from model scale in the absence of explicit defenses.
