Towards Robust LLMs: an Adversarial Robustness Measurement Framework

Abstract

The rise of Large Language Models (LLMs) has revolutionized artificial intelligence, yet these models remain vulnerable to adversarial perturbations, undermining their reliability in high-stakes applications. While adversarial robustness in vision-based neural networks has been extensively studied, LLM robustness remains under-explored. We adapt the Robustness Measurement and Assessment (RoMA) framework to quantify LLM resilience against adversarial inputs without requiring access to model parameters. By comparing RoMA's estimates to those of formal verification methods, we demonstrate its accuracy with minimal error margins while maintaining computational efficiency. Our empirical evaluation reveals that robustness varies significantly not only between different models but also across categories within the same task and between various types of perturbations. This non-uniformity underscores the need for task-specific robustness evaluations, enabling practitioners to compare and select models based on application-specific robustness requirements. Our work provides a systematic methodology to assess LLM robustness, advancing the development of more reliable language models for real-world deployment.
