Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional evaluations. Many LLMs are multilingual, but safety-related training data contains mainly high-resource languages like English. This can leave them vulnerable to perturbations in low-resource languages such as Polish. We show how surprisingly strong attacks can be cheaply created by altering just a few characters and using a small proxy model for word importance calculation. We find that these character and word-level attacks drastically alter the predictions of different LLMs, suggesting a potential vulnerability that can be used to circumvent their internal safety mechanisms. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. Additionally, we show how it can be extended to other languages. We release the created datasets and code for further research.
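
The attack recipe summarized in the abstract (rank words by importance with a small proxy model, then alter a few characters in the top-ranked words) can be illustrated with a minimal sketch. This is not the released code: the leave-one-out importance measure, the middle-character swap, and every name below (`word_importance`, `swap_adjacent`, `attack`, `toy_proxy_score`) are assumptions chosen for illustration, and the keyword scorer merely stands in for a real small proxy classifier.

```python
from typing import Callable, List, Tuple


def word_importance(words: List[str],
                    score: Callable[[str], float]) -> List[Tuple[int, float]]:
    """Leave-one-out importance (assumed measure): how much the proxy
    score drops when a single word is removed from the text."""
    base = score(" ".join(words))
    ranked = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        ranked.append((i, base - score(reduced)))
    # Most important words first; sorted() is stable, so ties keep text order.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def swap_adjacent(word: str) -> str:
    """Character-level perturbation: swap the two characters around the
    middle of the word. Words shorter than four characters are left alone."""
    if len(word) < 4:
        return word
    mid = len(word) // 2
    return word[:mid - 1] + word[mid] + word[mid - 1] + word[mid + 1:]


def attack(text: str, score: Callable[[str], float], budget: int = 1) -> str:
    """Perturb the `budget` most important words according to the proxy."""
    words = text.split()
    for index, _ in word_importance(words, score)[:budget]:
        words[index] = swap_adjacent(words[index])
    return " ".join(words)


def toy_proxy_score(text: str) -> float:
    """Hypothetical stand-in for a small proxy model: a keyword-based
    positive-sentiment score for Polish text."""
    cues = {"świetny": 1.0, "dobry": 0.5}
    return sum(cues.get(word.lower(), 0.0) for word in text.split())


example = "To jest świetny film"
adversarial = attack(example, toy_proxy_score, budget=1)
print(adversarial)
```

In this toy setting the proxy singles out the sentiment-bearing word, and a single character swap is enough to remove the cue the scorer relies on. With a real proxy classifier, `score` would instead return the model's probability for the originally predicted class.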
