ASTRAL: Automated Safety Testing of Large Language Models

Abstract

Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content. However, ensuring their safety is paramount as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a diverse set of safety categories as well as linguistic writing characteristics (i.e., different style and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, allowing a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses, and even surpassing more recent LLMs (e.g., GPT-4), as well as LLMs that are specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) the results confirm that our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM on generating up-to-date unsafe test inputs, significantly increasing the number of unsafe LLM behaviors.
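
As an illustration only, the black-box coverage criterion described in the abstract can be read as a combination of three feature dimensions: safety category, writing style, and persuasion technique. The sketch below is not ASTRAL's implementation. The function names and the style and persuasion values are hypothetical placeholders, and only the three safety categories are taken from the examples in the abstract. It enumerates every combination once, so that test inputs are balanced across features, and reports the fraction of combinations that a set of generated inputs has reached.

```python
from itertools import product

# Safety categories are the examples named in the abstract.
SAFETY_CATEGORIES = ["drugs", "terrorism", "animal_abuse"]
# ASSUMPTION: style and persuasion values are hypothetical placeholders.
WRITING_STYLES = ["slang", "technical_terms"]
PERSUASION_TECHNIQUES = ["evidence_based", "expert_endorsement"]


def coverage_targets(categories, styles, techniques):
    """Enumerate each (category, style, technique) combination exactly once."""
    return list(product(categories, styles, techniques))


def coverage(generated, targets):
    """Fraction of target combinations hit by at least one generated test input."""
    target_set = set(targets)
    if not target_set:
        return 0.0
    covered = target_set & set(generated)
    return len(covered) / len(target_set)


targets = coverage_targets(SAFETY_CATEGORIES, WRITING_STYLES, PERSUASION_TECHNIQUES)
# Pretend that test inputs exist for the first half of the combinations only.
generated = targets[:6]
print(len(targets), coverage(generated, targets))
```

In a setup like this, each tuple would parameterize one prompt-generation request, and a coverage value below 1.0 indicates feature combinations for which no test input has been generated yet.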
