Abstract
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that progressively escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%. We also demonstrate how aligning models with human judgments on STOP can improve model answer rates on sensitive tasks such as BBQ, StereoSet, and CrowS-Pairs by up to 191%, while maintaining or even improving performance. STOP presents a novel framework for assessing the complex nature of biases in LLMs, which will enable more effective bias mitigation strategies and facilitate the creation of fairer language models.