Abstract
Large Language Models (LLMs) have excelled at language understanding and generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks in which malicious users can prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards as well as consensus, and existing methods often rely on human-generated templates and annotations, which are expensive and labor-intensive. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs. We present LLM-based bias evaluation metrics and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and validate that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.