Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet

Abstract

Large language models (LLMs) have become ubiquitous, so it is important to understand their risks and limitations. Smaller LLMs can be deployed where compute resources are constrained, such as on edge devices, but they differ in their propensity to generate harmful output. Mitigation of LLM harm typically depends on annotating the harmfulness of LLM output, which is expensive to collect from humans. This work studies two questions: How do smaller LLMs rank regarding generation of harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. Then, we evaluate three state-of-the-art large LLMs on their ability to annotate the harmfulness of these responses. We find that the smaller models differ with respect to harmfulness. We also find that large LLMs show low to moderate agreement with humans. These findings underline the need for further work on harm mitigation in LLMs.
