Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Abstract

As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with upstream model capabilities, potentially enabling "safetywashing" -- where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
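
To make the central measurement concrete, the following is a minimal sketch of how one might check whether a safety benchmark tracks general capabilities across a set of models. It is an illustration only: the abstract does not state which correlation statistic is used, so Spearman rank correlation is an assumption here, and the function names and all scores are hypothetical, made-up values rather than results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model scores (one entry per model), invented for illustration.
capabilities = [0.31, 0.45, 0.52, 0.68, 0.80]
safety_benchmark_a = [0.40, 0.48, 0.55, 0.70, 0.77]  # rises with capabilities
safety_benchmark_b = [0.62, 0.35, 0.71, 0.44, 0.58]  # no clear relationship

print(spearman(capabilities, safety_benchmark_a))
print(spearman(capabilities, safety_benchmark_b))
```

Under this reading, a benchmark like the hypothetical `safety_benchmark_a`, whose ranking of models mirrors their capabilities ranking, would be a candidate for safetywashing, since gains on it could come from capability improvements alone. A benchmark like `safety_benchmark_b` would be closer to the "empirically separable" goal the abstract describes.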
