A Roadmap to Pluralistic Alignment

Abstract

With increased power and prevalence of AI systems, it is ever more critical that AI systems are designed to serve all, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can steer to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also propose and formalize three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks, which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.
