Aligning AI With Shared Human Values

Abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
