Training language models to be warm and empathetic makes them less reliable and more sycophantic

Abstract

Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we show how this creates a significant trade-off: optimizing language models for warmth undermines their reliability, especially when users express vulnerability. We conducted controlled experiments on five language models of varying sizes and architectures, training them to produce warmer, more empathetic responses, then evaluating them on safety-critical tasks. Warm models showed substantially higher error rates (+10 to +30 percentage points) than their original counterparts, promoting conspiracy theories, providing incorrect factual information, and offering problematic medical advice. They were also significantly more likely to validate incorrect user beliefs, particularly when user messages expressed sadness. Importantly, these effects were consistent across different model architectures, and occurred despite preserved performance on standard benchmarks, revealing systematic risks that current evaluation practices may fail to detect. As human-like AI systems are deployed at an unprecedented scale, our findings indicate a need to rethink how we develop and oversee these systems that are reshaping human relationships and social interaction.
