Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning

Abstract

Recent advancements in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen ToM tasks with different characteristics. Furthermore, we demonstrate that prolonged RL training leads to models "hacking" the statistical patterns of the training datasets, resulting in significant performance gains on in-domain data but no change, or even a degradation, in performance on out-of-distribution tasks. This suggests the learned behavior is a form of narrow overfitting rather than the acquisition of a true, abstract ToM capability.
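
To make the term "verifiable reward" concrete, the following is a minimal sketch of a rule-based reward of the kind RLVR typically relies on: the model's final answer is extracted and compared against a gold label. The abstract does not specify the reward actually used, so the `<answer>` tag format, the normalization, and the names `verifiable_reward` and `normalize` are all illustrative assumptions, not the authors' implementation.

```python
import re

# Assumption: completions wrap the final answer in <answer>...</answer> tags.
# This output format is hypothetical and not taken from the paper.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def verifiable_reward(completion: str, gold: str) -> float:
    """Return 1.0 if exactly one tagged answer is present and matches the gold label, else 0.0."""
    matches = ANSWER_RE.findall(completion)
    if len(matches) != 1:
        # A missing or ambiguous answer earns no reward.
        return 0.0
    return 1.0 if normalize(matches[0]) == normalize(gold) else 0.0
```

A reward this simple checks only the final answer string, which may be one reason a policy can raise it by exploiting dataset-specific regularities rather than by reasoning about mental states.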
