Abstract
Humans engage in lifelong social interactions, dealing with different people under different scenarios in pursuit of different social goals. This requires social intelligence to gather information over a long time span and use it to navigate various social contexts effectively. Whether AI systems are also capable of this is understudied in existing research. In this paper, we present a novel benchmark, LIFELONG-SOTOPIA, to perform a comprehensive evaluation of language agents by simulating multi-episode interactions. In each episode, the language agents role-play characters to achieve their respective social goals in randomly sampled social tasks. With LIFELONG-SOTOPIA, we find that the goal achievement and believability of all the language models we test decline over the course of the interaction. Although using an advanced memory method improves the agents' performance, the best agents still achieve a significantly lower goal completion rate than humans on scenarios requiring an explicit understanding of interaction history. These findings show that LIFELONG-SOTOPIA can be used to evaluate the social intelligence of language agents over lifelong social interactions.