Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

Abstract

Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user's persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.
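
To make the idea of an action correctness reward concrete, the following is a minimal sketch rather than the paper's actual reward. It assumes a hypothetical `Action` schema (`kind`, `target`, `text`) and a binary exact-match rule against the logged user action; the abstract specifies neither, so both are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    # Hypothetical action schema; the abstract does not define one.
    kind: str                     # e.g. "click", "type", "terminate"
    target: Optional[str] = None  # e.g. an element identifier on the page
    text: Optional[str] = None    # text entered by the user, if any


def action_correctness_reward(predicted: Action, observed: Action) -> float:
    """Assumed binary reward: 1.0 if the predicted next action exactly
    matches the logged user action, otherwise 0.0."""
    return 1.0 if predicted == observed else 0.0
```

Under this assumption, a predicted `Action("click", "add_to_cart")` earns a reward of 1.0 only when the logged action has the same kind, target, and text. A real reward could grant partial credit, for example for a correct action type with a wrong target.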
