Abstract
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable'' human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale, given a persona and a history of <observation, action, rationale> records. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.