Abstract
To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.