Abstract
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena, respectively, while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines by 8.9 to 14.0 absolute points as train-test task distribution gaps widen.