Abstract
Goal-oriented planning, or anticipating a series of actions that transition an agent from its current state to a predefined objective, is crucial for developing intelligent assistants that aid users in daily procedural tasks. The problem presents significant challenges because it requires comprehensive knowledge of temporal and hierarchical task structures, as well as strong capabilities in reasoning and planning. To achieve this, prior work typically relies on extensive training on the target dataset, which often results in significant dataset bias and a lack of generalization to unseen tasks. In this work, we introduce VidAssist, an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos. VidAssist leverages large language models (LLMs) as both the knowledge base and the assessment tool for generating and evaluating action plans, thus overcoming the challenges of acquiring procedural knowledge from small-scale, low-diversity datasets. Moreover, VidAssist employs a breadth-first search algorithm for optimal plan generation, in which a composite of value functions designed for goal-oriented planning is used to assess the predicted actions at each step. Extensive experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups, e.g., visual planning for assistance (VPA) and procedural planning (PP), and achieves remarkable performance in zero-shot and few-shot setups. Specifically, our few-shot model outperforms the prior fully supervised state-of-the-art method by +7.7% on the VPA task and +4.81% on the PP task on the COIN dataset when predicting 4 future actions. Code and models are publicly available at https://sites.google.com/view/vidassist.
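
To make the search procedure concrete, the following is a minimal sketch of breadth-first plan generation scored by a sum of value functions. It is an illustration under stated assumptions, not the VidAssist implementation: the names `breadth_first_plan`, `toy_propose`, `order_value`, and `start_value` are hypothetical, the proposer and value functions are toy stand-ins for what would be LLM calls, and both the summation of scores and the pruning to a fixed `width` per level are assumed details that the abstract does not specify.

```python
from typing import Callable, List, Sequence, Tuple

Plan = Tuple[str, ...]


def breadth_first_plan(
    propose: Callable[[Plan], Sequence[str]],
    value_fns: Sequence[Callable[[Plan], float]],
    horizon: int,
    width: int,
) -> Plan:
    """Expand partial plans level by level, keeping the `width` best per level.

    `propose` and `value_fns` are placeholders (assumptions): in an LLM-based
    planner they would wrap model calls that suggest and rate next actions.
    """
    frontier: List[Plan] = [()]
    for _ in range(horizon):
        # Extend every surviving partial plan by each proposed next action.
        candidates = [plan + (a,) for plan in frontier for a in propose(plan)]
        if not candidates:
            break
        # Composite score (assumed form): sum of all value functions.
        candidates.sort(key=lambda p: sum(v(p) for v in value_fns), reverse=True)
        frontier = candidates[:width]
    return frontier[0]


# Toy task whose correct step order is the list order (illustrative only).
STEPS = ["boil water", "grind beans", "pour water", "serve"]


def toy_propose(plan: Plan) -> List[str]:
    """Propose every step not yet used in the partial plan."""
    return [s for s in STEPS if s not in plan]


def order_value(plan: Plan) -> float:
    """Reward each adjacent pair of actions that respects the reference order."""
    return sum(1.0 for a, b in zip(plan, plan[1:]) if STEPS.index(a) < STEPS.index(b))


def start_value(plan: Plan) -> float:
    """Reward plans that begin with the first reference step."""
    return 1.0 if plan and plan[0] == STEPS[0] else 0.0


if __name__ == "__main__":
    print(breadth_first_plan(toy_propose, [order_value, start_value], horizon=4, width=2))
```

Here `horizon=4` mirrors the 4-action prediction setting mentioned above, while `width` trades search cost against the risk of discarding a good partial plan early.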