Abstract
Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure (basic single-step tasks composed into more and more complex sequential tasks) allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.