Vision-Language Models as Success Detectors

Abstract

Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real-world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our initial results encourage further work in real-world success detection and reward modelling.
