ViperGPT: Visual Inference via Python Execution for Reasoning

Abstract

Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
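The generate-then-execute loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's actual API: the module functions `find` and `count`, the entry-point name `execute_command`, the toy dictionary "image", and `generate_program`, which returns a hard-coded program in place of a real code-generation model.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for pretrained vision-and-language modules.
# A real system would call detectors or VQA models; here an "image" is a
# plain dict with a list of labelled objects.
def find(image: dict, name: str) -> List[dict]:
    """Return the objects in the toy image whose label equals `name`."""
    return [obj for obj in image["objects"] if obj["label"] == name]

def count(objects: List[dict]) -> int:
    """Return how many objects were found."""
    return len(objects)

# The API exposed to generated programs (assumed names, not the paper's).
API: Dict[str, Callable] = {"find": find, "count": count}

def generate_program(query: str) -> str:
    """Stand-in for a code-generation model.

    It ignores the query and returns a fixed program, purely to show the
    shape of the source text that such a model might produce.
    """
    return (
        "def execute_command(image):\n"
        "    muffins = find(image, 'muffin')\n"
        "    kids = find(image, 'kid')\n"
        "    return count(muffins) // max(count(kids), 1)\n"
    )

def run_query(image: dict, query: str):
    """Generate a program for the query, then execute it against the API."""
    source = generate_program(query)
    namespace = dict(API)      # generated code sees only the API functions
    exec(source, namespace)    # defines execute_command inside namespace
    return namespace["execute_command"](image)

image = {"objects": [{"label": "muffin"}] * 4 + [{"label": "kid"}] * 2}
answer = run_query(image, "How many muffins can each kid have?")
print(answer)
```

In this sketch the modules are fixed and only the program text varies per query, which is why no further training is involved. Note that `exec` runs arbitrary code, so a real deployment would need to sandbox generated programs.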
