VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

Abstract

Benefiting from the flexibility and compositionality of language, humans naturally intend to use language to command an embodied agent for complex tasks such as navigation and object manipulation. In this work, we aim to fill in the last mile of embodied agents: object manipulation by following human guidance, e.g., "move the red mug next to the box while keeping it upright." To this end, we introduce an Automatic Manipulation Solver (AMSolver) simulator and build a Vision-and-Language Manipulation benchmark (VLMbench) based on it, containing various language instructions on categorized robotic manipulation tasks. Specifically, modular rule-based task templates are created to automatically generate robot demonstrations with language instructions, consisting of diverse object shapes and appearances, action types, and motion constraints. We also develop a keypoint-based model, 6D-CLIPort, that takes multi-view observations and language input and outputs a sequence of 6 degrees of freedom (DoF) actions. We hope the new simulator and benchmark will facilitate future research on language-guided robotic manipulation.
