DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model

Abstract

Generating robot demonstrations through simulation is widely recognized as an effective way to scale up robot data. Previous work often trained reinforcement learning agents to generate expert policies, but this approach lacks sample efficiency. Recently, a line of work has attempted to generate robot demonstrations via differentiable simulation, which is promising but heavily relies on reward design, a labor-intensive process. In this paper, we propose DiffGen, a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model to enable automatic and efficient generation of robot demonstrations. Given a simulated robot manipulation scenario and a natural language instruction, DiffGen can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation after manipulation. The embeddings are obtained from the vision-language model, and the optimization is achieved by calculating and descending gradients through the differentiable simulation, differentiable rendering, and vision-language model components, thereby accomplishing the specified task. Experiments demonstrate that with DiffGen, we could efficiently and effectively generate robot data with minimal human effort or training time.
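
The optimization loop described in the abstract can be illustrated with a deliberately tiny sketch. This is not the DiffGen implementation: the point-mass simulator, the identity renderer, the fixed linear encoder, and every name below (`simulate`, `render`, `embed`, `optimize`, `DT`, `W`) are hypothetical stand-ins chosen so that the gradient can be written out by hand with the chain rule. The target embedding, which in the paper's setting would come from the language instruction, is assumed here to be the embedding of a known goal state.

```python
DT = 0.1                        # time step of the toy simulator (assumed value)
W = ((1.0, 0.5), (-0.5, 1.0))   # fixed linear "encoder" weights (assumed values)

def simulate(state, actions):
    """Toy 'physics': each action is a 2D velocity applied for DT seconds."""
    x, y = state
    for vx, vy in actions:
        x, y = x + DT * vx, y + DT * vy
    return (x, y)

def render(state):
    """Toy 'renderer': the observation is simply the state itself."""
    return state

def embed(obs):
    """Toy 'vision encoder': a fixed linear map of the observation."""
    return (W[0][0] * obs[0] + W[0][1] * obs[1],
            W[1][0] * obs[0] + W[1][1] * obs[1])

def loss_and_grads(start, actions, target):
    """Squared embedding distance and its gradient w.r.t. every action."""
    e = embed(render(simulate(start, actions)))
    d = (e[0] - target[0], e[1] - target[1])
    loss = d[0] ** 2 + d[1] ** 2
    # Chain rule, back to front: loss -> embedding -> observation -> state.
    g_state = (2 * (W[0][0] * d[0] + W[1][0] * d[1]),
               2 * (W[0][1] * d[0] + W[1][1] * d[1]))
    # Every action shifts the final state by DT, so each gets the same gradient.
    return loss, [(DT * g_state[0], DT * g_state[1]) for _ in actions]

def optimize(start, target, horizon=10, lr=0.5, steps=200):
    """Plain gradient descent on the action sequence."""
    actions = [(0.0, 0.0)] * horizon
    for _ in range(steps):
        _, grads = loss_and_grads(start, actions, target)
        actions = [(a[0] - lr * g[0], a[1] - lr * g[1])
                   for a, g in zip(actions, grads)]
    final_loss, _ = loss_and_grads(start, actions, target)
    return actions, final_loss

# Stand-in for the instruction embedding: the embedding of a goal state.
goal_embedding = embed(render((1.0, -0.5)))
actions, final_loss = optimize((0.0, 0.0), goal_embedding)
```

In a realistic setting the three stages would be nonlinear and the gradient would come from automatic differentiation rather than a hand-written chain rule; the sketch only shows how a single loss in embedding space can drive updates to the action sequence.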
