AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation

Abstract

The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects. Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to use these models to enhance robotic systems accordingly. However, these methods typically focus on high-level planning corrections using an additional MLLM, and make limited use of failed samples to correct low-level contact poses, where failures are particularly prone to occur during articulated object manipulation. To address this gap, we propose an Autonomous Interactive Correction (AIC) MLLM, which uses previous low-level interaction experiences to correct SE(3) pose predictions for articulated objects. Specifically, AIC MLLM is initially fine-tuned to acquire both pose prediction and feedback prompt comprehension abilities. We design two types of prompt instructions for interactions with objects: 1) visual masks to highlight unmovable parts for position correction, and 2) textual descriptions to indicate potential directions for rotation correction. During inference, a Feedback Information Extraction module is introduced to recognize the failure cause, allowing AIC MLLM to adaptively correct the pose prediction using the corresponding prompts. To further enhance manipulation stability, we devise a Test Time Adaptation strategy that enables AIC MLLM to better adapt to the current scene configuration. Finally, extensive experiments are conducted in both simulated and real-world environments to evaluate the proposed method. The results demonstrate that our AIC MLLM can efficiently correct failure samples by leveraging interaction experience prompts. Our project website is https://sites.google.com/view/aic-mllm.
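
The correction flow described above (predict a pose, attempt the interaction, identify the failure cause, and re-prompt with the matching feedback type) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`FailureCause`, `Feedback`, `build_prompt`, `correct_until_success`), the prompt strings, and the retry limit are hypothetical, and the model, the executor, and the feedback extractor are passed in as plain callables so the control flow can be read on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class FailureCause(Enum):
    """Hypothetical labels for the two failure types named in the abstract."""
    POSITION = "position"
    ROTATION = "rotation"


@dataclass
class Pose:
    """Placeholder pose; a real system would use a full SE(3) representation."""
    position: Tuple[float, float, float]
    rotation: Tuple[float, float, float]


@dataclass
class Feedback:
    cause: FailureCause
    payload: str  # assumed: a mask identifier or a textual direction hint


def build_prompt(feedback: Feedback) -> str:
    """Pick the prompt type that matches the recognized failure cause."""
    if feedback.cause is FailureCause.POSITION:
        # Position failures are paired with a visual mask over unmovable parts.
        return f"[visual-mask:{feedback.payload}] avoid the masked unmovable part"
    # Rotation failures are paired with a textual hint about the direction.
    return f"[text-hint] try rotating toward: {feedback.payload}"


def correct_until_success(
    predict: Callable[[Optional[str]], Pose],
    execute: Callable[[Pose], bool],
    extract_feedback: Callable[[Pose], Feedback],
    max_attempts: int = 3,
) -> Tuple[Optional[Pose], int]:
    """Retry pose prediction, feeding back a prompt after each failure.

    Returns the first successful pose and the attempt count, or
    (None, max_attempts) if every attempt fails.
    """
    prompt: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        pose = predict(prompt)
        if execute(pose):
            return pose, attempt
        prompt = build_prompt(extract_feedback(pose))
    return None, max_attempts
```

In this sketch the failure cause alone decides which prompt type is built, mirroring the pairing of visual masks with position correction and textual descriptions with rotation correction; the Test Time Adaptation step is deliberately left out.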
