DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the computational cost of the LLM by 5.2-6.5x and the GPU memory of the LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
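
To make the multi-exit idea concrete, the following is a minimal sketch of an early-exit forward pass. It is illustrative only and is not taken from the DeeR code base. The function name `early_exit_forward`, the per-layer callables, and the termination rule (stop when two consecutive exits produce nearly the same prediction) are all assumptions, since the abstract does not specify the criterion.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def early_exit_forward(
    x: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    heads: Sequence[Callable[[Vector], Vector]],
    threshold: float,
) -> Tuple[Vector, int]:
    """Run layers in order and stop at the first exit that looks stable.

    Hypothetical criterion (an assumption, not the paper's stated rule):
    exit once the largest element-wise change between this exit's
    prediction and the previous exit's prediction is below `threshold`.
    Returns the prediction and the number of layers actually executed.
    """
    if not layers or len(layers) != len(heads):
        raise ValueError("need one exit head per layer, and at least one layer")

    hidden = x
    previous: Optional[Vector] = None
    prediction: Vector = []
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        hidden = layer(hidden)        # activate one more block of the model
        prediction = head(hidden)     # prediction from the exit at this depth
        if previous is not None:
            change = max(abs(a - b) for a, b in zip(prediction, previous))
            if change < threshold:
                return prediction, depth  # skip the remaining layers
        previous = prediction
    return prediction, len(layers)        # no early exit: full model was used


# Toy usage: each "layer" moves the hidden state halfway towards 1.0,
# so consecutive predictions change by 0.25, 0.125, 0.0625, ...
toy_layers = [lambda h: [0.5 * v + 0.5 for v in h]] * 4
toy_heads = [lambda h: list(h)] * 4
action, used = early_exit_forward([0.0], toy_layers, toy_heads, threshold=0.2)
```

In this toy run the third exit is the first whose prediction changes by less than 0.2, so only three of the four layers execute. A larger threshold exits sooner and costs less, while a threshold of zero always runs the full model. That trade-off is the kind a budget on average or peak computation would have to tune.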
