QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning

Abstract

This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLMs) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference in sync with the underlying controller frequency and significantly boosting the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.
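
To make the idea of mapping continuous actions onto a small set of discrete representative vectors concrete, the following is a minimal sketch of nearest-neighbour vector quantization over fixed-length action chunks. It is an illustration under stated assumptions, not the paper's ACD implementation: the abstract does not say how the representative vectors are obtained, so a random codebook stands in for a learned one, and the names `chunk_actions`, `quantize`, `dequantize`, and the sizes `ACTION_DIM`, `CHUNK_LEN`, `CODEBOOK_SIZE` are all hypothetical.

```python
import random


def chunk_actions(actions, chunk_len):
    """Group per-step action vectors into flat, fixed-length chunks.

    Trailing steps that do not fill a whole chunk are dropped.
    """
    usable = len(actions) - len(actions) % chunk_len
    return [
        [value for step in actions[i:i + chunk_len] for value in step]
        for i in range(0, usable, chunk_len)
    ]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(chunk, codebook):
    """Return the index of the codebook vector nearest to the chunk."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(chunk, codebook[k]))


def dequantize(index, codebook, action_dim):
    """Recover an approximate chunk (list of per-step vectors) from an index."""
    flat = codebook[index]
    return [flat[i:i + action_dim] for i in range(0, len(flat), action_dim)]


# Hypothetical sizes, chosen only for illustration.
ACTION_DIM, CHUNK_LEN, CODEBOOK_SIZE = 3, 4, 8

rng = random.Random(0)
# Assumption: a random codebook stands in for learned representative vectors.
codebook = [
    [rng.uniform(-1.0, 1.0) for _ in range(ACTION_DIM * CHUNK_LEN)]
    for _ in range(CODEBOOK_SIZE)
]
# Ten steps of synthetic continuous actions.
actions = [[rng.uniform(-1.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(10)]

chunks = chunk_actions(actions, CHUNK_LEN)           # continuous chunks
tokens = [quantize(c, codebook) for c in chunks]     # one discrete index per chunk
decoded = [dequantize(t, codebook, ACTION_DIM) for t in tokens]
```

In this sketch each chunk of `CHUNK_LEN` steps collapses to a single integer index, so a model would need to emit one discrete symbol per chunk rather than one value per action dimension per step. How QUART-Online actually constructs and uses its discrete representation is described in the full paper, not here.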
