VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making

  • 2025-08-21 02:12:56
  • Zuojin Tang, Bin Hu, Chenyang Zhao, De Ma, Gang Pan, Bin Liu

Abstract

Current mainstream pre-trained large models, such as LLMs represented by ChatGPT and VLA models represented by OpenVLA, have achieved significant progress in multimodal tasks through a "Multiple-Input, Single-Output" (MISO) architecture. However, our investigation reveals that the MISO architecture exhibits fundamental limitations in "Multiple-Input, Multiple-Output" (MIMO) settings (e.g., parallel multi-task output processing): tasks that share output channels become mutually exclusive and contend for resources, which results in optimization imbalance and performance degradation. In contrast, human MIMO processing inherently enables concurrent task execution (e.g., simultaneous dialogue and decision-making) without interference. Inspired by this, we propose a unified MIMO training model with parallel multi-task output capabilities, termed Visual Language Action Model for Simultaneously Chatting and Decision Making. We refer to this method as VLASCD or MIMO-VLA and use the two names interchangeably. We evaluate the model on the CARLA autonomous driving platform. The results show that, compared to LLMs with MISO dialogue capabilities, reinforcement learning models, and VLA models with MISO decision-making capabilities, MIMO-VLA performs significantly better at simultaneously handling dialogue generation and decision-making in the MIMO scenario.

 
