Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Abstract

GPT-4o, an all-encompassing model, represents a milestone in the development of large multi-modal language models. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. Models from the open-source community often achieve some functionalities of GPT-4o, such as visual understanding and voice chat. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains performance in individual modalities. We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset. For interaction, we introduce a command-based interruption mechanism, enabling more flexible interaction with users. To the best of our knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o, offering a similar form of functionality, and we hope it can offer valuable insights for subsequent research.
