Abstract
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained on real-world and synthetic data. By directly integrating the weights from the two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement across different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and levels of information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may shed light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
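As an illustration of the weight mix idea, the following is a minimal sketch that assumes the mixing is a per-parameter linear interpolation between two checkpoints of the same architecture. The abstract does not specify the mixing rule, so the function name `mix_weights`, the coefficient `beta`, and the dictionary-of-lists checkpoint format are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def mix_weights(real: StateDict, synthetic: StateDict, beta: float = 0.5) -> StateDict:
    """Interpolate two checkpoints: beta * real + (1 - beta) * synthetic.

    Assumption: linear interpolation stands in for the weight mix
    strategy; `beta` is an illustrative coefficient, not a value
    taken from the paper.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must share the same parameter names")

    mixed: StateDict = {}
    for name, w_real in real.items():
        w_syn = synthetic[name]
        if len(w_real) != len(w_syn):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter tensors.
        mixed[name] = [beta * a + (1.0 - beta) * b for a, b in zip(w_real, w_syn)]
    return mixed


if __name__ == "__main__":
    llm_real = {"layer.weight": [1.0, 3.0]}
    llm_syn = {"layer.weight": [3.0, 1.0]}
    print(mix_weights(llm_real, llm_syn, beta=0.5))
```

Under this assumption, `beta = 1.0` recovers the LLM trained on real-world data and `beta = 0.0` recovers the one trained on synthetic data; intermediate values trade off the two domains without any additional training.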