Abstract
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.