LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model

Abstract

In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its remarkable performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction, while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.
