MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Abstract

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLMs). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can yield numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, teaching users how to cook based on food photos, and so on. In our experiments, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
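
The alignment idea described above, a single trainable projection layer between a frozen visual encoder and a frozen LLM, can be illustrated with a toy sketch. This is not the released implementation: the choice of a linear map, the dimensions `D_VISION` and `D_LLM`, the helper names `make_projection` and `project`, and the order in which image and text tokens are joined are all assumptions made for illustration, and random vectors stand in for the outputs of the frozen models.

```python
import random

# Toy sizes (hypothetical): the real encoder and LLM dimensions are much larger.
D_VISION = 8   # width of the frozen visual encoder's output tokens
D_LLM = 16     # width of the frozen LLM's input embeddings


def make_projection(d_in, d_out, seed=0):
    """Create the only trainable parameters in this sketch: a weight matrix and a bias."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project(tokens, weight, bias):
    """Map each visual token (length d_in) to the LLM embedding width (length d_out)."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in tokens
    ]


rng = random.Random(1)

# Stand-ins for frozen model outputs: 4 visual tokens and 3 text-token embeddings.
visual_tokens = [[rng.gauss(0, 1) for _ in range(D_VISION)] for _ in range(4)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_LLM)] for _ in range(3)]

weight, bias = make_projection(D_VISION, D_LLM)
projected = project(visual_tokens, weight, bias)

# Assumed ordering: projected image tokens first, then the text prompt embeddings.
llm_input = projected + text_embeddings
```

In this sketch only `weight` and `bias` would be updated during training; the visual tokens and text embeddings come from models whose parameters stay fixed, which is what makes the alignment step lightweight.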
