Abstract
Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model to perform diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to distinguish each task instruction more easily and also improve the model's learning efficiency for each task. After three-stage training, experimental results show that MiniGPT-v2 achieves strong performance on many visual question answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and code are available at https://minigpt-v2.github.io/