MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Abstract

In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
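
To make the idea of predicting discrete text tokens and continuous visual tokens from one sequence more concrete, the following is a minimal sketch of a mixed per-position training objective. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss functions, so the use of cross-entropy for text positions and cosine distance for visual positions is assumed here, and the names `cross_entropy`, `cosine_distance`, and `mixed_sequence_loss` are hypothetical.

```python
import math


def cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def cosine_distance(pred, target):
    """1 - cosine similarity (ASSUMED regression loss for visual tokens)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def mixed_sequence_loss(predictions, targets):
    """Average loss over a sequence that interleaves text and visual targets.

    targets:     list of ("text", token_id) or ("visual", embedding) pairs
    predictions: vocabulary logits at text positions,
                 a predicted embedding at visual positions
    """
    total = 0.0
    for pred, (kind, tgt) in zip(predictions, targets):
        if kind == "text":
            total += cross_entropy(pred, tgt)    # discrete token
        elif kind == "visual":
            total += cosine_distance(pred, tgt)  # continuous token
        else:
            raise ValueError(f"unknown target kind: {kind!r}")
    return total / len(targets)


# Toy sequence: one text position (vocabulary of 2) and one visual position (2-d embedding).
predictions = [[0.0, 0.0], [1.0, 0.0]]
targets = [("text", 0), ("visual", [2.0, 0.0])]
loss = mixed_sequence_loss(predictions, targets)
```

In this toy sequence the text position contributes the cross-entropy of a uniform two-way prediction, while the visual position contributes zero because the predicted and target embeddings point in the same direction. In a real model the logits and embeddings would come from separate output heads on the same LLM backbone, which is again an assumption rather than something the abstract states.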
