Abstract
Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting: when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder.