LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Abstract

Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
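
The segment-level pairing behind IterDPO can be illustrated with a minimal sketch. This is not the released implementation: the fixed word-count segmentation, the `correct` callback (standing in for a human or model reviser), the choice to condition each segment on the previously corrected segments, and all names below are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    context: str   # instruction plus the already-corrected preceding segments
    chosen: str    # corrected segment (preferred)
    rejected: str  # original segment (dispreferred)


def split_into_segments(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words (assumed scheme)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_segment_pairs(
    instruction: str,
    output: str,
    correct: Callable[[str, str], str],  # hypothetical reviser: (context, segment) -> revised segment
    max_words: int = 500,
) -> List[PreferencePair]:
    """Correct a long output one segment at a time and collect preference pairs."""
    pairs: List[PreferencePair] = []
    corrected_prefix: List[str] = []
    for segment in split_into_segments(output, max_words):
        # Each correction sees the instruction and the corrected text so far.
        context = "\n\n".join([instruction] + corrected_prefix)
        revised = correct(context, segment)
        # Only segments that were actually changed yield a preference pair.
        if revised != segment:
            pairs.append(PreferencePair(context=context, chosen=revised, rejected=segment))
        corrected_prefix.append(revised)
    return pairs
```

Under these assumptions, one long output that needs several corrections yields several short preference pairs, so a reviewer only has to judge one segment at a time instead of the whole output.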
