DeepSeek-VL: Towards Real-World Vision-Language Understanding

Abstract

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.
