Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Abstract

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text prompts as task instructions and generate desirable results in text form, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
