Abstract
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative samples to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanding multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation of object hallucination. Code and data will be available at https://github.com/apple/ml-ferret.
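
As a rough illustration of the hybrid region representation described above, the following minimal sketch pairs discrete coordinates with a continuous feature for a free-form region. It is not the Ferret implementation: the names `HybridRegion`, `quantize`, `hybrid_region`, and `NUM_BINS` are hypothetical, the bin count is an assumed value, and plain mean pooling over the masked positions stands in for the spatial-aware visual sampler.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # assumed number of discrete coordinate bins per axis


@dataclass
class HybridRegion:
    coords: Tuple[int, int, int, int]  # quantized (x1, y1, x2, y2) of the region's bounding box
    feature: List[float]               # continuous feature pooled over the region


def quantize(value: float, size: int, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    return min(num_bins - 1, max(0, int(value / size * num_bins)))


def hybrid_region(
    mask: Sequence[Sequence[int]],
    feature_map: Sequence[Sequence[Sequence[float]]],
) -> HybridRegion:
    """Build a hybrid representation from a binary mask (H x W) and a
    feature map (H x W x C) defined on the same grid.

    A point is a mask with one active cell, a box is a filled rectangle,
    and a free-form shape is any other set of active cells.
    """
    height, width = len(mask), len(mask[0])
    points = [(x, y) for y in range(height) for x in range(width) if mask[y][x]]
    if not points:
        raise ValueError("region mask is empty")

    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Discrete part: quantized corners of the tightest enclosing box.
    coords = (
        quantize(min(xs), width),
        quantize(min(ys), height),
        quantize(max(xs) + 1, width),
        quantize(max(ys) + 1, height),
    )

    # Continuous part: mean pooling over masked cells (a stand-in, see above).
    channels = len(feature_map[0][0])
    feature = [
        sum(feature_map[y][x][c] for x, y in points) / len(points)
        for c in range(channels)
    ]
    return HybridRegion(coords, feature)
```

Because the region is passed as a mask, points, boxes, and free-form shapes all go through the same code path, which is the property the hybrid representation is meant to provide.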