Abstract
Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capabilities in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results on multimodal tasks such as image captioning, visual question answering, and visual grounding, their object detection capabilities exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional approach of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis of large multimodal models applied to object detection, revealing that their recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We also re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and code are available at https://github.com/360CVGroup/LMM-Det.