AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning

Abstract

Significant progress has been made in advancing large multimodal conversational models (LMMs), capitalizing on vast repositories of image-text data available online. Despite this progress, these models often encounter substantial domain gaps, hindering their ability to engage in complex conversations across new domains. Recent efforts have aimed to mitigate this issue, but they rely on domain-specific image-text data to curate instruction-tuning data. However, many domains, such as agriculture, lack such vision-language data. In this work, we propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain. We utilize diverse agricultural datasets spanning multiple domains, curate class-specific information, and employ large language models (LLMs) to construct an expert-tuning set, resulting in a 70k expert-tuning dataset called AgroInstruct. Subsequently, we expert-tune on this dataset to create AgroGPT, an efficient LMM that can hold complex agriculture-related conversations and provide useful insights. We also develop AgroEvals for evaluation and compare AgroGPT's performance with large open- and closed-source models. AgroGPT excels at identifying fine-grained agricultural concepts, can act as an agriculture expert, and provides helpful information for multimodal agriculture questions. The code, datasets, and models are available at https://github.com/awaisrauf/agroGPT.
