Improving Large Vision-Language Models' Understanding for Field Data

Abstract

Large Vision-Language Models (LVLMs) have shown impressive capabilities across a range of tasks that integrate visual and textual understanding, such as image captioning and visual question answering. These models are trained on large-scale image and video datasets paired with text, enabling them to bridge visual perception and natural language processing. However, their application to scientific domains, especially in interpreting the complex field data commonly used in the natural sciences, remains underexplored. In this work, we introduce FieldLVLM, a novel framework designed to improve large vision-language models' understanding of field data. FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning. The field-aware language generation strategy leverages a special-purpose machine learning pipeline to extract key physical features from field data, such as flow classification, Reynolds number, and vortex patterns. This information is then converted into structured textual descriptions that serve as a dataset. The data-compressed multimodal model tuning tunes LVLMs on these generated datasets, using a data compression strategy to reduce the complexity of field inputs and retain only the most informative values. This ensures compatibility with the model's language decoder and guides its learning more effectively. Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data. Our findings suggest that this approach opens up new possibilities for applying large vision-language models to scientific research, helping bridge the gap between large models and domain-specific discovery.
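
As a rough illustration of the data compression idea described above, the following minimal sketch reduces a 2D field to a coarser grid and serializes it as text that a language decoder could consume. The abstract does not specify the actual compression rule, so the selection criterion used here (keep the largest-magnitude value in each tile) and the names `compress_field` and `field_to_text` are assumptions made purely for illustration, not the FieldLVLM implementation.

```python
from typing import List

Field = List[List[float]]


def compress_field(field: Field, block: int) -> Field:
    """Downsample a 2D field tile by tile.

    Assumption (not from the paper): the "most informative" value in
    each block x block tile is taken to be the one with the largest
    magnitude.
    """
    if block < 1:
        raise ValueError("block must be >= 1")
    rows, cols = len(field), len(field[0])
    compressed: Field = []
    for r in range(0, rows, block):
        out_row = []
        for c in range(0, cols, block):
            # Collect the tile, clipping at the field boundary.
            tile = [
                field[i][j]
                for i in range(r, min(r + block, rows))
                for j in range(c, min(c + block, cols))
            ]
            out_row.append(max(tile, key=abs))
        compressed.append(out_row)
    return compressed


def field_to_text(field: Field, precision: int = 2) -> str:
    """Serialize a field as rows of fixed-precision numbers."""
    return "\n".join(
        " ".join(f"{value:.{precision}f}" for value in row) for row in field
    )


if __name__ == "__main__":
    # Hypothetical 4x4 field (e.g. vorticity samples), made up for the demo.
    demo = [
        [0.1, -0.9, 0.2, 0.3],
        [0.0, 0.4, -0.5, 0.1],
        [1.5, 0.2, 0.0, 0.0],
        [0.3, -0.1, 0.0, -2.0],
    ]
    print(field_to_text(compress_field(demo, block=2)))
```

With `block=2`, the 4x4 demo field shrinks to a 2x2 grid, so the textual input is a quarter of the original length while the strongest local values are preserved. Other criteria (block averages, learned selection, spectral truncation) would fit the same interface.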
