Abstract
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by the semantic narrowness of manual text annotations and the inherent noise in visual-semantic entity linkages. In this paper, we propose the Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLM reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we develop a cross-modal similarity verification mechanism that quantifies semantic consistency and filters out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKG construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models. Our code is published at https://github.com/Wings-Of-Disaster/VaLiK.
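
To make the idea of cross-modal similarity verification concrete, the following is a minimal illustrative sketch and not the VaLiK implementation. It assumes that an image and each generated description sentence have already been embedded into a shared vector space by some vision-language encoder, and that a sentence is kept when its cosine similarity to the image meets a threshold. The function names, the toy two-dimensional embeddings, and the threshold value of 0.5 are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_descriptions(
    image_embedding: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep only the sentences whose embedding is close enough to the image embedding."""
    return [
        sentence
        for sentence, embedding in candidates
        if cosine_similarity(image_embedding, embedding) >= threshold
    ]


# Toy embeddings standing in for the output of a real encoder (assumption).
image_vec = [1.0, 0.0]
candidate_sentences = [
    ("a dog running on grass", [0.9, 0.1]),   # close to the image vector
    ("a quarterly stock chart", [0.0, 1.0]),  # orthogonal, treated as noise
]
kept = filter_descriptions(image_vec, candidate_sentences)
```

In this sketch, only the surviving sentences would be passed on to graph construction; how the threshold is chosen and which encoder supplies the embeddings are left open here.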