Abstract
With the rapid development of artificial intelligence, multimodal learning has become an important research area. For intelligent agents, the state is a crucial modality that conveys precise information alongside common modalities such as images, videos, and language. Its importance has become especially clear with the broad adoption of reinforcement learning and multimodal large language models. Nevertheless, representation learning for the state modality still lags behind. To this end, we propose High-Fidelity Contrastive Language-State Pre-training (CLSP), a method that accurately encodes state information into general representations for both reinforcement learning and multimodal large language models. Specifically, we first design a classification-based pre-training task to train an encoder on coarse-grained information. Next, we construct data pairs of states and language descriptions and use the pre-trained encoder to initialize the CLSP encoder. We then apply contrastive learning to train the CLSP encoder to represent precise state information effectively. Additionally, we enhance the representation of numerical information using the Random Fourier Features (RFF) method for high-fidelity mapping. Extensive experiments demonstrate the superior precision and generalization capabilities of our representation, which achieves outstanding results in text-state retrieval, reinforcement learning navigation tasks, and multimodal large language model understanding.
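As a rough illustration of two ingredients named above, the following sketch shows a Random Fourier Features mapping for a scalar state value and a symmetric contrastive (InfoNCE-style) loss over paired embeddings. It is a minimal sketch under stated assumptions, not the CLSP implementation: the function names, the unit Gaussian bandwidth, the feature count of 64, the temperature of 0.07, and the use of state embeddings as stand-ins for text embeddings are all hypothetical choices made only for this example.

```python
import math
import random


def rff_encode(x, freqs):
    """Map a scalar x to [cos(w*x), sin(w*x)] features with unit L2 norm."""
    scale = 1.0 / math.sqrt(len(freqs))
    cos_part = [scale * math.cos(w * x) for w in freqs]
    sin_part = [scale * math.sin(w * x) for w in freqs]
    return cos_part + sin_part


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def mean_diag_log_softmax(logits):
    """Average log-probability assigned to the matched (diagonal) pair per row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += row[i] - lse
    return total / len(logits)


def symmetric_info_nce(state_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: state-to-text and text-to-state directions."""
    s = [l2_normalize(v) for v in state_emb]
    t = [l2_normalize(v) for v in text_emb]
    logits = [[dot(si, tj) / temperature for tj in t] for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    return -0.5 * (mean_diag_log_softmax(logits) + mean_diag_log_softmax(logits_t))


# Assumption: frequencies drawn from N(0, 1); the bandwidth is a free hyperparameter.
rng = random.Random(0)
freqs = [rng.gauss(0.0, 1.0) for _ in range(64)]

states = [0.10, 0.11, 5.0]
state_emb = [rff_encode(x, freqs) for x in states]

# Stand-in "text" embeddings; a real setup would use a language encoder here.
aligned_loss = symmetric_info_nce(state_emb, state_emb)
shuffled_loss = symmetric_info_nce(
    state_emb, [state_emb[2], state_emb[0], state_emb[1]]
)
```

In this toy setup, nearby numerical values such as 0.10 and 0.11 map to nearly identical feature vectors while a distant value such as 5.0 does not, and the loss is lower when each state is paired with its own embedding than when the pairs are shuffled.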