Abstract
Fueled by the wave of Large Language Models (LLMs), Large Visual-Language Models (LVLMs) have emerged as a pivotal advancement, bridging the gap between image and text. However, video remains challenging for LVLMs because of the complex relationship between language and spatial-temporal data structure. Recent Large Video-Language Models (LVidLMs) align features of static visual data, such as images, into the latent space of language features through general multi-modal tasks in order to sufficiently leverage the abilities of LLMs. In this paper, we explore a fine-grained alignment approach via object trajectories for different modalities across both spatial and temporal dimensions simultaneously. Thus, we propose a novel LVidLM with trajectory-guided Pixel-Temporal Alignment, dubbed PiTe, that exhibits promising applicable model properties. To achieve fine-grained video-language alignment, we curate a multi-modal pre-training dataset, PiTe-143k, which provides pixel-level moving trajectories for all individual objects that both appear in the video and are mentioned in the caption, generated by our automatic annotation pipeline. Meanwhile, PiTe demonstrates astounding capabilities on myriad video-related multi-modal tasks, beating the state-of-the-art methods by a large margin.