Abstract
Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings -- $\textit{perceptual updating}$ and $\textit{contingency awareness}$ -- and propose a new benchmark task, $\textbf{Temporally-Grounded Language Generation (TGLG)}$, to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, $\textbf{TRACE}$, to evaluate TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present $\textbf{Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI)}$, a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest -- highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data available $\href{https://github.com/yukw777/tglg}{here}$.