Abstract
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. Project page: https://github.com/FRENKIE-CHIANG/DanmakuTPPBench