Cross-Modal State-Space Graph Reasoning for Structured Summarization

Abstract

The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overhead and limited interpretability. In this paper, we propose a \textit{Cross-Modal State-Space Graph Reasoning} (\textbf{CSS-GR}) framework that integrates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches that rely on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.
