EndoGen: Conditional Autoregressive Endoscopic Video Generation

Abstract

Endoscopic video generation is crucial for advancing medical imaging and enhancing diagnostic capabilities. However, prior efforts in this field have either focused on static images, lacking the dynamic context required for practical applications, or have relied on unconditional generation that fails to provide meaningful references for clinicians. Therefore, in this paper, we propose the first conditional endoscopic video generation framework, namely EndoGen. Specifically, we build an autoregressive model with a tailored Spatiotemporal Grid-Frame Patterning (SGP) strategy. It reformulates the learning of generating multiple frames as a grid-based image generation pattern, which effectively capitalizes on the inherent global dependency modeling capabilities of autoregressive architectures. Furthermore, we propose a Semantic-Aware Token Masking (SAT) mechanism, which enhances the model's ability to produce rich and diverse content by selectively focusing on semantically meaningful regions during the generation process. Through extensive experiments, we demonstrate the effectiveness of our framework in generating high-quality, conditionally guided endoscopic content, and show that it improves performance on the downstream task of polyp segmentation. Code is released at https://www.github.com/CUHK-AIM-Group/EndoGen.
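
To make the grid-based reformulation behind SGP concrete, the following is a minimal sketch of tiling a short clip into a single grid image and recovering the frames from it. It is not taken from the EndoGen code: the function names `frames_to_grid` and `grid_to_frames`, the row-major frame order, and the 2x2 layout in the usage note are all assumptions made for illustration, since the abstract does not specify the grid layout.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile T frames of shape (T, H, W, C) into one (rows*H, cols*W, C) image.

    Frames are placed in row-major order (an assumption, not the paper's spec).
    """
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    grid = frames.reshape(rows, cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # -> (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)


def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Inverse of frames_to_grid: split a grid image back into (T, H, W, C)."""
    gh, gw, c = grid.shape
    if gh % rows or gw % cols:
        raise ValueError("grid size is not divisible by the requested layout")
    h, w = gh // rows, gw // cols
    frames = grid.reshape(rows, h, cols, w, c)
    frames = frames.transpose(0, 2, 1, 3, 4)  # -> (rows, cols, H, W, C)
    return frames.reshape(rows * cols, h, w, c)
```

For example, a four-frame clip with `rows=2, cols=2` becomes one image whose top-left cell is frame 0 and whose bottom-right cell is frame 3. An image-level autoregressive model that generates this grid token by token then attends across all frames at once, which is the global dependency the abstract refers to.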
