Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation

  • 2022-06-24 17:48:23
  • Yueming Jin, Yang Yu, Cheng Chen, Zixu Zhao, Pheng-Ann Heng, Danail Stoyanov

Abstract

Automatic surgical scene segmentation is fundamental for facilitating cognitive intelligence in the modern operating theatre. Previous works rely on conventional aggregation modules (e.g., dilated convolution, convolutional LSTM), which only make use of the local context. In this paper, we propose a novel framework, STswinCL, that explores the complementary intra- and inter-video relations to boost segmentation performance by progressively capturing the global context. We first develop a hierarchical Transformer to capture the intra-video relation, which includes richer spatial and temporal cues from neighboring pixels and previous frames. A joint space-time window shift scheme is proposed to efficiently aggregate these two cues into each pixel embedding. Then, we explore the inter-video relation via pixel-to-pixel contrastive learning, which structures the global embedding space well. A multi-source contrast training objective is developed to group the pixel embeddings across videos under ground-truth guidance, which is crucial for learning the global property of the whole data. We extensively validate our approach on two public surgical video benchmarks, the EndoVis18 Challenge and the CaDIS dataset. Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches. Code is available at https://github.com/YuemingJin/STswinCL.
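
To make the idea of grouping pixel embeddings with ground-truth guidance more concrete, the following is a minimal sketch of a generic supervised pixel-to-pixel contrastive loss (InfoNCE-style). It is not the multi-source objective from the paper, whose details the abstract does not give. The function name `pixel_contrast_loss`, the `temperature` value, and the toy inputs are all assumptions made for illustration.

```python
import math


def pixel_contrast_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over sampled pixel embeddings.

    embeddings: list of equal-length float vectors, one per sampled pixel
                (the pixels may come from different videos).
    labels:     list of ground-truth class ids, one per embedding.
    Hypothetical helper; not the authors' implementation.
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    total, n_anchors = 0.0, 0
    for i, anchor in enumerate(normed):
        # Temperature-scaled similarity of the anchor to every other pixel.
        logits = {
            j: sum(a * b for a, b in zip(anchor, other)) / temperature
            for j, other in enumerate(normed)
            if j != i
        }
        positives = [j for j in logits if labels[j] == labels[i]]
        if not positives:
            continue  # no same-class pixel to pull this anchor towards

        # Numerically stable log-sum-exp over all non-anchor pixels.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))

        # Mean negative log-probability of picking a same-class pixel.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        n_anchors += 1

    return total / n_anchors if n_anchors else 0.0


# Toy example: two tight clusters in a 2-D embedding space.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = pixel_contrast_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = pixel_contrast_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy setting the loss is lower when the class labels agree with the clusters in embedding space than when they cut across them, which is the behaviour a contrastive objective of this kind is meant to encourage.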

 
