MATIS: Masked-Attention Transformers for Surgical Instrument Segmentation

Abstract

We propose Masked-Attention Transformers for Surgical Instrument Segmentation (MATIS), a two-stage, fully transformer-based method that leverages modern pixel-wise attention mechanisms for instrument segmentation. MATIS exploits the instance-level nature of the task by employing a masked attention module that generates and classifies a set of fine instrument region proposals. Our method incorporates long-term video-level information through video transformers to improve temporal consistency and enhance mask classification. We validate our approach in the two standard public benchmarks, Endovis 2017 and Endovis 2018. Our experiments demonstrate that MATIS' per-frame baseline outperforms previous state-of-the-art methods and that including our temporal consistency module boosts our model's performance further.
