SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation

Abstract

Single-stage instance segmentation approaches have recently gained popularity due to their speed and simplicity, but are still lagging behind in accuracy, compared to two-stage methods. We propose a fast single-stage instance segmentation method, called SipMask, that preserves instance-specific spatial information by separating mask prediction of an instance to different sub-regions of a detected bounding-box. Our main contribution is a novel light-weight spatial preservation (SP) module that generates a separate set of spatial coefficients for each sub-region within a bounding-box, leading to improved mask predictions. It also enables accurate delineation of spatially adjacent instances. Further, we introduce a mask alignment weighting loss and a feature alignment scheme to better correlate mask prediction with object detection. On COCO test-dev, our SipMask outperforms the existing single-stage methods. Compared to the state-of-the-art single-stage TensorMask, SipMask obtains an absolute gain of 1.0% (mask AP), while providing a four-fold speedup. In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings, while operating at comparable speed on a Titan Xp. We also evaluate our SipMask for real-time video instance segmentation, achieving promising results on the YouTube-VIS dataset. The source code is available at https://github.com/JialeCao001/SipMask.
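
The idea of predicting a separate set of coefficients for each sub-region of a bounding-box can be illustrated with a small sketch. The abstract does not give the details, so everything here is an assumption made for illustration: the function name `assemble_mask`, the use of shared basis maps combined linearly (in the style of YOLACT), the 2 x 2 grid of sub-regions, and the sigmoid at the end. It is not the authors' implementation.

```python
import math


def assemble_mask(basis, coeffs, grid=(2, 2)):
    """Combine K basis maps into one mask using per-sub-region coefficients.

    basis:  list of K maps, each an H x W nested list (assumed to be
            already cropped to the detected bounding-box).
    coeffs: dict mapping the (row, col) index of a sub-region to a list
            of K weights (hypothetical layout).
    grid:   number of sub-regions along (height, width); 2 x 2 is an
            assumption, not a value taken from the abstract.
    """
    rows, cols = grid
    height, width = len(basis[0]), len(basis[0][0])
    mask = []
    for y in range(height):
        r = y * rows // height  # sub-region row that pixel y falls into
        out_row = []
        for x in range(width):
            c = x * cols // width  # sub-region column for pixel x
            weights = coeffs[(r, c)]
            # Linear combination of the basis maps with this sub-region's weights.
            logit = sum(w * b[y][x] for w, b in zip(weights, basis))
            out_row.append(1.0 / (1.0 + math.exp(-logit)))
        mask.append(out_row)
    return mask


# Toy example: one constant basis map on a 4 x 4 box. Only the top-left
# sub-region has a positive coefficient, so only that quadrant is "on".
basis = [[[1.0] * 4 for _ in range(4)]]
coeffs = {(0, 0): [4.0], (0, 1): [-4.0], (1, 0): [-4.0], (1, 1): [-4.0]}
mask = assemble_mask(basis, coeffs)
```

With a single set of coefficients for the whole box, the constant basis map above could only yield a uniform mask; giving each sub-region its own weights lets the same basis produce a spatially varying result, which is the property the abstract attributes to the SP module.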
