More than the Sum of Its Parts: Ensembling Backbone Networks for Few-Shot Segmentation

Abstract

Semantic segmentation is a key prerequisite to robust image understanding for applications in Artificial Intelligence and Robotics. Few-Shot Segmentation, in particular, concerns the extension and optimization of traditional segmentation methods in challenging conditions where limited training examples are available. A predominant approach in Few-Shot Segmentation is to rely on a single backbone for visual feature extraction. The choice of backbone is a deciding factor in overall performance. In this work, we investigate whether fusing features from different backbones can improve the ability of Few-Shot Segmentation models to capture richer visual features. To tackle this question, we propose and compare two ensembling techniques: Independent Voting and Feature Fusion. Among the available Few-Shot Segmentation methods, we implement the proposed ensembling techniques on PANet. The module dedicated to predicting segmentation masks from the backbone embeddings in PANet avoids trainable parameters, creating a controlled 'in vitro' setting for isolating the impact of different ensembling strategies. Leveraging the complementary strengths of different backbones, our approach outperforms the original single-backbone PANet across standard benchmarks, even in challenging one-shot learning scenarios. Specifically, it achieved a performance improvement of +7.37% on PASCAL-5ⁱ and of +10.68% on COCO-20ⁱ in the top-performing scenario, where three backbones are combined. These results, together with the qualitative inspection of the predicted subject masks, suggest that relying on multiple backbones in PANet leads to a more comprehensive feature representation, thus expediting the successful application of Few-Shot Segmentation methods in challenging, data-scarce environments.
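To make the distinction between the two ensembling techniques concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes that Independent Voting averages the per-backbone foreground probabilities, and that Feature Fusion concatenates per-pixel features across backbones before a single prototype-based head; the abstract does not specify either detail. The head follows the general PANet idea of masked average pooling followed by scaled cosine similarity to class prototypes, with no trainable parameters. All function names, the `scale` value, and the toy features are hypothetical.

```python
import math

def masked_average(features, mask):
    """Prototype: mean of the feature vectors whose mask entry is 1.
    Assumes at least one pixel is selected."""
    selected = [f for f, m in zip(features, mask) if m]
    return [sum(col) / len(selected) for col in zip(*selected)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def foreground_probs(support, mask, query, scale=20.0):
    """Parameter-free head: two-class softmax over scaled cosine
    similarities to the foreground and background prototypes.
    `scale` is an illustrative value, not taken from the source."""
    fg = masked_average(support, mask)
    bg = masked_average(support, [1 - m for m in mask])
    probs = []
    for q in query:
        s_fg, s_bg = scale * cosine(q, fg), scale * cosine(q, bg)
        probs.append(1.0 / (1.0 + math.exp(s_bg - s_fg)))
    return probs

def independent_voting(supports, mask, queries):
    """Assumed variant: run the head once per backbone, average the
    foreground probabilities, then threshold at 0.5."""
    per_backbone = [foreground_probs(s, mask, q)
                    for s, q in zip(supports, queries)]
    mean = [sum(p) / len(p) for p in zip(*per_backbone)]
    return [int(p > 0.5) for p in mean]

def feature_fusion(supports, mask, queries):
    """Assumed variant: concatenate each pixel's features across
    backbones, then run the head once on the fused features."""
    def fuse(feature_sets):
        return [sum(vectors, []) for vectors in zip(*feature_sets)]
    probs = foreground_probs(fuse(supports), mask, fuse(queries))
    return [int(p > 0.5) for p in probs]

# Toy one-shot episode: 4 support pixels, 2 query pixels, and two
# hypothetical backbones with feature dimensions 2 and 3.
mask = [1, 1, 0, 0]
support_a = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
support_b = [[1.0, 0.0, 0.0], [0.9, 0.0, 0.1],
             [0.0, 1.0, 0.0], [0.0, 1.0, 0.1]]
query_a = [[0.9, 0.1], [0.0, 1.0]]
query_b = [[1.0, 0.0, 0.1], [0.1, 1.0, 0.0]]

voted = independent_voting([support_a, support_b], mask,
                           [query_a, query_b])
fused = feature_fusion([support_a, support_b], mask,
                       [query_a, query_b])
print(voted, fused)
```

In this toy episode both strategies label the first query pixel as foreground and the second as background. On real features the two can disagree, since voting combines decisions made in separate embedding spaces while fusion computes a single similarity in the joint space.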
