Gaussian Masked Autoencoders

Abstract

This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. Whilereconstructive self-supervised learning frameworks such as MAE learns goodsemantic abstractions, it is not trained for explicit spatial awareness. Ourapproach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semanticabstractions and spatial understanding jointly. Like MAE, it reconstructs theimage end-to-end in the pixel space, but beyond MAE, it also introduces anintermediate, 3D Gaussian-based representation and renders images viasplatting. We show that GMAE can enable various zero-shot learning capabilitiesof spatial understanding (e.g., figure-ground segmentation, image layering,edge detection, etc.) while preserving the high-level semantics ofself-supervised representation quality from MAE. To our knowledge, we are thefirst to employ Gaussian primitives in an image representation learningframework beyond optimization-based single-scene reconstructions. We believeGMAE will inspire further research in this direction and contribute todeveloping next-generation techniques for modeling high-fidelity visual data.More details at https://brjathu.github.io/gmae

Quick Read (beta)

loading the full paper ...