Abstract
Perceiving visual scenes as objects and background, as humans do, Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed \textit{slots}. OCL's self-supervision, which reconstructs the input from these aggregated slots, struggles with complex object textures, so Vision Foundation Model (VFM) representations are used as both the aggregation input and the reconstruction target. However, existing methods leverage VFM representations in diverse ways and often fail to fully exploit their potential. In response, we propose a clean architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO), that unifies mainstream OCL methods. The key to our unification is simple yet effective: shared quantization of the same VFM representation as the reconstruction target. Through mathematical modeling and statistical verification, we further analyze why VFM representations facilitate OCL aggregation and how their shared quantization as reconstruction targets strengthens OCL supervision. Experiments show that, across different VFMs, aggregators and decoders, our VVO consistently outperforms baselines in object discovery and recognition, as well as in downstream visual prediction and reasoning. The implementation and model checkpoints are available at https://github.com/Genera1Z/VQ-VFM-OCL.
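
As a rough illustration of what quantizing a feature map into a reconstruction target can look like, the minimal sketch below replaces each feature vector with its nearest codebook entry. The function name `quantize`, the toy codebook and the feature values are assumptions made for illustration only; this is a generic nearest-neighbour vector quantization, not the VVO implementation, which is in the repository linked above.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (squared L2 distance)."""
    # features: (n, d) feature vectors; codebook: (k, d) code vectors (both hypothetical)
    diff = features[:, None, :] - codebook[None, :, :]  # (n, k, d) pairwise differences
    dist = (diff ** 2).sum(axis=-1)                     # (n, k) squared distances
    idx = dist.argmin(axis=1)                           # (n,) index of the nearest code
    return codebook[idx], idx

# Toy data: 4 one-hot codes in 4 dimensions, and 3 slightly shifted features.
codebook = np.eye(4)
features = codebook[[2, 0, 2]] + 0.05

quantized, idx = quantize(features, codebook)
print(idx.tolist())
```

In this toy setting, the quantized vectors would play the role of the reconstruction target, while the unquantized features would be what an aggregator consumes.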