Abstract
Object-centric (OC) representations, which model visual scenes as compositions of discrete objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have yet to be thoroughly validated empirically. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains, from language to computer vision, positioning them as a potential cornerstone of future research for a wide range of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, ultimately identifying a promising path to leverage the strengths of both paradigms. The extensiveness of our study, encompassing over 600 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.