Abstract
The visual world around us can be described as a structured set of objectsand their associated relations. An image of a room may be conjured given onlythe description of the underlying objects and their associated relations. Whilethere has been significant work on designing deep neural networks which maycompose individual objects together, less work has been done on composing theindividual relations between objects. A principal difficulty is that while theplacement of objects is mutually independent, their relations are entangled anddependent on each other. To circumvent this issue, existing works primarilycompose relations by utilizing a holistic encoder, in the form of text orgraphs. In this work, we instead propose to represent each relation as anunnormalized density (an energy-based model), enabling us to compose separaterelations in a factorized manner. We show that such a factorized decompositionallows the model to both generate and edit scenes that have multiple sets ofrelations more faithfully. We further show that decomposition enables our modelto effectively understand the underlying relational scene structure. Projectpage at: https://composevisualrelations.github.io/.