The 3D world limits the human body pose and the human body pose conveys information about the surrounding objects. Indeed, from a single image of a person placed in an indoor scene, we as humans are adept at resolving ambiguities of the human pose and room layout through our knowledge of the physical laws and prior perception of the plausible object and human poses. However, few computer vision models fully leverage this fact. In this work, we propose an end-to-end trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and the room layout, and reconstructs both human body and object meshes. By imposing a set of comprehensive and sophisticated losses on all aspects of the estimations, we show that our model outperforms existing human body mesh methods and indoor scene reconstruction methods. To the best of our knowledge, this is the first model that outputs both object and human predictions at the mesh level, and performs joint optimization on the scene and human poses.