Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth using Stochastic Grammars

Abstract

We propose a systematic learning-based approach to the generation of massive quantities of synthetic 3D scenes and arbitrary numbers of photorealistic 2D images thereof, with associated ground truth information, for the purposes of training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms. In particular, we devise a learning-based pipeline of algorithms capable of automatically generating and rendering a potentially infinite variety of indoor scenes by using a stochastic grammar, represented as an attributed Spatial And-Or Graph, in conjunction with state-of-the-art physics-based rendering. Our pipeline is capable of synthesizing scene layouts with high diversity, and it is configurable inasmuch as it enables the precise customization and control of important attributes of the generated scenes. It renders photorealistic RGB images of the generated scenes while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity, and material information (detailed to object parts), as well as environments (e.g., illuminations and camera viewpoints). We demonstrate the value of our synthesized dataset, by improving performance in certain machine-learning-based scene understanding tasks--depth and surface normal prediction, semantic segmentation, reconstruction, etc.--and by providing benchmarks for and diagnostics of trained models by modifying object attributes and scene properties in a controllable manner.
