Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling

  • 2019-08-01 06:01:19
  • Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, Zihan Zhou


Recently, there has been growing interest in developing learning-based methods to detect and utilize salient semi-global or global structures, such as junctions, lines, planes, cuboids, smooth surfaces, and all types of symmetries, for 3D scene modeling and understanding. However, the ground truth annotations are often obtained via human labor, which is particularly challenging and inefficient for such tasks due to the large number of 3D structure instances (e.g., line segments) and other factors such as viewpoints and occlusions. In this paper, we present a new synthetic dataset, Structured3D, with the aim of providing large-scale photo-realistic images with rich 3D structure annotations for a wide spectrum of structured 3D modeling tasks. We take advantage of the availability of millions of professional interior designs and automatically extract 3D structures from them. We generate high-quality images with an industry-leading rendering engine. We use our synthetic dataset in combination with real images to train deep neural networks for room layout estimation and demonstrate improved performance on benchmark datasets.



Jia Zheng¹*, Junfei Zhang²*, Jing Li¹*, Rui Tang², Shenghua Gao¹, Zihan Zhou³
¹ShanghaiTech University, ²KooLab, ³The Pennsylvania State University


Figure 1: The Structured3D dataset. From a large collection of house designs (a) created by professional designers, we automatically extract a variety of ground truth 3D structure annotations (b) and generate photo-realistic 2D images (c).
* Equal contribution

1 Introduction

Inferring 3D information from 2D sensory data such as images and videos has long been a central research topic in computer vision. The conventional approach to building 3D models of a scene typically relies on detecting, matching, and triangulating local image features (e.g., patches, superpixels, edges, and SIFT features). Although significant progress has been made over the past decades, these methods still suffer from some fundamental problems. In particular, local feature detection is sensitive to a large number of factors such as scene appearance (e.g., textureless areas and repetitive patterns), lighting conditions, and occlusions. Further, the noisy, point cloud-based 3D model often fails to meet the increasing demand for high-level 3D understanding in real-world applications.

Table 1: An overview of structured 3D scene datasets. n/a: The actual numbers are not explicitly given and are hard to estimate, because these datasets contain images downloaded from the Internet (LSUN Room Layout, PanoContext) or from multiple sources (LayoutNet, Realtor360). *: Dataset is unavailable online at the time of submission.
Datasets               #Scenes  #Rooms  #Frames       Annotated structure
PlaneRCNN [12]         -        -       100,000       planes
Wireframe [9]          -        -       5,462         wireframe (2D)
SUN Primitive [27]     -        -       785           cuboids, other primitives
LSUN Room Layout [33]  -        n/a     5,396         cuboid layout
PanoContext [31]       -        n/a     500 (pano)    cuboid layout
LayoutNet [34]         -        n/a     1,071 (pano)  cuboid layout
Realtor360* [29]       -        n/a     2,573 (pano)  Manhattan layout
Raster-to-Vector [14]  870      -       -             floorplan
Structured3D           3,500    21,835  196,515       “primitive + relationship”

When perceiving 3D scenes, humans are remarkably effective in using salient global structures such as lines, contours, planes, smooth surfaces, symmetries, and repetitive patterns. Thus, if a reconstruction algorithm can take advantage of such global information, it is natural to expect the algorithm to obtain more accurate results. Traditionally, however, it has been computationally challenging to reliably detect such global structures from noisy local image features. Recently, deep learning-based methods have shown promising results in detecting various forms of structure directly from the images, including lines [9], planes [15, 28, 12, 30], cuboids [7], floorplans [14, 13], room layouts [10, 34, 21], abstracted 3D shapes [22, 25], and smooth surfaces [8].

With the fast development of deep learning methods comes the need for large amounts of accurately annotated data. In order to train the proposed neural networks, most prior works collect their own sets of images and manually label the structure of interest in them. Such a strategy has several shortcomings. First, due to the tedious process of manually labeling and verifying all the structure instances (e.g., line segments) in each image, existing datasets typically have limited size and scene diversity, and the annotations may also contain errors. Second, since each study primarily focuses on one type of structure, none of these datasets has multiple types of structure labeled. As a result, existing methods are unable to exploit relations between different types of structure (e.g., lines and planes) as humans do for effective, efficient, and robust 3D reconstruction.

In this paper, we present a large synthetic dataset with rich annotations of 3D structure and photo-realistic 2D renderings of indoor man-made environments (Figure 1). At the core of our dataset design is a unified representation of 3D structure which enables us to efficiently capture multiple types of 3D structure in the scene. Specifically, the proposed representation considers any structure as a relationship among geometric primitives. For example, a “wireframe” structure encodes the incidence and intersection relationships between line segments, whereas a “cuboid” structure encodes the rotational and reflective symmetry relationships among its planar faces. With our “primitive + relationship” representation, one can easily derive the ground truth annotations for a wide variety of semi-global and global structures (e.g., lines, wireframes, planes, regular shapes, floorplans, and room layouts), and also exploit their relations in future data-driven approaches (e.g., the wireframe formed by intersecting planar surfaces in the scene).

To create a large-scale dataset with the aim of facilitating research on data-driven methods for structured 3D scene understanding, we leverage the availability of millions of professional interior designs and millions of production-level 3D object models, all of which come with fine geometric details and high-resolution textures (Figure 1(a)). We first use computer programs to automatically extract information about 3D structure from the original house design files. As shown in Figure 1(b), our dataset contains rich annotations of 3D room structure including a variety of geometric primitives and relationships. To further generate photo-realistic 2D images (Figure 1(c)), we utilize industry-leading rendering engines to model the lighting conditions. Currently, our dataset consists of more than 196k images of 21,835 rooms in 3,500 scenes (i.e., houses).

To showcase the usefulness and uniqueness of the proposed Structured3D dataset, we train deep networks for room layout estimation on a subset of the dataset. We show that the models first trained on our synthetic data and then fine-tuned on real data outperform the models trained on real data only. We also show good generalizability of the models trained on our synthetic data by directly applying them to real world images.

In summary, the main contributions of this paper are:

  • We introduce a unified “primitive + relationship” representation for 3D structure. This representation enables us to efficiently capture a wide variety of semi-global and global 3D structures, as well as their mutual relationships.

  • We create the Structured3D dataset, which contains rich ground truth 3D structure annotations of 21,835 rooms in 3,500 scenes, and more than 196k photo-realistic 2D renderings of the rooms.

  • We verify the usefulness of our dataset by using it to train deep networks for room layout estimation and demonstrating improved performance on benchmark datasets.

2 Related Work

Datasets. Table 1 summarizes existing datasets for structured 3D scene modeling. Additionally, [22, 25] provide datasets with structured representations of single objects. We show example annotations in these datasets in Figure 2. Note that ground truth annotations in most datasets are manually labeled. This is one main reason why all these datasets have limited size, i.e., contain no more than a few thousand images. The only exception is [12], which employs a multi-model fitting algorithm to automatically extract planes from 3D scans in the ScanNet dataset [6]. But such algorithms are sensitive to data noise and outliers, and thus introduce errors in the annotations (Figure 2(a)). Further, none of these datasets has more than one type of structure labeled, although different types of structure often have strong relations among them. For example, from the wireframe in Figure 2(b) humans can easily identify other types of structure such as planes and cuboids. Our new dataset sets out to bridge the gap between what is needed to train machine learning models to achieve human-level holistic 3D scene understanding and what is being offered by existing datasets.

(a) Plane [12] (b) Wireframe [9] (c) Cuboid [7]
(d) Room layout [33] (e) Floorplan [14]
(f) Abstracted 3D shape (wireframe [25] and cuboid [22])
Figure 2: Example annotations of structure in existing datasets. The reference number indicates the paper from which the illustration is taken.

Note that our dataset is very different from other popular large-scale 3D datasets, such as NYU v2 [18], SUN RGB-D [19], 2D-3D-S [3, 2], ScanNet [6], and Matterport3D [5], in which the ground truth 3D information is stored in the format of point clouds or meshes. These datasets lack ground truth annotations of semi-global or global structures. While it is theoretically possible to extract 3D structure by applying structure detection algorithms to the point clouds or meshes (e.g., extracting planes from ScanNet as done in [12]), the detection results are often noisy and may contain errors. In addition, for some types of structure like wireframes and room layouts, how to reliably detect them from raw sensor data remains an active research topic in computer vision.

In recent years, synthetic datasets have played an important role in successful training of deep neural networks. Notable examples for indoor scene understanding include SUNCG [20], SceneNet RGB-D [16], and InteriorNet [11]. These datasets exceed real datasets in terms of scene diversity and frame numbers. But just like their real counterparts, these datasets lack ground truth structure annotations. Another issue with some synthetic datasets is the degree of realism in both the 3D models and the 2D renderings. [32] shows that physically-based rendering could boost the performance of various indoor scene understanding tasks. To ensure the quality of our dataset, we make use of 3D room models created by professional designers and the state-of-the-art industrial rendering engines in this work.

Room layout estimation. Room layout estimation aims to reconstruct the enclosing structure of the indoor scene, consisting of walls, floor, and ceiling. Existing public datasets (e.g., PanoContext [31] and LayoutNet [34]) assume a simple cuboid-shape layout. PanoContext [31] collects about 500 panoramas from the SUN360 dataset [26], and LayoutNet [34] extends the layout annotations to include panoramas from 2D-3D-S [2]. Recently, Realtor360 [29] collects 2,500 indoor panoramas from SUN360 [26] and a real-estate database, and provides annotations of a more general Manhattan layout. We note that all room layouts in these real datasets are manually labeled by humans. Since the room structure may be occluded by furniture and other objects, the “ground truth” inferred by humans may not be consistent with the actual layout. In our dataset, all ground truth 3D annotations are automatically extracted from the original house design files.

(a) Primitives: junctions and lines (b) Primitives: planes (c) Relationships: R1 and R2
(d) Relationships: R3 (e) Relationships: R4 (f) Relationships: R5
Figure 3: The ground truth 3D structure annotations in our dataset are represented by primitives and relationships. (a): Junctions and lines. (b): Planes. We highlight the planes in a single room. (c): Plane-line and line-junction relationships. We highlight a junction, the three lines intersecting at the junction, and the planes intersecting at each of the lines. (d): Cuboids. We highlight one cuboid instance. (e): Manhattan world. We use different colors to denote planes aligned with different directions. (f): Semantic objects. We highlight a “room”, a “balcony”, and the “door” connecting them.

3 A Unified Representation of 3D Structure

The main goal of our dataset is to provide rich annotations of ground truth 3D structure. A naive way to do so is to generate and store different types of 3D annotations in the same format as existing works, like wireframes as in [9], planes as in [12], floorplans as in [14], and so on. But this leads to a lot of redundancy. For example, planar surfaces in man-made environments are often bounded by a number of line segments, which are part of the wireframe. Even worse, by representing wireframes and planes separately, the relationships between them are also lost.

In this paper, we present a unified representation of 3D structure in man-made environments, in order to minimize the redundancy in encoding multiple types of 3D structure, while preserving their mutual relationships. We show how most common types of structure previously studied in the literature (e.g., planes, cuboids, wireframes, room layouts, and floorplans) can be derived from our representation.

Our representation of structure is largely inspired by the early work of Witkin and Tenenbaum [24], which characterizes structure as “a shape, pattern, or configuration that replicates or continues with little or no change over an interval of space and time”. Accordingly, to describe any structure, we need to specify: (i) what pattern is continuing or replicating (e.g., a patch, an edge, or a texture descriptor), and (ii) the domain of its replication or continuation. In this paper, we call the former primitives and the latter relationships.

3.1 The “Primitive + Relationship” Representation

We now show how to describe a man-made environment using the “primitive + relationship” representation. For ease of exposition, we assume all objects in the scene can be modeled by piece-wise planar surfaces. But our representation can be easily extended to more general surfaces. An illustration of our representation is shown in Figure 3.

3.1.1 Primitives

Generally, a man-made environment consists of the following geometric primitives:

  • Planes $\mathbf{P}$: We model the scene as a collection of planar surfaces $\mathbf{P}=\{p_1,p_2,\ldots\}$ where each plane is described by its parameters $p=\{\mathbf{n},d\}$.

  • Lines $\mathbf{L}$: When two planes intersect in the 3D space, a line is created. We use $\mathbf{L}=\{l_1,l_2,\ldots\}$ to represent the set of all 3D lines in the scene.

  • Junction points $\mathbf{X}$: When two lines meet in the 3D space, a junction point is formed. We use $\mathbf{X}=\{x_1,x_2,\ldots\}$ to represent the set of all junction points.

3.1.2 Relationships

Next, we define some common types of relationships between the geometric primitives:

  • Plane-line relationships ($R_1$): We use a matrix $W_1$ to record all incidence and intersection relationships between planes in $\mathbf{P}$ and lines in $\mathbf{L}$. Specifically, the $(i,j)$-th entry of $W_1$ is 1 if $l_i$ is on $p_j$, and 0 otherwise. Note that two planes intersect at some line if and only if the corresponding entry in $W_1^T W_1$ is nonzero.

  • Line-point relationships ($R_2$): Similarly, we use a matrix $W_2$ to record all incidence and intersection relationships between lines in $\mathbf{L}$ and points in $\mathbf{X}$. Specifically, the $(m,n)$-th entry of $W_2$ is 1 if $x_m$ is on $l_n$, and 0 otherwise. Note that two lines intersect at some junction if and only if the corresponding entry in $W_2^T W_2$ is nonzero.

  • Cuboids ($R_3$): A cuboid is a special arrangement of plane primitives with rotational and reflection symmetry along the x-, y-, and z-axes. The corresponding symmetry group is the dihedral group $D_{2h}$.

  • Manhattan world ($R_4$): This is a special type of 3D structure commonly used for indoor and outdoor scene modeling. It can be viewed as a grouping relationship, in which all the plane primitives can be grouped into three classes, $\mathbf{P}_1$, $\mathbf{P}_2$, and $\mathbf{P}_3$, with $\mathbf{P}=\bigcup_{i=1}^{3}\mathbf{P}_i$. Further, each class is represented by a single normal vector $\mathbf{n}_i$, such that $\mathbf{n}_i^T\mathbf{n}_j=0$ for all $i\neq j$.

  • Semantic objects (R5): Semantic information is critical for many 3D computer vision tasks. It can be regarded as another type of grouping relationship, in which each semantic object instance corresponds to one or more primitives defined above. For example, each “wall”, “ceiling”, or “floor” instance is associated with one plane primitive; each “chair” instance is associated with a set of multiple plane primitives. Further, such a grouping is hierarchical. For example, we can further group one floor, one ceiling, and multiple walls to form a “living room” instance. And a “door” or a “window” is an opening which connects two rooms (or one room and the outer space).

Note that the relationships are not mutually exclusive, in the sense that a primitive can belong to multiple relationship instances of the same type or of different types. For example, a plane primitive can be shared by two cuboids, and at the same time belong to one of the three classes in the Manhattan world model.
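As an illustration of how R1 and R2 can be stored and queried, the following minimal Python sketch builds the two incidence matrices for a single room corner. The toy scene, the names, and the helper functions are our own assumptions for exposition and are not part of the dataset tooling.

```python
# Hypothetical toy scene: two walls and the floor meeting at one room corner.
planes = ["floor", "wall_a", "wall_b"]
lines = ["floor/wall_a", "floor/wall_b", "wall_a/wall_b"]
junctions = ["corner"]

# W1[i][j] = 1 if line i lies on plane j (relationship R1).
W1 = [
    [1, 1, 0],  # floor/wall_a
    [1, 0, 1],  # floor/wall_b
    [0, 1, 1],  # wall_a/wall_b
]
# W2[m][n] = 1 if junction m lies on line n (relationship R2).
W2 = [[1, 1, 1]]


def gram(W):
    """Return W^T W as a nested list."""
    rows, cols = len(W), len(W[0])
    return [[sum(W[k][a] * W[k][b] for k in range(rows)) for b in range(cols)]
            for a in range(cols)]


def adjacent_pairs(W):
    """Index pairs (a, b), a < b, whose off-diagonal entry of W^T W is nonzero."""
    G = gram(W)
    return [(a, b) for a in range(len(G)) for b in range(a + 1, len(G)) if G[a][b]]


def plane_boundary(W1, j):
    """Indices of the lines incident to plane j (column j of W1)."""
    return [i for i, row in enumerate(W1) if row[j]]


plane_pairs = adjacent_pairs(W1)  # planes that share a line
line_pairs = adjacent_pairs(W2)   # lines that share a junction
floor_lines = [lines[i] for i in plane_boundary(W1, 0)]
```

In this toy scene every pair of planes shares a line and every pair of lines shares the corner junction, so both queries return all three index pairs. The column lookup in `plane_boundary` is the same operation used to recover the boundary of a plane from R1.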

3.1.3 Discussion

The primitives and relationships we discussed above are just a few of the most common examples. They are by no means exhaustive. For example, our representation can be easily extended to include other primitives such as parametric surfaces. And besides cuboids, there are many other types of regular or symmetric shapes in man-made environments, where each type corresponds to a different symmetry group.

(a) (b)
Figure 4: Comparison of 3D house designs. (a): The 3D models in our database are created by professional designers using high-quality furniture models from world-leading manufacturers. Most designs are being used in real-world production. (b): The 3D models in SUNCG dataset [20] are created using Planner 5D [1], an online tool for amateur interior design.

3.2 Relation to Existing Models

Given our representation which contains primitives $\mathcal{P}=\{\mathbf{P},\mathbf{L},\mathbf{X}\}$ and relationships $\mathcal{R}=\{R_1,R_2,\ldots\}$, we show how several types of 3D structure commonly studied in the literature can be derived from it. We again refer readers to Figure 2 for illustrations of these structures.

Planes: A large volume of studies in the literature model the scene as a collection of 3D planes, where each plane is represented by its parameters and boundary. To generate such a model, we simply use the plane primitives $\mathbf{P}$. For each $p\in\mathbf{P}$, we further obtain its boundary by using matrix $W_1$ in $R_1$ to find all the lines in $\mathbf{L}$ that form an incidence relationship with $p$.

Wireframes: A wireframe consists of lines $\mathbf{L}$ and junction points $\mathbf{X}$, and their incidence and intersection relationships ($R_2$).

Cuboids: This model is the same as $R_3$.

Manhattan layouts: A Manhattan room layout model includes a “room” as defined in R5 which also satisfies the Manhattan world assumption (R4).

Floorplans: A floorplan is a 2D vector representation which consists of a set of line segments and semantic labels (e.g., room types). To obtain such a vector representation, we can identify all lines in $\mathbf{L}$ and junction points in $\mathbf{X}$ which lie on a “floor” (as defined in $R_5$). To further obtain the semantic room labels, we can project all “rooms”, “doors”, and “windows” (as defined in $R_5$) to this floor.

Abstracted 3D shapes: In addition to room structures, our representation can also be applied to individual 3D object models to create abstractions in the form of wireframes or cuboids, as described above.
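To make the floorplan derivation concrete, the sketch below selects the lines incident to a “floor” plane and projects their end junctions to 2D. The data layout, the identifiers, and the z-up axis convention are assumptions made for this example only; they do not describe the released annotation format.

```python
# Hypothetical miniature scene: a 4 m x 3 m room with a 2.8 m ceiling (z is up).
junctions = {  # junction id -> (x, y, z) in metres
    0: (0, 0, 0), 1: (4, 0, 0), 2: (4, 3, 0), 3: (0, 3, 0),
    4: (0, 0, 2.8), 5: (4, 0, 2.8), 6: (4, 3, 2.8), 7: (0, 3, 2.8),
}
lines = {  # line id -> (junction id, junction id)
    "f0": (0, 1), "f1": (1, 2), "f2": (2, 3), "f3": (3, 0),  # floor edges
    "c0": (4, 5), "c1": (5, 6), "c2": (6, 7), "c3": (7, 4),  # ceiling edges
    "v0": (0, 4), "v1": (1, 5), "v2": (2, 6), "v3": (3, 7),  # vertical edges
}
# R1 restricted to the floor plane, and R5 semantic labels (assumed layout).
lines_on_plane = {"floor": ["f0", "f1", "f2", "f3"]}
semantics = {"floor": {"type": "floor", "room": "living room"}}


def floorplan(plane_id):
    """Project the lines incident to a floor plane onto the x-y plane."""
    segments = []
    for line_id in lines_on_plane[plane_id]:
        a, b = lines[line_id]
        segments.append((junctions[a][:2], junctions[b][:2]))
    return {"room": semantics[plane_id]["room"], "segments": segments}


def polygon_area(segments):
    """Shoelace area of a closed polygon given as ordered 2D segments."""
    twice_area = sum(x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in segments)
    return abs(twice_area) / 2


plan = floorplan("floor")
area = polygon_area(plan["segments"])
```

The ceiling and vertical lines are never touched, which reflects the point of the unified representation: a floorplan is a filtered view of the same primitives rather than a separately stored annotation.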

4 The Structured3D Dataset

(a) Original (b) Simple configuration (c) Empty configuration
(d) Lighting (e) Depth (f) Semantic labels
Figure 5: Examples of our rendered panoramic images.

Our general, unified representation enables us to encode a rich set of geometric primitives and relationships for structured 3D modeling. With this representation, our ultimate goal is to build a dataset which can be used to train machines to achieve human-level understanding of the 3D environment.

As a first step towards this goal, in this section, we describe our on-going effort to create a large-scale dataset of indoor scenes which include (i) ground truth 3D structure annotations of the scene and (ii) realistic 2D renderings of the scene. Note that in this work we focus on extracting ground truth annotations on the room structure only. We plan to extend our dataset to include 3D structure annotations of individual furniture models in the future.

4.1 Extraction of Structured 3D Models

To extract a “primitive + relationship” representation of the 3D scene, we make use of a large database of over one million house designs hand-crafted by professional designers. An example design is shown in Figure 4(a). All information of the design is stored in an industry-standard format in the database so that specifications about the geometry (e.g., the precise length, width, and height of each wall), textures and materials, and functions (e.g., which room the wall belongs to) of all objects can be easily retrieved.

From the database, we have selected 3,500 house designs with 21,835 rooms. We created a computer program to automatically extract all the geometric primitives associated with the room structure, which consists of the ceiling, floor, walls, and openings (doors and windows). Given the precise measurements and associated information of these entities in the database, it is straightforward to generate all planes, lines, and junctions, as well as their relationships (R1 and R2).

Since the measurements are highly accurate and noise-free, other types of relationships such as cuboids (R3) and Manhattan world (R4) can also be easily obtained by clustering the primitives, followed by a geometric verification process. Finally, to include semantic information (R5) in our representation, we simply map the relevant labels provided by the professional designers to the geometric primitives in our representation. Figure 3 shows examples of the extracted geometric primitives and relationships.
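The clustering and verification step is not specified in further detail. One plausible reading for the Manhattan world relationship, given noise-free plane normals, is sketched below: normals are grouped by direction, and the grouping is accepted only if it yields three mutually orthogonal classes. The function names and the tolerance are assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two 3D vectors."""
    return sum(a * b for a, b in zip(u, v))


def group_by_direction(normals, tol=1e-6):
    """Greedily cluster unit normals that are parallel or anti-parallel."""
    classes = []  # list of (representative normal, [plane indices])
    for idx, n in enumerate(normals):
        for rep, members in classes:
            if abs(abs(dot(rep, n)) - 1.0) < tol:
                members.append(idx)
                break
        else:
            classes.append((n, [idx]))
    return classes


def is_manhattan(classes, tol=1e-6):
    """Geometric verification: exactly three mutually orthogonal classes."""
    if len(classes) != 3:
        return False
    reps = [rep for rep, _ in classes]
    return all(abs(dot(reps[i], reps[j])) < tol
               for i in range(3) for j in range(i + 1, 3))


# Hypothetical unit normals of the six planes of a box-shaped room.
normals = [(0, 0, 1), (0, 0, -1), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
classes = group_by_direction(normals)
manhattan = is_manhattan(classes)
```

Adding a slanted wall, for example one with normal `(0.6, 0.8, 0)`, creates a fourth class and the verification fails, which is the behavior one would expect for a non-Manhattan room.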

4.2 Photo-realistic 2D Rendering

We have developed a photo-realistic renderer on top of Embree [23], an open-source collection of ray-tracing kernels for x86 CPUs. Our renderer uses the well-known path tracing [17] method, a Monte Carlo approach to approximating realistic Global Illumination (GI) for rendering.

Each room is manually created by professional designers with over one million CAD models of furniture from world-leading manufacturers. These high-resolution furniture models are measured in real-world dimensions and are used in real production. A default lighting setup is also provided for each room. Figure 4 compares the 3D models in our database with those in the public SUNCG dataset [20], which are created using Planner 5D [1], an online tool for amateur interior design.

At the time of rendering, a panorama or pin-hole camera is placed at random locations not occupied by objects in the room. We use 1024×512 resolution for panoramas and 640×480 for perspective images. Figure 5 shows example panoramas rendered by our engine. For each room, we generate a few different configurations (full, simple, and empty) by removing some or all the furniture. We also modify the lighting setup to generate images with different temperature. Besides images, our dataset also includes the corresponding depth maps and semantic labels, as they may be useful either as inputs to machine learning algorithms or to help multi-task learning. Figure 6 further illustrates the degree of photo-realism of our dataset, where we compare the rendered images with photos of real decoration guided by the design.
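The text does not say how unoccupied camera locations are found. A simple possibility is rejection sampling over the floor area, sketched here with axis-aligned furniture footprints; the room size, the footprints, and the function names are hypothetical.

```python
import random


def inside(box, x, y):
    """True if (x, y) falls in an axis-aligned footprint (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax


def sample_camera_positions(room, furniture, count, rng, max_tries=10_000):
    """Rejection-sample floor positions inside `room` but outside all furniture."""
    positions = []
    for _ in range(max_tries):
        if len(positions) == count:
            break
        x = rng.uniform(room[0], room[2])
        y = rng.uniform(room[1], room[3])
        if not any(inside(box, x, y) for box in furniture):
            positions.append((x, y))
    return positions


room = (0.0, 0.0, 4.0, 3.0)                               # hypothetical 4 m x 3 m room
furniture = [(0.5, 0.5, 2.0, 1.5), (3.0, 2.0, 4.0, 3.0)]  # hypothetical footprints
cameras = sample_camera_positions(room, furniture, count=5, rng=random.Random(0))
```

A production renderer would test against the actual 3D geometry and would likely keep a margin from walls and objects; the sketch only shows the accept-or-reject structure.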

We would like to emphasize the potential of our dataset in terms of extension capabilities. As we mentioned before, the unified representation enables us to include many other types of structure in the dataset. As for 2D rendering, depending on the application, we can easily simulate different effects such as lighting conditions, fisheye and novel camera designs, motion blur, and imaging noise. Furthermore, the dataset may be extended to include videos for applications like floorplan reconstruction [13] and visual SLAM [4].

Figure 6: Photo-realistic rendering vs. real-world decoration. We encourage readers to guess which column corresponds to real-world decoration. The answer is in footnote 1.
Footnote 1: Right: real-world decoration.

5 Experiments

To demonstrate the benefits of our new dataset, we use it to train deep neural networks for room layout estimation, an important task in structured 3D modeling.

5.1 Experiment Setup

Real dataset. We use the same dataset as LayoutNet [34]. The dataset consists of images from PanoContext [31] and 2D-3D-S [2], including 818 training images, 79 validation images, and 166 test images. Note that both datasets only provide cuboid-shape layout annotations.

Our Structured3D dataset. In this experiment, we use a subset of panoramas with the original lighting and configuration. Each panorama corresponds to a different room in our dataset. We show statistics of different room layouts in our dataset in Table 2. Since the current real dataset only contains cuboid-shape layout annotations (i.e., 4 corners), we choose 12k panoramic images with a cuboid-shape layout from our dataset. We split the images into 10k for training, 1k for validation, and 1k for testing.

Evaluation metrics. Following [34, 21], we adopt three standard metrics: i) 3D IoU: intersection over union between the predicted 3D layout and the ground truth, ii) Corner Error (CE): normalized $\ell_2$ distance between the predicted corners and the ground truth, and iii) Pixel Error (PE): pixel-wise error between the predicted plane classes and the ground truth.
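The three metrics can be sketched as follows. This is a simplified illustration, not the evaluation code of [34, 21]: we assume that CE is normalized by the image diagonal (the text does not state the normalizer), that PE is the fraction of mislabeled pixels, and that layouts are axis-aligned cuboids so that 3D IoU reduces to box overlap.

```python
import math


def corner_error(pred, gt, width, height):
    """Mean corner distance as a percentage of the image diagonal (assumed normalizer)."""
    diag = math.hypot(width, height)
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return 100.0 * sum(dists) / (len(dists) * diag)


def pixel_error(pred_labels, gt_labels):
    """Percentage of pixels whose predicted plane class differs from the ground truth."""
    pairs = [(p, g) for prow, grow in zip(pred_labels, gt_labels)
             for p, g in zip(prow, grow)]
    return 100.0 * sum(p != g for p, g in pairs) / len(pairs)


def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def cuboid_iou(a, b):
    """3D IoU (%) of two axis-aligned boxes; real layouts need polygon intersection."""
    inter = 1.0
    for k in range(3):
        overlap = min(a[k + 3], b[k + 3]) - max(a[k], b[k])
        if overlap <= 0:
            return 0.0
        inter *= overlap
    return 100.0 * inter / (box_volume(a) + box_volume(b) - inter)
```

For general Manhattan or non-cuboid layouts, the intersection would have to be computed between floor polygons extruded to the room height, which this sketch does not attempt.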

Baselines. We choose two recent CNN-based approaches, LayoutNet [34] and HorizonNet [21], based on their performance and source code availability. LayoutNet uses a CNN to predict a corner probability map and a boundary map from the panorama and vanishing lines, then optimizes the layout parameters based on the network predictions. HorizonNet represents the room layout as three 1D vectors, i.e., the boundary positions of floor-wall and ceiling-wall, and the existence of wall-wall boundaries. It trains CNNs to directly predict the three 1D vectors. In this paper, we follow the default training setting of the respective methods and stop the training once the loss converges on the validation set.

Table 2: Room layout statistics.
# Corners     4      5   6     7   8     9   10+   Total
Realtor360    1246   0   950   0   316   0   61    2573
Structured3D  13743  52  3727  30  1575  17  2691  21835
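As a quick consistency check on Table 2, the short script below recomputes the totals and reports the share of each corner count. The numbers are copied from the table; the script itself is only an illustration.

```python
corner_bins = ["4", "5", "6", "7", "8", "9", "10+"]
counts = {
    "Realtor360": [1246, 0, 950, 0, 316, 0, 61],
    "Structured3D": [13743, 52, 3727, 30, 1575, 17, 2691],
}
totals = {name: sum(row) for name, row in counts.items()}

for name, row in counts.items():
    shares = ", ".join(f"{b}: {100 * c / totals[name]:.1f}%"
                       for b, c in zip(corner_bins, row))
    print(f"{name} (total {totals[name]}): {shares}")
```

The recomputed totals match the last column of the table, and the 4-corner bin of Structured3D is large enough to supply the 12k cuboid-shape panoramas used in the experiments.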

5.2 Experiment Results

We have conducted several sets of experiments to measure the usefulness of our synthetic dataset.

Table 3: Quantitative evaluation under different training schemes. The best and the second best results are boldfaced and underlined, respectively. *: The results are reported in the original papers of corresponding methods.
Methods          Configuration  PanoContext                   2D-3D-S
                                3D IoU (%)  CE (%)  PE (%)    3D IoU (%)  CE (%)  PE (%)
LayoutNet [34]   s              71.53       1.25    4.04      64.97       1.62    4.41
                 r              73.78       1.09    3.50      76.64       0.90    2.90
                 r*             75.12       1.02    3.18      77.51       0.92    2.42
                 s + r          77.32       0.90    2.81      77.99       0.90    2.77
HorizonNet [21]  s              76.33       1.04    3.12      72.00       1.09    3.78
                 r              82.87       0.73    2.06      83.26       0.64    2.07
                 r*             84.23       0.69    1.90      83.51       0.62    1.97
                 s + r          84.37       0.64    1.89      86.01       0.78    2.07

Impact of synthetic data. In this experiment, we train LayoutNet and HorizonNet in three different manners: i) training only on our synthetic dataset (“s”), ii) training only on the real dataset (“r”), and iii) pre-training on our synthetic dataset, then fine-tuning on the real dataset (“s + r”). We adopt the training set of LayoutNet as the real dataset in this experiment. The results are shown in Table 3, in which we also include the numbers reported in the original papers (“r*”). As one can see, the use of synthetic data for pre-training can boost the performance of both networks. We refer readers to supplementary materials for more qualitative results.
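To quantify the boost, the snippet below computes the 3D IoU gain of synthetic pre-training followed by real fine-tuning over training on real data only. All values are copied from Table 3 (our own "r" runs, not the "r*" numbers); the code is only a convenience for reading the table.

```python
# 3D IoU (%) from Table 3: (real data only, synthetic pre-training + real fine-tuning).
iou = {
    ("LayoutNet", "PanoContext"): (73.78, 77.32),
    ("LayoutNet", "2D-3D-S"): (76.64, 77.99),
    ("HorizonNet", "PanoContext"): (82.87, 84.37),
    ("HorizonNet", "2D-3D-S"): (83.26, 86.01),
}
gains = {key: round(pre - real, 2) for key, (real, pre) in iou.items()}

for (method, dataset), gain in gains.items():
    print(f"{method:<10} {dataset:<11} +{gain:.2f} points 3D IoU")
```

The gain is positive in all four cases, ranging from 1.35 to 3.54 points of 3D IoU.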

Table 4: Quantitative evaluation using varying synthetic data size in pre-training. The best and the second best results are boldfaced and underlined, respectively.
Methods          Synthetic data size  PanoContext                   2D-3D-S
                                      3D IoU (%)  CE (%)  PE (%)    3D IoU (%)  CE (%)  PE (%)
LayoutNet [34]   2.5k                 75.80       0.94    2.95      77.17       0.82    2.64
                 5k                   76.41       0.91    2.80      76.76       0.88    2.89
                 10k                  77.32       0.90    2.81      77.99       0.90    2.77
HorizonNet [21]  2.5k                 84.33       0.64    1.80      83.31       0.84    2.30
                 5k                   83.67       0.69    1.95      85.50       0.64    2.01
                 10k                  84.37       0.64    1.89      86.01       0.78    2.07

Performance vs. synthetic data size. We further study the relationship between the number of synthetic images used in pre-training and the accuracy on the real dataset. We sample 2.5k, 5k and 10k synthetic images for pre-training, then fine-tune the model on the real dataset. The results are shown in Table 4. As expected, using more synthetic data generally improves the performance.

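Table 5: The generalizability of synthetic and real datasets. Models are tested on the real dataset that was not used for training; PC denotes PanoContext.
Methods          Test set     Train set  3D IoU (%)  CE (%)  PE (%)
LayoutNet [34]   PanoContext  Ours       71.53       1.25    4.04
                 PanoContext  2D-3D-S    60.28       2.82    6.96
HorizonNet [21]  PanoContext  Ours       76.33       1.04    3.12
                 PanoContext  2D-3D-S    50.40       4.59    8.81
LayoutNet [34]   2D-3D-S      Ours       64.97       1.62    4.41
                 2D-3D-S      PC         65.46       1.61    5.46
HorizonNet [21]  2D-3D-S      Ours       72.00       1.09    3.78
                 2D-3D-S      PC         71.26       1.57    3.90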

Generalization to different domains. To compare the generalizability of the models trained on the synthetic dataset and the real dataset, we conduct experiments in two different configurations: i) training on our synthetic data, and ii) training on one real dataset. Then we test both models on the other real dataset. Note that the data used in LayoutNet is from two domains, i.e. PanoContext (PC) and 2D-3D-S. In this experiment, we use the two datasets separately.

As shown in Table 5, when tested on PanoContext, the model trained on our data significantly outperforms the one trained on 2D-3D-S. When tested on 2D-3D-S, the model trained on our data is competitive with or slightly better than the one trained on PanoContext. Note that our dataset and PanoContext both focus on residential scenes, whereas images in 2D-3D-S are taken from office areas.

Limitation of real datasets. Due to human errors, the annotation in real datasets is not always consistent with the actual room layout. In the left image of Figure 7, the room is a non-cuboid shape layout, but the ground truth layout is labeled as cuboid-shape. In the right image, the front wall is not labeled as ground truth. These examples illustrate the limitation of using real datasets as benchmarks. We avoid such errors in our dataset by automatically generating ground truth from the original design files.

Figure 7: Limitation of real datasets. Left: an example from the PanoContext dataset; right: an example from the 2D-3D-S dataset. The blue lines are the ground truth layout and the green lines are the predictions.

6 Conclusion

In this paper, we present Structured3D, a large synthetic dataset with rich ground truth 3D structure annotations and photo-realistic 2D renderings. We view this work as an important and exciting step towards building intelligent machines that can achieve human-level holistic 3D scene understanding: the unified “primitive+relationship” representation enables us to efficiently capture a wide variety of 3D structures and their relations, whereas the availability of millions of professional interior designs makes it possible to generate a virtually unlimited number of photo-realistic images and videos. In the future, we will continue to add more 3D structure annotations of the scenes and objects to the dataset, and explore novel ways to use the dataset to advance techniques for structured 3D modeling and understanding.


We would like to thank for providing the database of house designs and the rendering engine. This work was supported by NSFC #61502304. Zihan Zhou was supported by NSF award #1815491.


  • [1] Planner 5d.
  • [2] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2d-3d-semantic data for indoor scene understanding. CoRR, abs/1702.01105, 2017.
  • [3] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. K. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In CVPR, pages 1534–1543, 2016.
  • [4] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. D. Reid, and J. J. Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robotics, 32(6):1309–1332, 2016.
  • [5] A. X. Chang, A. Dai, T. A. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3d: Learning from RGB-D data in indoor environments. In 3DV, pages 667–676, 2017.
  • [6] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017.
  • [7] D. Dwibedi, T. Malisiewicz, V. Badrinarayanan, and A. Rabinovich. Deep cuboid detection: Beyond 2d bounding boxes. CoRR, abs/1611.10010, 2016.
  • [8] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. A papier-mâché approach to learning 3d surface generation. In CVPR, pages 216–224, 2018.
  • [9] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma. Learning to parse wireframes in images of man-made environments. In CVPR, pages 626–635, 2018.
  • [10] C. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. Roomnet: End-to-end room layout estimation. In ICCV, pages 4875–4884, 2017.
  • [11] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. In BMVC, page 77, 2018.
  • [12] C. Liu, K. Kim, J. Gu, Y. Furukawa, and J. Kautz. Planercnn: 3d plane detection and reconstruction from a single image. In CVPR, pages 4450–4459, 2019.
  • [13] C. Liu, J. Wu, and Y. Furukawa. Floornet: A unified framework for floorplan reconstruction from 3d scans. In ECCV, pages 203–219, 2018.
  • [14] C. Liu, J. Wu, P. Kohli, and Y. Furukawa. Raster-to-vector: Revisiting floorplan transformation. In ICCV, pages 2214–2222, 2017.
  • [15] C. Liu, J. Yang, D. Ceylan, E. Yumer, and Y. Furukawa. Planenet: Piece-wise planar reconstruction from a single rgb image. In CVPR, pages 2579–2588, 2018.
  • [16] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison. Scenenet RGB-D: can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In ICCV, pages 2697–2706, 2017.
  • [17] T. J. Purcell, I. Buck, W. R. Mark, and P. Hanrahan. Ray tracing on programmable graphics hardware. ACM Trans. Graph., 21(3):703–712, 2002.
  • [18] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, pages 746–760, 2012.
  • [19] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR, pages 567–576, 2015.
  • [20] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. A. Funkhouser. Semantic scene completion from a single depth image. In CVPR, pages 1746–1754, 2017.
  • [21] C. Sun, C.-W. Hsiao, M. Sun, and H.-T. Chen. Horizonnet: Learning room layout with 1d representation and pano stretch data augmentation. In CVPR, pages 1047–1056, 2019.
  • [22] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR, pages 2635–2643, 2017.
  • [23] I. Wald, S. Woop, C. Benthin, G. S. Johnson, and M. Ernst. Embree: a kernel framework for efficient CPU ray tracing. ACM Trans. Graph., 33(4):143:1–143:8, 2014.
  • [24] A. P. Witkin and J. M. Tenenbaum. On the role of structure in vision. In J. Beck, B. Hope, and A. Rosenfeld, editors, Human and Machine Vision, pages 481–543. Academic Press, 1983.
  • [25] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. 3d interpreter networks for viewer-centered wireframe modeling. IJCV, 126(9):1009–1026, 2018.
  • [26] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recognizing scene viewpoint using panoramic place representation. In CVPR, pages 2695–2702, 2012.
  • [27] J. Xiao, B. Russell, and A. Torralba. Localizing 3d cuboids in single-view images. In NeurIPS, pages 746–754, 2012.
  • [28] F. Yang and Z. Zhou. Recovering 3d planes from a single image via convolutional neural networks. In ECCV, pages 87–103, 2018.
  • [29] S. Yang, F. Wang, C. Peng, P. Wonka, M. Sun, and H. Chu. Dula-net: A dual-projection network for estimating room layouts from a single rgb panorama. In CVPR, pages 3363–3372, 2019.
  • [30] Z. Yu, J. Zheng, D. Lian, Z. Zhou, and S. Gao. Single-image piece-wise planar 3d reconstruction via associative embedding. In CVPR, pages 1029–1037, 2019.
  • [31] Y. Zhang, S. Song, P. Tan, and J. Xiao. Panocontext: A whole-room 3d context model for panoramic scene understanding. In ECCV, pages 668–686, 2014.
  • [32] Y. Zhang, S. Song, E. Yumer, M. Savva, J.-Y. Lee, H. Jin, and T. Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In CVPR, pages 5287–5295, 2017.
  • [33] Y. Zhang, F. Yu, S. Song, P. Xu, A. Seff, and J. Xiao. Large-scale scene understanding challenge: Room layout estimation. 2016.
  • [34] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. Layoutnet: Reconstructing the 3d room layout from a single RGB image. In CVPR, pages 2051–2059, 2018.