Abstract
Multi-task convolutional neural networks (CNNs) have shown impressive results for certain combinations of tasks, such as single-image depth estimation (SIDE) and semantic segmentation. This is achieved by pushing the network towards learning a robust representation that generalizes well to different atomic tasks. We extend this concept by adding auxiliary tasks, which are of minor relevance for the application, to the set of learned tasks. As a kind of additional regularization, they are expected to boost the performance of the ultimately desired main tasks. To study the proposed approach, we picked vision-based road scene understanding (RSU) as an exemplary application. Since multi-task learning requires specialized datasets, particularly when using extensive sets of tasks, we provide a multi-modal dataset for multi-task RSU, called synMT. More than $2.5 \cdot 10^5$ synthetic images, annotated with 21 different labels, were acquired from the video game Grand Theft Auto V (GTA V). Our proposed deep multi-task CNN architecture was trained on various combinations of tasks using synMT. The experiments confirmed that auxiliary tasks can indeed boost network performance, both in terms of final results and training time.