An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

Abstract

Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either perfect translation invariance or varying degrees of translation dependence, as required by the task. CoordConv solves the coordinate transform problem with perfect generalization and 150 times faster with 10–100 times fewer parameters than convolution. This stark contrast raises the question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will require further investigation, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST detection showed 24% better IOU when using CoordConv, and in the RL domain agents playing Atari games benefit significantly from the use of CoordConv layers.
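
The "extra coordinate channels" idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `add_coord_channels`, the channels-last layout, and the scaling of coordinates to [-1, 1] are assumptions made for illustration. The sketch only shows the input augmentation step; an ordinary convolution would then be applied to the augmented tensor.

```python
import numpy as np


def add_coord_channels(images):
    """Append row and column coordinate channels to a batch of images.

    Hypothetical helper for illustration only.
    images: array of shape (batch, height, width, channels).
    Returns a float32 array of shape (batch, height, width, channels + 2).
    """
    images = np.asarray(images, dtype=np.float32)
    batch, height, width, _ = images.shape

    # Assumption of this sketch: coordinates are scaled to [-1, 1].
    rows = np.linspace(-1.0, 1.0, height, dtype=np.float32)
    cols = np.linspace(-1.0, 1.0, width, dtype=np.float32)

    # row_grid[i, j] depends only on i; col_grid[i, j] depends only on j.
    row_grid, col_grid = np.meshgrid(rows, cols, indexing="ij")
    coords = np.stack([row_grid, col_grid], axis=-1)  # (height, width, 2)

    # Repeat the same coordinate grid for every item in the batch.
    coords = np.broadcast_to(coords, (batch, height, width, 2))
    return np.concatenate([images, coords], axis=-1)


batch = np.zeros((2, 4, 4, 1), dtype=np.float32)
augmented = add_coord_channels(batch)
print(augmented.shape)  # (2, 4, 4, 3)
```

Because the two added channels are identical for every input, a convolution applied afterwards can learn to ignore them (recovering translation invariance) or to use them (becoming position dependent), which matches the trade-off described in the abstract.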
