Abstract
There is growing evidence that converting targets to soft targets in supervised learning can provide considerable gains in performance. Much of this work has considered classification, converting hard zero-one values to soft labels, such as by adding label noise, incorporating label ambiguity, or using distillation. In parallel, there is some evidence from a regression setting in reinforcement learning that learning distributions can improve performance. In this work, we investigate the reasons for this improvement in a regression setting. We introduce a novel distributional regression loss, and similarly find that it significantly improves prediction accuracy. We investigate several common hypotheses, centered on reduced overfitting and improved representations. We instead find evidence for an alternative hypothesis: this loss is easier to optimize, with better-behaved gradients, resulting in improved generalization. We provide theoretical support for this alternative hypothesis by characterizing the norm of the gradients of this loss.