Multi-scale inference is commonly used to improve the results of semantic segmentation. Multiple image scales are passed through a network, and the results are then combined by averaging or max pooling. In this work, we present an attention-based approach to combining multi-scale predictions. We show that predictions at certain scales are better at resolving particular failure modes, and that the network learns to favor those scales in such cases in order to generate better predictions. Our attention mechanism is hierarchical, which makes it roughly 4x more memory-efficient to train than other recent approaches. In addition to enabling faster training, this allows us to train with larger crop sizes, which leads to greater model accuracy. We demonstrate the results of our method on two datasets: Cityscapes and Mapillary Vistas. For Cityscapes, which has a large number of weakly labelled images, we also leverage auto-labelling to improve generalization. Using our approach, we achieve new state-of-the-art results on both Mapillary (61.1 IoU val) and Cityscapes (85.1 IoU test).
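
To make the contrast between the combination strategies concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that the per-scale predictions have already been resized to a common resolution and flattened to per-pixel scores for a single class, and that the attention weights are given rather than predicted by a network. The function names and the exact gating formula (a chain of pairwise per-pixel gates from the finest scale to the coarsest) are illustrative assumptions; the abstract does not specify them.

```python
from typing import List, Sequence

# Per-pixel scores for one class, already resized to a common resolution
# (an assumption made for this sketch).
Prediction = List[float]


def average_combine(preds: Sequence[Prediction]) -> Prediction:
    """Baseline: per-pixel mean of the predictions across scales."""
    return [sum(px) / len(px) for px in zip(*preds)]


def max_combine(preds: Sequence[Prediction]) -> Prediction:
    """Baseline: per-pixel maximum of the predictions across scales."""
    return [max(px) for px in zip(*preds)]


def attention_combine(
    preds: Sequence[Prediction], attn: Sequence[Prediction]
) -> Prediction:
    """Hypothetical hierarchical combination with per-pixel gates.

    `preds` is ordered from the coarsest scale to the finest. `attn[i]`
    holds weights in [0, 1] that gate, pixel by pixel, between `preds[i]`
    and the combination of all finer scales, so N scales need N - 1
    attention maps.
    """
    if len(attn) != len(preds) - 1:
        raise ValueError("expected one attention map per adjacent scale pair")
    combined = list(preds[-1])  # start from the finest scale
    for pred, gate in zip(reversed(preds[:-1]), reversed(attn)):
        combined = [a * p + (1.0 - a) * c for a, p, c in zip(gate, pred, combined)]
    return combined


if __name__ == "__main__":
    low = [0.2, 0.8]     # prediction from the coarser scale
    high = [0.6, 0.4]    # prediction from the finer scale
    gate = [[1.0, 0.0]]  # trust the coarse scale at pixel 0, the fine scale at pixel 1
    print(attention_combine([low, high], gate))  # [0.2, 0.4]
```

Unlike averaging or max pooling, the gated combination can select a different scale at each pixel, which is the behavior described above for scale-specific failure modes.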