Recent years have seen a rapid growth in new approaches improving the accuracy of semantic segmentation in a weakly supervised setting, i.e. with only image-level labels available for training. However, this has come at the cost of increased model complexity and sophisticated multi-stage training procedures. This is in contrast to earlier work that used only a single stage (training one segmentation network on image labels), which was abandoned due to inferior segmentation accuracy. In this work, we first define three desirable properties of a weakly supervised method: local consistency, semantic fidelity, and completeness. Using these properties as guidelines, we then develop a segmentation-based network model and a self-supervised training scheme to train for semantic masks from image-level annotations in a single stage. We show that despite its simplicity, our method achieves results that are competitive with significantly more complex pipelines, substantially outperforming earlier single-stage methods.