Abstract
Many real-world vision problems suffer from inherent ambiguities. In clinicalapplications for example, it might not be clear from a CT scan alone whichparticular region is cancer tissue. Therefore a group of graders typicallyproduces a set of diverse but plausible segmentations. We consider the task oflearning a distribution over segmentations given an input. To this end wepropose a generative segmentation model based on a combination of a U-Net witha conditional variational autoencoder that is capable of efficiently producingan unlimited number of plausible hypotheses. We show on a lung abnormalitiessegmentation task and on a Cityscapes segmentation task that our modelreproduces the possible segmentation variants as well as the frequencies withwhich they occur, doing so significantly better than published approaches.These models could have a high impact in real-world applications, such as beingused as clinical decision-making algorithms accounting for multiple plausiblesemantic segmentation hypotheses to provide possible diagnoses and recommendfurther actions to resolve the present ambiguities.