Labeled datasets for semantic segmentation are imperfect, especially in medical imaging, where borders are often subtle or ill-defined. Little work has been done to analyze the effect that label errors have on the performance of segmentation methodologies. Here we present a large-scale study of model performance in the presence of varying types and degrees of error in training data. We trained U-Net, SegNet, and FCN32 several times for liver segmentation with 10 different modes of ground-truth perturbation. Our results show that, for each architecture, performance steadily declines with boundary-localized errors; however, U-Net was significantly more robust to jagged boundary errors than the other architectures. We also found that each architecture was very robust to non-boundary-localized errors, suggesting that boundary-localized errors are a fundamentally different and more challenging problem than random label errors in a classification setting.
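
To make the distinction between boundary-localized and non-boundary-localized label errors concrete, the following is a minimal illustrative sketch on a toy binary mask. It is not the perturbation procedure used in this study: the helper names (`dilate`, `flip_random`, `dice`, `square_mask`), the 4-neighbourhood dilation, the independent per-pixel flips, and the Dice score as an agreement measure are all assumptions chosen for illustration.

```python
import random


def square_mask(size=16, lo=4, hi=12):
    """Toy ground truth (hypothetical): a filled square on an empty background."""
    return [[1 if lo <= y < hi and lo <= x < hi else 0 for x in range(size)]
            for y in range(size)]


def dilate(mask):
    """Boundary-localized error (assumed form): grow the foreground by one
    pixel using a 4-neighbourhood, so only pixels next to the border change."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 1:
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        out[ny][nx] = 1
    return out


def flip_random(mask, rate, seed=0):
    """Non-boundary-localized error (assumed form): flip each pixel label
    independently with probability `rate`, anywhere in the image."""
    rng = random.Random(seed)
    return [[1 - v if rng.random() < rate else v for v in row] for row in mask]


def dice(a, b):
    """Dice overlap between two binary masks; 1.0 means identical foregrounds."""
    inter = sum(p & q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    total = sum(map(sum, a)) + sum(map(sum, b))
    return 2.0 * inter / total if total else 1.0


clean = square_mask()
grown = dilate(clean)                  # error concentrated at the border
noisy = flip_random(clean, rate=0.05)  # error scattered across the image

print("Dice, clean vs. dilated labels:", dice(clean, grown))
print("Dice, clean vs. randomly flipped labels:", dice(clean, noisy))
```

In this toy setting the dilated mask differs from the clean one only along the object border, whereas the randomly flipped mask differs at scattered locations unrelated to the border, which is the contrast the two families of error are meant to capture.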