Semi-Unsupervised Learning: Clustering and Classifying using Ultra-Sparse Labels

Abstract

In semi-supervised learning for classification, it is assumed that every ground truth class of data is present in the small labelled dataset. Many real-world sparsely-labelled datasets are plausibly not of this type. It could easily be the case that some classes of data are found only in the unlabelled dataset -- perhaps the labelling process was biased -- so we do not have any labelled examples to train on for some classes. We call this learning regime semi-unsupervised learning, an extreme case of semi-supervised learning, where some classes have no labelled exemplars in the training set. First, we outline the pitfalls associated with trying to apply deep generative model (DGM)-based semi-supervised learning algorithms to datasets of this type. We then show how a combination of clustering and semi-supervised learning, using DGMs, can be brought to bear on this problem. We study several different datasets, showing how one can still learn effectively when half of the ground truth classes are entirely unlabelled and the other half are sparsely labelled.
