A significant proportion of individuals' daily activities is experienced through digital devices. Smartphones in particular have become one of the preferred interfaces for content consumption and social interaction. Identifying the content embedded in frequently captured smartphone screenshots is thus a crucial prerequisite to studies of media behavior and health intervention planning that analyze activity interplay and content switching over time. Screenshot images can depict heterogeneous contents and applications, making the a priori definition of adequate taxonomies a cumbersome task, even for humans. Because the sensitive data captured on screens must be kept private, the annotation effort cannot be crowd-sourced, and the costs of manual annotation are therefore large. Thus, there is a need to examine the utility of unsupervised and semi-supervised methods for digital screenshot classification. This work examines the implications of applying clustering to large screenshot sets when only a limited number of labels is available. In this paper we develop a framework that combines K-Means clustering with Active Learning to make efficient use of labeled and unlabeled samples, with the goal of discovering latent classes and describing a large collection of screenshot data. We test whether SVM-embedded or XGBoost-embedded solutions for class probability propagation yield better-formed cluster configurations. Visual and textual vector representations of the screenshot images are derived and combined to assess the relative contribution of multi-modal features to the overall performance.
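To make the general idea of combining clustering with a small labeling budget concrete, the following is a minimal sketch and not the framework developed in this paper. It assumes toy 2-D points standing in for the combined visual and textual feature vectors, hypothetical class names ("messaging", "video"), and a hypothetical `oracle` function standing in for the human annotator. In place of SVM- or XGBoost-based class probability propagation, it simply copies the label of one queried representative per cluster to all members of that cluster.

```python
import math

def nearest(p, centroids):
    """Index of the centroid closest to point p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-Means with a deterministic, evenly spaced initialisation."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        assign = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]

def cluster_then_query(points, k, oracle):
    """Query one representative per cluster and propagate its label."""
    centroids, assign = kmeans(points, k)
    queried, cluster_label = [], {}
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        # Representative = the member closest to the cluster centroid.
        rep = min(members, key=lambda i: math.dist(points[i], centroids[c]))
        queried.append(rep)
        cluster_label[c] = oracle(rep)  # the only place a human label is used
    predicted = [cluster_label[a] for a in assign]
    return predicted, queried

# Toy data (assumption): two well-separated groups of 25 points each.
grid = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
points = grid + [(x + 10.0, y + 10.0) for x, y in grid]
truth = ["messaging"] * len(grid) + ["video"] * len(grid)

predicted, queried = cluster_then_query(points, k=2, oracle=lambda i: truth[i])
accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
print(f"labels requested: {len(queried)}, accuracy: {accuracy:.2f}")
```

On this toy data only two labels are requested for fifty samples. Real screenshot features are far less cleanly separated, which is why a classifier-based propagation step and iterative querying, as investigated here, would be needed in practice.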