The Impact of Balancing Real and Synthetic Data on Accuracy and Fairness in Face Recognition

Abstract

In recent years, advances in deep face recognition have fueled an increasing demand for large and diverse datasets. Nevertheless, the authentic data acquired to create those datasets is typically sourced from the web, which, in many cases, can lead to significant privacy issues due to the lack of explicit user consent. Furthermore, obtaining a large, demographically balanced dataset is even more difficult because of the natural imbalance in the distribution of images from different demographic groups. In this paper, we investigate the impact of demographically balanced authentic and synthetic data, both individually and in combination, on the accuracy and fairness of face recognition models. Initially, several generative methods were used to balance the demographic representations of the corresponding synthetic datasets. Then a state-of-the-art face encoder was trained and evaluated using (combinations of) synthetic and authentic images. Our findings emphasized two main points: (i) the increased effectiveness of training data generated by diffusion-based models in enhancing accuracy, whether used alone or combined with subsets of authentic data, and (ii) the minimal impact of incorporating balanced data from pre-trained generative methods on fairness (in nearly all tested scenarios using combined datasets, fairness scores remained either unchanged or worsened, even when compared to unbalanced authentic datasets). Source code and data are available at https://cutt.ly/AeQy1K5G for reproducibility.
