Dataset Distillation: A Comprehensive Review

Abstract

The recent success of deep learning can be largely attributed to the huge amount of data used for training deep neural networks. However, the sheer amount of data significantly increases the burden on storage and transmission. Training models on such large datasets also consumes considerable time and computational resources. Moreover, directly publishing raw data inevitably raises concerns about privacy and copyright. To address these issues, dataset distillation (DD), also known as dataset condensation (DC), has become a popular research topic in recent years. Given an original large dataset, DD aims to derive a much smaller dataset containing several synthetic samples, such that models trained on the synthetic dataset achieve performance comparable to that of models trained on the original real one. This paper presents a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Then, we provide a systematic taxonomy of current methodologies in this area and discuss their theoretical relationships. We also point out current challenges in DD through extensive experiments and envision possible directions for future work.
