Leveraging Multi-Modal Information to Enhance Dataset Distillation

Abstract

Dataset distillation aims to create a compact and highly representative synthetic dataset that preserves the knowledge of a larger real dataset. While existing methods primarily focus on optimizing visual representations, incorporating additional modalities and refining object-level information can significantly improve the quality of distilled datasets. In this work, we introduce two key enhancements to dataset distillation: caption-guided supervision and object-centric masking. To integrate textual information, we propose two strategies for leveraging caption features: feature concatenation, where caption embeddings are fused with visual features at the classification stage, and caption matching, which introduces a caption-based alignment loss during training to ensure semantic coherence between real and synthetic data. Additionally, we apply segmentation masks to isolate target objects and remove background distractions, introducing two loss functions designed for object-centric learning: a masked feature alignment loss and a masked gradient matching loss. Comprehensive evaluations demonstrate that integrating caption-based guidance and object-centric masking enhances dataset distillation, leading to synthetic datasets that achieve superior performance on downstream tasks.
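
To make the components named in the abstract concrete, the following is a minimal NumPy sketch of feature concatenation, a caption matching loss, and a masked feature alignment loss. The abstract does not give the exact formulations, so everything here is an assumption for illustration: the function names are hypothetical, the two losses are assumed to be squared distances between batch-mean embeddings, and masking is assumed to be a masked average pool over spatial positions. The masked gradient matching loss is not sketched, since it depends on the training model.

```python
import numpy as np


def concat_features(visual, caption):
    """Fuse per-sample visual and caption embeddings by concatenation.

    visual: (N, Dv), caption: (N, Dc) -> (N, Dv + Dc), fed to a classifier.
    """
    return np.concatenate([visual, caption], axis=1)


def caption_matching_loss(real_caption, syn_caption):
    """Assumed form: squared distance between the mean caption embeddings
    of a real batch and a synthetic batch."""
    diff = real_caption.mean(axis=0) - syn_caption.mean(axis=0)
    return float(np.sum(diff ** 2))


def masked_pool(feat, mask):
    """Average features over the object region only.

    feat: (N, C, H, W) feature maps, mask: (N, 1, H, W) with values in [0, 1].
    Returns (N, C).
    """
    weighted = (feat * mask).sum(axis=(2, 3))           # (N, C)
    area = np.clip(mask.sum(axis=(2, 3)), 1e-8, None)   # (N, 1), avoids 0-division
    return weighted / area


def masked_feature_alignment_loss(real_feat, syn_feat, real_mask, syn_mask):
    """Assumed form: squared distance between batch means of
    object-pooled features from real and synthetic images."""
    diff = (masked_pool(real_feat, real_mask).mean(axis=0)
            - masked_pool(syn_feat, syn_mask).mean(axis=0))
    return float(np.sum(diff ** 2))
```

In this sketch, background pixels contribute nothing to the alignment term because they are zeroed by the mask before pooling, which is one simple way to realize the object-centric idea described above.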
