Abstract
Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates: data pairs which are semantically similar, but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time. Moreover, performance increases out of distribution. Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains. SemDeDup provides an example of how simple ways of leveraging quality embeddings can be used to make models learn faster with less data.
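To make the idea of a semantic duplicate concrete, the following is a minimal sketch of embedding-based deduplication. It is an illustration under stated assumptions, not necessarily the procedure used by SemDeDup: the greedy keep-or-drop rule, the cosine-similarity criterion, the function names `cosine_similarity` and `semantic_dedup`, the `threshold` value, and the toy two-dimensional embeddings are all hypothetical. In practice the embeddings would come from a pre-trained model, and comparing every pair would be too costly at web scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings, threshold=0.95):
    """Greedy sketch (an assumption, not the paper's stated algorithm).

    Keeps an item only if its cosine similarity to every previously
    kept item is below `threshold`. Returns the indices of kept items.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical toy embeddings: items 0 and 1 point in nearly the same
# direction (a "semantic duplicate" pair), item 2 is unrelated.
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
]

kept_indices = semantic_dedup(toy_embeddings, threshold=0.95)
print(kept_indices)
```

With these toy vectors, item 1 is dropped as a near-duplicate of item 0, while item 2 is kept. The sketch compares each item against all kept items, which is quadratic in the worst case, so a scalable variant would need some way of restricting comparisons to nearby items.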