Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models

Abstract

Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets using backdoor techniques. These watermarks remain inactive on benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing models to circumvent watermarks even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models exhibit faster convergence on watermarked samples during fine-tuning, as evidenced by deviations in their intermediate features. Leveraging this, CEAT2I can reliably detect the watermarked samples. Then, we iteratively ablate tokens from the prompts of detected watermarked samples and monitor shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.
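
The trigger identification stage described above (ablate one token at a time and watch how far the intermediate features move) can be illustrated with a toy sketch. The abstract does not specify which layers, which distance measure, or which stopping rule the method uses, so everything here is an assumption: `toy_feature_fn` is a hypothetical stand-in for a model's intermediate features, `zqx` is a made-up trigger token, and the L2 distance is chosen only for simplicity.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_tokens_by_ablation(
    tokens: Sequence[str],
    feature_fn: Callable[[Sequence[str]], Vector],
) -> List[Tuple[str, float]]:
    """Remove each token in turn and measure the resulting feature shift.

    Returns (token, shift) pairs sorted by shift, largest first. A token
    whose removal moves the features the most is a trigger candidate.
    """
    base = feature_fn(tokens)
    shifts = []
    for i, tok in enumerate(tokens):
        ablated = list(tokens[:i]) + list(tokens[i + 1:])
        shifts.append((tok, l2_distance(base, feature_fn(ablated))))
    return sorted(shifts, key=lambda pair: pair[1], reverse=True)


def toy_feature_fn(tokens: Sequence[str]) -> Vector:
    """Hypothetical feature extractor (assumption, not a real model).

    Benign tokens nudge the first coordinate slightly; the made-up
    trigger token "zqx" causes a large jump in the second coordinate.
    """
    benign = 0.1 * len(tokens)
    trigger = 5.0 if "zqx" in tokens else 0.0
    return [benign, trigger]


prompt = ["a", "photo", "of", "zqx", "dog"]
ranking = rank_tokens_by_ablation(prompt, toy_feature_fn)
for token, shift in ranking:
    print(f"{token}: {shift:.3f}")
```

In a real setting, `feature_fn` would have to be replaced by features read from the fine-tuned diffusion model, and a threshold or iterative procedure would be needed to decide which tokens count as triggers; this sketch only ranks tokens by a single-pass ablation.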
