FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset

Abstract

With the significant advances in deep learning techniques for generating forged video and audio, commonly known as deepfakes, their misuse has become a well-known problem. More recently, a new problem has emerged: generating a cloned or synthesized version of a person's voice. AI-based deep learning models can synthesize any person's voice from just a few seconds of audio. Given the emerging threat of impersonation attacks using deepfake video and audio, new deepfake detectors are needed that focus on both video and audio. Detecting deepfakes is a challenging task, and researchers have made numerous attempts and proposed several deepfake detection methods. Developing a good deepfake detector requires a large amount of good-quality data that captures real-world scenarios. Many researchers have contributed to this cause and provided several deepfake datasets, both self-generated and in-the-wild. However, almost all of these datasets contain either deepfake videos or deepfake audio, but not both. Moreover, the recent deepfake datasets proposed by researchers have racial bias issues. Hence, there is a crucial need for a good dataset that covers both video and audio deepfakes. To fill this gap, we propose a novel Audio-Video Deepfake dataset (FakeAVCeleb) that contains not only deepfake videos but also the corresponding synthesized, cloned audio. We generated our dataset using the most popular recent deepfake generation methods, and the videos and audio are perfectly lip-synced with each other. To generate a more realistic dataset and to counter the racial bias issue, we selected real YouTube videos of celebrities from four racial backgrounds (Caucasian, Black, East Asian, and South Asian). Lastly, we propose a novel multimodal detection method that detects deepfake videos and audio based on our multimodal Audio-Video deepfake dataset.
