Abstract
Adding explanations to audio deepfake detection (ADD) models will boost their real-world application by providing insight into the decision-making process. In this paper, we propose a relevancy-based explainable AI (XAI) method to analyze the predictions of transformer-based ADD models. We compare against standard Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics as well as a partial spoof test, to comprehensively analyze the relative importance of different temporal regions in an audio signal. We consider large datasets, unlike previous works where only limited utterances are studied, and find that the XAI methods differ in their explanations. The proposed relevancy-based XAI method performs the best overall on a variety of metrics. Further investigation of the relative importance of speech/non-speech, phonetic content, and voice onsets/offsets suggests that the XAI results obtained from analyzing limited utterances do not necessarily hold when evaluated on large datasets.