Evaluating Large Language Models for Phishing Detection, Self-Consistency, Faithfulness, and Explainability

  • 2025-06-16 18:54:28
  • Shova Kuikel, Aritran Piplai, Palvi Aggarwal

Abstract

Phishing attacks remain one of the most prevalent and persistent cybersecurity threats, with attackers continuously evolving and intensifying their tactics to evade general detection systems. Despite significant advances in artificial intelligence and machine learning, faithfully reproducing the interpretable reasoning that underpins phishing judgments, together with classification and explainability, remains challenging. Owing to recent advances in Natural Language Processing, Large Language Models (LLMs) show promising potential for improving domain-specific phishing classification tasks. However, enhancing the reliability and robustness of classification models requires not only accurate predictions from LLMs but also consistent and trustworthy explanations that align with those predictions. Therefore, a key question remains: can LLMs not only classify phishing emails accurately but also generate explanations that are reliably aligned with their predictions and internally self-consistent? To answer this question, we fine-tuned transformer-based models, including BERT, Llama models, and Wizard, to improve domain relevance and tailor them to phishing-specific distinctions, using Binary Sequence Classification, Contrastive Learning (CL), and Direct Preference Optimization (DPO). We then examined their performance in phishing classification and explainability by applying the ConsistenCy measure based on SHAPley values (CC SHAP), which measures prediction-explanation token alignment, to test each model's internal faithfulness and consistency and to uncover the rationale behind its predictions and reasoning. Overall, our findings show that Llama models exhibit stronger prediction-explanation token alignment, with higher CC SHAP scores, despite lacking reliable decision-making accuracy, whereas Wizard achieves better prediction accuracy but lower CC SHAP scores.
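
The abstract describes CC SHAP only as a measure of prediction-explanation token alignment. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes that signed per-token attribution values (for example, SHAP values over the input email tokens) have already been computed once for the model's prediction and once for its explanation, and it assumes that alignment is scored as the cosine similarity of the two normalized attribution vectors. The function names and the toy numbers are hypothetical.

```python
import math


def contribution_ratios(values):
    """Scale signed per-token attributions so their absolute values sum to 1."""
    total = sum(abs(v) for v in values)
    if total == 0:
        raise ValueError("all attributions are zero; ratios are undefined")
    return [v / total for v in values]


def alignment_score(pred_attr, expl_attr):
    """Cosine similarity between normalized prediction and explanation attributions.

    Assumption: both lists are indexed by the same input tokens.
    Returns a value in [-1, 1]; higher means the same tokens drive both outputs.
    """
    if len(pred_attr) != len(expl_attr):
        raise ValueError("attribution vectors must cover the same tokens")
    p = contribution_ratios(pred_attr)
    e = contribution_ratios(expl_attr)
    dot = sum(a * b for a, b in zip(p, e))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in e))
    return dot / norm


# Hypothetical attributions for the tokens of a short email snippet.
tokens = ["verify", "your", "account", "now"]
pred_attr = [0.40, 0.05, 0.30, 0.25]   # made-up values for the classification
expl_attr = [0.35, 0.10, 0.20, -0.05]  # made-up values for the explanation
print(alignment_score(pred_attr, expl_attr))
```

Under these assumptions, a score near 1 indicates that the prediction and the explanation rely on the same input tokens, while a score near -1 indicates that tokens pushing the prediction one way push the explanation the other way.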

 
