Abstract
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.