Abstract
Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's 'brain' behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.