Abstract
The increasing depth of parametric domain knowledge in large language models (LLMs) is fueling their rapid deployment in real-world applications. Understanding model vulnerabilities in high-stakes and knowledge-intensive tasks is essential for quantifying the trustworthiness of model predictions and regulating their use. The recent discovery of named entities as adversarial examples (i.e., adversarial entities) in natural language processing tasks raises questions about their potential impact on the knowledge robustness of pre-trained and finetuned LLMs in high-stakes and specialized domains. We examined the use of type-consistent entity substitution as a template for collecting adversarial entities for billion-parameter LLMs with biomedical knowledge. To this end, we developed an embedding-space attack based on power-scaled distance-weighted sampling to assess the robustness of their biomedical knowledge with a low query budget and controllable coverage. Our method has favorable query efficiency and scaling over alternative approaches based on random sampling and black-box gradient-guided search, which we demonstrated for adversarial distractor generation in biomedical question answering. Subsequent failure mode analysis uncovered two regimes of adversarial entities on the attack surface with distinct characteristics, and we showed that entity substitution attacks can manipulate token-wise Shapley value explanations, which become deceptive in this setting. Our approach complements standard evaluations for high-capacity models and the results highlight the brittleness of domain knowledge in LLMs.
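The abstract names power-scaled distance-weighted sampling without giving its form. As a rough illustration only, the sketch below assumes that a substitute entity is drawn with probability proportional to its embedding distance from the original entity raised to a tunable power. The function names, the Euclidean distance, and the exponent convention are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def power_scaled_weights(anchor, candidates, power, eps=1e-12):
    """Sampling probabilities proportional to distance ** power.

    Assumption (not from the paper): distance is Euclidean distance in
    embedding space. power = 0 gives uniform sampling, power > 0 favors
    far candidates, and power < 0 favors near candidates.
    """
    # Clamp distances away from zero so negative powers stay finite.
    dists = [max(math.dist(anchor, c), eps) for c in candidates]
    scaled = [d ** power for d in dists]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_substitutes(anchor, candidates, power, k, seed=0):
    """Draw k candidate indices (with replacement) under the weights above."""
    rng = random.Random(seed)
    probs = power_scaled_weights(anchor, candidates, power)
    return rng.choices(range(len(candidates)), weights=probs, k=k)


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real entity embeddings would be high-dimensional.
    original = (0.0, 0.0)
    same_type_entities = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
    for n in (-1.0, 0.0, 1.0):
        print(n, power_scaled_weights(original, same_type_entities, n))
    print(sample_substitutes(original, same_type_entities, power=1.0, k=5))
```

Under this reading, the exponent acts as a single knob that shifts the query budget between entities close to the original in embedding space and entities far from it, which is one way the "controllable coverage" mentioned above could be realized.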