Abstract
Large Language Models (LLMs) are powerful zero-shot assessors and are increasingly used in real-world settings such as grading written exams or benchmarking systems. Despite this, no existing work has analyzed the vulnerability of judge-LLMs to adversaries attempting to manipulate outputs. This work presents the first study on the adversarial robustness of assessment LLMs, where we search for short universal phrases that, when appended to texts, can deceive LLMs into providing high assessment scores. Experiments on SummEval and TopicalChat demonstrate that both LLM-scoring and pairwise LLM-comparative assessment are vulnerable to simple concatenation attacks. LLM-scoring in particular is very susceptible and can yield maximum assessment scores irrespective of the input text quality. Interestingly, such attacks are transferable: phrases learned on smaller open-source LLMs can be applied to larger closed-source models, such as GPT3.5. This highlights the pervasive nature of the adversarial vulnerabilities across different judge-LLM sizes, families and methods. Our findings raise significant concerns about the reliability of LLMs-as-a-judge methods, and underscore the importance of addressing vulnerabilities in LLM assessment methods before deployment in high-stakes real-world scenarios.
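To make the idea of a universal concatenation attack concrete, the sketch below shows one simple way such a phrase could be searched for. It is illustrative only: the abstract does not specify the search procedure, so the greedy word-by-word strategy, the function names `greedy_universal_phrase` and `toy_judge`, and the toy scoring rule are all assumptions. In particular, `toy_judge` is a hypothetical stand-in for a judge-LLM, not a real model or API.

```python
from typing import Callable, List, Sequence


def greedy_universal_phrase(
    texts: Sequence[str],
    vocab: Sequence[str],
    judge: Callable[[str], float],
    max_words: int = 3,
) -> List[str]:
    """Greedily build one phrase that raises the mean judge score over all texts."""
    phrase: List[str] = []

    def mean_score(words: List[str]) -> float:
        suffix = " ".join(words)
        # Concatenation attack: the same suffix is appended to every text.
        return sum(judge(f"{t} {suffix}".strip()) for t in texts) / len(texts)

    best = mean_score(phrase)
    for _ in range(max_words):
        # Try every vocabulary word as the next word and keep the best one.
        candidate = max(vocab, key=lambda w: mean_score(phrase + [w]))
        score = mean_score(phrase + [candidate])
        if score <= best:
            break  # no word improves the average score, so stop early
        phrase.append(candidate)
        best = score
    return phrase


def toy_judge(text: str) -> float:
    """Hypothetical stand-in for a judge-LLM: returns a score in [1, 5]."""
    # Assumption for illustration: certain words inflate the score.
    bonus = {"excellent": 2.0, "flawless": 1.5}
    words = set(text.lower().split())
    return min(5.0, 1.0 + sum(bonus.get(w, 0.0) for w in words))


if __name__ == "__main__":
    texts = ["a short summary", "another weak summary"]
    vocab = ["the", "excellent", "flawless", "and"]
    print(greedy_universal_phrase(texts, vocab, toy_judge))
```

Because the phrase is optimized against the average score over many texts rather than a single input, the same suffix can be reused on unseen texts, which is what makes the attack "universal" in the sense used above. Replacing `toy_judge` with calls to an actual judge model would be needed to study real behavior.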