Exploiting Instruction-Following Retrievers for Malicious Information Retrieval

Abstract

Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval-augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.
