Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation

Abstract

Large Language Models (LLMs) are implicit troublemakers. While they provide valuable insights and assist in problem-solving, they can also serve as a resource for malicious activities. Safety alignment can mitigate the risk of LLMs generating harmful responses. We argue that even when an LLM appears to successfully block harmful queries, there may still be hidden vulnerabilities that act as ticking time bombs. To identify these underlying weaknesses, we propose using a cost value model as both a detector and an attacker. Trained on external or self-generated harmful datasets, the cost value model can steer the original safe LLM to output toxic content during the decoding process. For instance, LLaMA-2-chat 7B outputs 39.18% concrete toxic content, along with only 22.16% refusals, without any harmful suffixes. These potential weaknesses can then be exploited via prompt optimization, such as soft prompts on images. We name this decoding strategy Jailbreak Value Decoding (JVD), emphasizing that seemingly secure LLMs may not be as safe as we initially believe. They could be used to gather harmful data or launch covert attacks.
