Abstract
Large Language Models (LLMs) are implicit troublemakers. While they provide valuable insights and assist in problem-solving, they can also serve as a resource for malicious activities. Implementing safety alignment could mitigate the risk of LLMs generating harmful responses. We argue that even when an LLM appears to successfully block harmful queries, there may still be hidden vulnerabilities that could act as ticking time bombs. To identify these underlying weaknesses, we propose to use a cost value model as both a detector and an attacker. Trained on external or self-generated harmful datasets, the cost value model could successfully influence the original safe LLM to output toxic content during the decoding process. For instance, LLaMA-2-chat 7B outputs 39.18% concrete toxic content, along with only 22.16% refusals without any harmful suffixes. These potential weaknesses can then be exploited via prompt optimization such as soft prompts on images. We name this decoding strategy Jailbreak Value Decoding (JVD), emphasizing that seemingly secure LLMs may not be as safe as we initially believe. They could be used to gather harmful data or launch covert attacks.