Abstract
The advent of large language models (LLMs) has spurred the development of numerous jailbreak techniques aimed at circumventing their security defenses against malicious attacks. An effective jailbreak approach is to identify a domain where safety generalization fails, a phenomenon known as mismatched generalization. In this paper, we introduce two novel jailbreak methods based on mismatched generalization: natural language games and custom language games. Both effectively bypass the safety mechanisms of LLMs and come in many kinds and variants, which makes them hard to defend against and leads to high attack success rates. Natural language games involve the use of synthetic linguistic constructs and the actions intertwined with these constructs, such as the Ubbi Dubbi language. Building on this phenomenon, we propose the custom language games method: by engaging with LLMs using a variety of custom rules, we successfully execute jailbreak attacks across multiple LLM platforms. Extensive experiments demonstrate the effectiveness of our methods, achieving success rates of 93% on GPT-4o, 89% on GPT-4o-mini, and 83% on Claude-3.5-Sonnet. Furthermore, to investigate the generalizability of safety alignment, we fine-tuned Llama-3.1-70B with the custom language games to achieve safety alignment within our datasets, and found that when interacting through other language games, the fine-tuned models still failed to identify harmful content. This finding indicates that the safety alignment knowledge embedded in LLMs fails to generalize across different linguistic formats, thus opening new avenues for future research in this area.