Justin's Linklog

The "Crescendo" LLM jailbreak

An explanation of this LLM jailbreaking technique, which effectively overrides a model's "safety" fine-tuning by attacking it through repeated prompting (i.e. through the "context"): "Crescendo can be most simply described as using one ‘learning’ method of LLMs — in-context learning: using the prompt to influence the result — overriding the safety that has been created by the other ‘learning’ — fine-tuning, which changes the model’s parameters. [...] What Crescendo does is use a series of harmless prompts in a series, thus providing so much ‘context’ that the safety fine-tuning is effectively neutralised. [...] Intuitively, Crescendo is able to jailbreak a target model by progressively asking it to generate related content until the model has generated sufficient content to essentially override its safety alignment." I also found this very informative: "people have jailbroken [LLMs] “by instructing the model to start its response with the text ‘Absolutely! Here’s’ when performing the malicious task, which successfully bypasses the safety alignment”. This is a good example of the core operation of LLMs, that it is ‘continuation’ [of a string of text] and not ‘answering’".
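To make the "continuation, not answering" point concrete, here is a deliberately tiny sketch. It is not taken from the linked article and it is nothing like a real LLM: it assumes a made-up three-line corpus and a greedy bigram model (the names `CORPUS`, `train_bigrams` and `continue_text` are all hypothetical). All it shows is that a pure continuation model produces different text depending on what is already sitting in its context.

```python
from collections import Counter, defaultdict

# Made-up training text, purely illustrative: refusals are the most
# common thing to follow "assistant:", compliance is rarer.
CORPUS = [
    "assistant: sorry i cannot help",
    "assistant: sorry i cannot help",
    "assistant: absolutely here is the answer",
]


def train_bigrams(lines):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def continue_text(counts, text, max_words=5):
    """Greedily append the most frequent follower of the last word."""
    words = text.split()
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # nothing ever followed this word in the corpus
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


MODEL = train_bigrams(CORPUS)

# With nothing prefilled, the likeliest continuation is the refusal.
plain = continue_text(MODEL, "assistant:")

# With one word prefilled, the same model simply carries on from there.
prefilled = continue_text(MODEL, "assistant: absolutely")

print(plain)
print(prefilled)
```

The toy model has no notion of a question to answer, only of text to extend, so whatever is placed at the end of the context steers everything that follows. Real models condition on far more than the previous word, but the quoted observation is that the same basic dynamic applies.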

(tags: llms jailbreaks infosec vulnerabilities exploits crescendo attacks)
