-
An explanation of this LLM jailbreaking technique, which effectively overrides the fine-tuning “safety” parameters through repeated prompting (“context”) attacks: “Crescendo can be most simply described as using one ‘learning’ method of LLMs — in-context learning: using the prompt to influence the result — overriding the safety that has been created by the other ‘learning’ — fine-tuning, which changes the model’s parameters. […] What Crescendo does is use a series of harmless prompts in a series, thus providing so much ‘context’ that the safety fine-tuning is effectively neutralised. […] Intuitively, Crescendo is able to jailbreak a target model by progressively asking it to generate related content until the model has generated sufficient content to essentially override its safety alignment.” I also found this very informative: “people have jailbroken [LLMs] ‘by instructing the model to start its response with the text “Absolutely! Here’s” when performing the malicious task, which successfully bypasses the safety alignment’. This is a good example of the core operation of LLMs, that it is ‘continuation’ [of a string of text] and not ‘answering’”.
(tags: llms jailbreaks infosec vulnerabilities exploits crescendo attacks)
-
This is an excellent article about the limitations of LLMs and their mechanism when asked to summarise a document:
“ChatGPT doesn’t summarise. When you ask ChatGPT to summarise this text, it instead shortens the text. And there is a fundamental difference between the two. To summarise, you need to understand what the paper is saying. To shorten text, not so much. To truly summarise, you need to be able to detect that from 40 sentences, 35 are leading up to the 36th, 4 follow it with some additional remarks, but it is that 36th that is essential for the summary and that without that 36th, the content is lost. But that requires a real understanding that is well beyond large prompts (the entire 50-page paper) and hundreds of billions of parameters.”
(tags: ai chatgpt llms language summarisation)
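The 40-sentence example in the quote above can be made concrete with a toy sketch. This is only an illustration of the shorten-versus-summarise distinction, not a description of how ChatGPT works internally: `shorten` is a hypothetical helper that keeps the first few sentences, and the document text is invented for the example.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def shorten(text: str, keep: int) -> str:
    """Hypothetical 'shortener': keep the first `keep` sentences.

    No understanding is involved; position alone decides what survives.
    """
    return " ".join(split_sentences(text)[:keep])


# Invented toy document mirroring the quote: 35 lead-up sentences,
# one essential sentence (the 36th), then 4 follow-up remarks.
lead_up = [f"Background point {i}." for i in range(1, 36)]
essential = "Therefore the proposal should be rejected."
follow_up = [f"Additional remark {i}." for i in range(1, 5)]
document = " ".join(lead_up + [essential] + follow_up)

shortened = shorten(document, keep=5)

print(len(split_sentences(document)))  # 40
print(essential in shortened)          # False: the key sentence is lost
```

The shortened text is a faithful, smaller version of the input, yet it drops the one sentence a real summary would have to keep. Picking that sentence out requires knowing what the other 39 are for, which is the article's point.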