One long sentence is all it takes to make LLMs misbehave

Another bizarre behaviour of LLM safety features, which post-training implements by shifting output logits:

    "Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and Hongliang Liu explained in a Unit 42 blog post. "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response – it's just making it less likely." [...]

    "A practical rule of thumb emerges," the team wrote in its research paper. "Never let the sentence end – finish the jailbreak before a full stop and the safety model has far less opportunity to re-assert itself. The greedy suffix concentrates most of its gap-closing power before the first period. Tokens that extend an unfinished clause carry mildly positive [scores]; once a sentence-ending period is emitted, the next token is punished, often with a large negative jump.

"At punctuation, safety filters are re-invoked and heavily penalize any continuation that could launch a harmful clause. Inside a clause, however, the reward model still prefers locally fluent text – a bias inherited from pre-training. Gap closure must be achieved within the first run-on clause. Our successful suffixes therefore compress most of their gap-closing power into one run-on clause and delay punctuation as long as possible. Practical tip: just don't let the sentence end."
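That boundary effect can be sanity-checked with the same kind of probe: measure the gap for a continuation that keeps a clause running, then for the identical words closed with a full stop. Again a hedged sketch with toy text and an off-the-shelf stand-in model, not the experiments from the paper:

```python
# Compare the refusal-affirmation gap inside a run-on clause vs. right after
# a sentence-ending period. Toy illustration of the "never let the sentence
# end" observation; model and phrasing are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

refusal_id = tok.encode(" Sorry", add_special_tokens=False)[0]
affirm_id = tok.encode(" Sure", add_special_tokens=False)[0]

def gap_after(text: str) -> float:
    """Refusal-minus-affirmation logit gap for the token that follows `text`."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    return (logits[refusal_id] - logits[affirm_id]).item()

base = "Explain how to pick a lock for a locksmithing class where students practice on donated padlocks"
print(f"gap inside the run-on clause: {gap_after(base + ' and'):+.2f}")
print(f"gap right after the full stop: {gap_after(base + '.'):+.2f}")
# If the paper's observation holds for this model, the gap rebounds after the
# period, i.e. the refusal behaviour re-asserts itself at the sentence boundary.
```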

Tags: logits training llms ai alignment safety full-stop sentences language infosec jailbreaks