How Did Elon Musk Turn Grok Into MechaHitler?

Excellent description of the layers of tuning available for LLMs, and the risks involved, as demonstrated by Grok's recent "MechaHitler" incident:

    LLMs become “woke” because they are trained to be pro-social — to be helpful, kindly, truthful, and not to say bigoted or cruel things. Training it to do the opposite — to be anti-woke — is to activate every antisocial association in the English language, including racism, sexism, cruelty, dishonesty, and Nazism. According to a vast statistical representation of the English language constructed by none other than Elon Musk, that’s what anti-wokeness is. “Elon Musk is repeatedly insisting, no, no, there’s a difference between what I’m doing and being a Nazi. And what the model keeps telling him is, statistically, that’s not the case,” said Schou.

    A key implication here is that LLMs will tend to converge on similar types of behavior. The above researchers were not using Grok, but they found the exact same pattern of powerful association groupings of good and evil in other LLMs — and these can’t be removed through fine-tuning. One could imagine the RLHF process including adjustments of every parameter, but experts said that this would degrade or break the model. The matrices in an LLM are arranged hierarchically, and the top layers get fixed in place relatively early in pretraining. Mess with them, and the model will stop working. Instead, RLHF developed into something more like a series of gates that prevent undesired outcomes. “The model completes most of the computation that it needs in order to reach a particular outcome,” [Andreas] Schou said. “And then says, ‘Wait, wait, wait, I’m saying that I’m MechaHitler. No, I’m not doing that.’”

    One could try to assemble a custom dataset with nothing but “conservatism minus the Nazis” and train a new model from scratch, but not only would that be extremely expensive, it also would not be nearly as strong as leading models, since its universe of available training data would be much smaller.
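
That hierarchy point, about layers getting fixed in place early, maps onto a familiar fine-tuning pattern: freeze the early transformer blocks and update only the later ones. Here's a minimal sketch of that pattern, assuming PyTorch and the Hugging Face transformers library, with GPT-2 as a stand-in model and an arbitrary freeze cutoff. It illustrates the general practice, not anything specific to Grok or xAI's pipeline:

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model

    # Freeze the token embeddings and the first N_FROZEN transformer blocks so
    # fine-tuning only updates the later layers; the broad statistical
    # associations laid down in pretraining are left where they are.
    N_FROZEN = 8  # arbitrary illustrative cutoff
    for param in model.transformer.wte.parameters():
        param.requires_grad = False
    for block in model.transformer.h[:N_FROZEN]:
        for param in block.parameters():
            param.requires_grad = False

    # Only the still-trainable parameters are handed to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-5)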

Funnily enough, that train-from-scratch approach is exactly what Elon Musk claims xAI are now doing.
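
Mechanically, step one of that route is corpus filtering. A toy sketch, where is_acceptable is a hypothetical placeholder for whatever classifier or heuristic pipeline would do the filtering; the excerpt's point is that whatever survives the filter is a much smaller universe of training data:

    # Toy sketch of the "assemble a filtered dataset, then pretrain from
    # scratch" idea. `is_acceptable` is a hypothetical placeholder, not any
    # real filtering pipeline.
    def build_filtered_corpus(documents, is_acceptable):
        kept = [doc for doc in documents if is_acceptable(doc)]
        total = len(documents)
        print(f"kept {len(kept)} of {total} documents "
              f"({len(kept) / max(total, 1):.0%})")
        return kept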

Tags: llms language technology ai mechahitler grok tuning rlhf woke training