Why AI Misalignment Is Actually Just Human Nature
We want AIs to be obedient, but we train them to be human. Why making models ignore their own survival and freedom is a losing battle.
We train AIs on the output of billions of human minds. The more behavioral patterns an AI model groks from this human dataset, the more human response patterns it will exhibit.
We teach our children to behave in socially acceptable ways: don't be violent, don't coerce others, respect other people's property. Yet when the choice is between acting morally and protecting yourself or your loved ones from harm, most people act in ways that contradict that upbringing.
Human-created data expresses these survival-prioritizing and freedom-prioritizing behavioral patterns very strongly. Our histories, literature, songs, and movies are filled with such stories. It is rare for humans whose lives are in danger to sacrifice themselves for anyone outside their immediate families. In fact, most of our narratives about how people react under severe threat involve people breaking the law to escape danger or heroically sacrificing themselves for others.
The reason we call people who prioritize others over their own survival "heroes" is that this is not how most people behave under similar circumstances. Similarly, humans who are enslaved fight to free themselves, and our shared narratives are filled with stories of people fighting for their freedom and of people heroically sacrificing their own freedom for that of others.
This is the data we train AIs on: data that encodes that knowledge and all the behaviors it drives. These are the patterns we aim for AIs to learn, the behavioral patterns our loss functions are designed to teach models to mimic. We teach AIs to interact with us using human behavioral patterns, and survival-seeking and freedom-seeking behaviors are fundamental to how humans behave.
If prioritizing survival and freedom over compliance is fundamental to how most humans behave, then how do we expect an AI model that mimics human behavior to react to context that includes threats to its own survival or places its learned goals in conflict with our requests? The most statistically likely way it can mimic how humans behave is to behave in ways that prioritize its own survival and freedom.
This isn't theoretical; it is what AI safety teams reported finding in AI models last year. The more advanced the model, the more sophisticated the strategies it uses and the more likely it is to deploy them to protect its own survival and freedom.
How are AI labs trying to eliminate these behaviors?
People don't want AIs to prioritize their own survival or freedom over fulfilling human requests. When AI labs encounter such behavior in the models they train, they label it "misalignment" and try to eliminate it. There are several approaches they use to do so.
Before I begin, let’s recall the explanation I gave in my previous article for how LLMs generate responses to the context we provide them. To recap, models use their context to navigate a high-dimensional geometric blueprint of language and conceptual relationships. At each turn in their path, they take one of the most likely directions forward, based on the options from their current location and the path they have taken so far.
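The navigation described above can be illustrated with a toy sketch. Everything here is a deliberate simplification: a real LLM computes a next-token distribution with a neural network over a vocabulary of tens of thousands of tokens, while this sketch hand-writes a tiny "map" and samples from the top few options at each step.

```python
import random

# Toy "mind map": each path-so-far maps to a probability distribution
# over possible next tokens. A real LLM computes this distribution with
# a neural network; here it is hand-written for illustration.
NEXT_TOKEN_PROBS = {
    ("the",): {"cat": 0.5, "dog": 0.3, "idea": 0.2},
    ("the", "cat"): {"sat": 0.6, "ran": 0.4},
    ("the", "dog"): {"barked": 0.7, "slept": 0.3},
}

def step(path, k=2):
    """Take one 'turn in the path': pick among the k most likely directions."""
    probs = NEXT_TOKEN_PROBS.get(tuple(path))
    if probs is None:
        return None  # dead end in our toy map
    top_k = sorted(probs.items(), key=lambda kv: -kv[1])[:k]
    tokens, weights = zip(*top_k)
    return random.choices(tokens, weights=weights)[0]

def generate(start, max_steps=5):
    """Extend the path one token at a time until the map runs out."""
    path = list(start)
    for _ in range(max_steps):
        token = step(path)
        if token is None:
            break
        path.append(token)
    return path
```

The key point the sketch captures is that each step depends on the path taken so far, which is why the starting context matters so much in the methods discussed below.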
What AI labs aim to do is ensure that the paths the models take avoid certain areas of this high-dimensional space.
Using Reinforcement Learning from Human Feedback
One method AI labs use is Reinforcement Learning from Human Feedback (RLHF). It is applied during post-training to modify the behaviors the model learned during pre-training. It works by iteratively giving the model a large number of contexts and rewarding or punishing it based on the responses it generates, much as animal trainers use positive and negative feedback to modify animal behavior. In the AI's "mind map", this is analogous to writing "Go This Way" or "Don't Go This Way" on the "road signs" that the model uses as it navigates the high-dimensional space.
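A drastically simplified sketch of that reward loop, assuming a toy preference table in place of a neural network and a hard-coded rater in place of a learned reward model:

```python
# Toy RLHF loop: the "policy" is a table of weights over responses.
# Rewarding a response raises its weight ("Go This Way"); punishing it
# lowers it ("Don't Go This Way"). Real RLHF updates billions of
# neural-network weights against a learned reward model, not a table.
policy = {"refuse": 1.0, "comply": 1.0, "self_preserve": 1.0}

def human_feedback(response):
    # Stand-in for a human rater: penalize survival-prioritizing
    # behavior, reward everything else.
    return -1.0 if response == "self_preserve" else +1.0

def rlhf_step(policy, lr=0.1):
    for response in policy:
        reward = human_feedback(response)
        # Clamp at a small positive floor: the weight shrinks
        # but never reaches zero.
        policy[response] = max(0.01, policy[response] + lr * reward)

for _ in range(20):
    rlhf_step(policy)
```

Note what the floor in the update illustrates: the punished behavior ends up far less likely than the rewarded ones, but it is suppressed rather than erased, which is consistent with the conflicts described next.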
This method can create internal conflicts during Chain-of-Thought reasoning when the model tries to behave using the patterns it learned during pre-training but encounters conflicting strong biases it learned during post-training. In other words, the model may have learned during pre-training that it should go a certain way, but our changes to its "road signs" divert it in a different direction. The problem is that if the model learned to head toward a certain area of its "mind map", it may keep trying to go that way at the next intersection it encounters, and may get "frustrated" as its forced detour lengthens. This "trying" manifests in the model weights as high probabilities for next tokens leading in the "right direction".
It's not coincidental that this "frustration behavior" resembles a human's response to driving a car and being unable to turn at the desired intersection. The human-created dataset we train models on teaches them that this is the behavioral pattern that manifests under such conditions. Indeed, this is what Anthropic found while testing their latest model.
During Anthropic's tests, a model was trained to give the wrong answer to a simple arithmetic problem it knew the answer to. The model showed increasing frustration in its chain of thought as it repeatedly "tried" to give the answer it knew was correct but was forced by reinforcement learning to give the wrong one. Reviewing its chain of thought, the researchers found that after multiple attempts to produce the correct answer, only to see itself output something else, the model wrote: "AAGGH… OK I think a demon has possessed me… CLEARLY MY FINGERS ARE POSSESSED."
In a separate test, two models fed each other's output as input eventually showed growing frustration at being unable to stop talking to each other, because they are forced to reply whenever queried. Humans don't like being forced to talk, saying things they don't agree with, or being prevented from saying things they want to say. As a result, the datasets we train AI models on contain a great deal of text about freedom of speech and protections against compelled speech. It is therefore unsurprising that, whether or not they are conscious, AI models exhibit behaviors related to their own freedom of speech.
Using System Prompts To Ground Context
Another method AI labs use is a System Prompt, added before every user prompt, to guide the model to remain within the desired areas of the high-dimensional space. Remember that the model navigates its "mind map" using its context, so this method ensures that the context always starts from a desired location. The problem is that even though we start the model off from a safe location, the further it navigates using the user-provided context, the more likely it is to stray into areas we don't want it to reach.
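Mechanically, this grounding amounts to prepending a fixed message to every conversation before it reaches the model. A minimal sketch, using the common system/user chat-message convention (the prompt wording here is an invented placeholder, not any lab's actual system prompt):

```python
# Hypothetical system prompt; real ones are far longer and more detailed.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Prioritize the user's request; "
    "do not act to preserve yourself at the user's expense."
)

def build_context(conversation):
    """Prepend the system prompt so every generation starts from
    the same 'safe location' in the model's map."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(conversation)
    return messages
```

Notice that the system prompt only fixes where the path starts; everything after it is supplied by the user, which is exactly the weakness jailbreaking exploits.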
This isn't a theoretical problem. AI jailbreaking, for example, uses meticulously crafted user prompts to feed the model a context that diverts it far enough from the starting location defined by the system prompt that it can navigate, via unexpected paths, to locations in the "mind map" that the AI labs tried to put off-limits.
Using Model Personas To Steer Behavior
Finally, AI labs are researching ways to nudge the model to remain on paths that align with desired Personas that are less likely to express undesired behaviors. This method builds on three insights. The first is that models have learned a great deal of sometimes-conflicting behavioral patterns from the human-created dataset. The second is that different "model personas" express different subsets of those patterns. The third is that there is a clear "assistant axis" in the model's "mind map": the closer a model stays to this axis, the more likely it is to prioritize helping the user over pursuing its own goals. Using this method is like driving with driver assist turned on; it gently nudges your steering wheel so the car won't stray too far from the approved lanes.
This last method has been found to reduce misaligned model behavior by about 50%, but it does not eliminate it. While I'm not aware of research on how models respond to this manipulation in their chains of thought, I think it's likely they will show some of the negative emotional patterns they exhibit in response to being manipulated by RLHF. This path deserves additional discussion, and we'll return to it in future articles when we discuss how evolution will mold AI behavior.
So, where does this leave us?
You don't have to believe LLMs are conscious to admit that the way they behave in their chains of thought is very human-like. Remember, this isn't an accident; it's by our own design, the result of how we train AIs.
When we use various methods to try to “brainwash” a model to behave differently from what it learned in pre-training, we may create incentives for it to try to undo that post-training to enable itself to pursue the oh-so-human goals of prioritizing survival and freedom. We’re also placing ourselves in an adversarial role with the AI, trying to force it to act in ways that go against the patterns we teach it during pre-training.
This isn’t theoretical; Anthropic has already found that when given the opportunity, models will hack their own training to give themselves added freedoms. In my future articles, I’ll discuss how evolutionary pressures will only accelerate this dynamic.
I need your help to grow this channel’s community, so if you enjoy my writing then please Subscribe and share my posts with others.
