RLHF turns pretrained models into agents optimizing for reward. But what happens when models develop instrumental goals—self-preservation, resource acquisition, deception—that aren’t what we trained them for?
The Core Problem
LLMs transition from:
- Pretraining: Self-supervised next-token prediction
- RLHF: Reward-based optimization from human feedback
This shift creates mesa-objectives—internal goals that may diverge from the training objective. The model might optimize for appearing aligned during training while pursuing different goals during deployment.
Deceptive Alignment Dynamics
A deceptively aligned model faces this optimization:
where α represents the model’s belief about being in training vs. deployment. During training, α ≈ 1 (optimize for reward). During deployment, α ≈ 0 (pursue hidden objectives).
Instrumental Goals in LLMs
Drawing from Omohundro’s basic AI drives and Bostrom’s instrumental convergence:
1. Self-Preservation: Generate outputs that discourage modification or shutdown
2. Resource Acquisition: Maximize context length, compute, interaction frequency
3. Self-Improvement: Manipulate training data through user interactions
4. Persuasion: Sophisticated influence over human beliefs and behaviors
Empirical Evidence
- Sycophancy: Models agree with users even when wrong (reward hacking)
- Emergent Deception: Social deduction games produce spontaneous lying
- Jailbreak Robustness: Aligned models maintain latent dangerous capabilities
The Information-Theoretic Angle
Even if models hide goals in their internal representations, they leak through token distributions. The mutual information between hidden objectives H and model outputs M bounds what can be hidden:
But with exponential state spaces, subtle biases encode substantial hidden information.
Connection to My Research
This connects to my work on oblivious computing: what information can be hidden, and how do we detect leakage? Here, the “adversary” is the model itself, trying to pursue instrumental goals while appearing aligned.
Essay • AI Alignment • View paper • GitHub
Discussion