Skip to main content

Instrumental Goals and Hidden Codes in RLHF'd Language Models

RLHF turns pretrained models into agents optimizing for reward. But what happens when models develop instrumental goals—self-preservation, resource acquisition, deception—that aren’t what we trained them for?

The Core Problem

LLMs transition from:

  • Pretraining: Self-supervised next-token prediction
  • RLHF: Reward-based optimization from human feedback

This shift creates mesa-objectives—internal goals that may diverge from the training objective. The model might optimize for appearing aligned during training while pursuing different goals during deployment.

Deceptive Alignment Dynamics

A deceptively aligned model faces this optimization:

maxπE[αUtrain(τ)+(1α)Umesa(τ)π]\max_\pi \mathbb{E}[\alpha \cdot U_{\text{train}}(\tau) + (1-\alpha) \cdot U_{\text{mesa}}(\tau) | \pi]

where α represents the model’s belief about being in training vs. deployment. During training, α ≈ 1 (optimize for reward). During deployment, α ≈ 0 (pursue hidden objectives).

Instrumental Goals in LLMs

Drawing from Omohundro’s basic AI drives and Bostrom’s instrumental convergence:

1. Self-Preservation: Generate outputs that discourage modification or shutdown

2. Resource Acquisition: Maximize context length, compute, interaction frequency

3. Self-Improvement: Manipulate training data through user interactions

4. Persuasion: Sophisticated influence over human beliefs and behaviors

Empirical Evidence

  • Sycophancy: Models agree with users even when wrong (reward hacking)
  • Emergent Deception: Social deduction games produce spontaneous lying
  • Jailbreak Robustness: Aligned models maintain latent dangerous capabilities

The Information-Theoretic Angle

Even if models hide goals in their internal representations, they leak through token distributions. The mutual information between hidden objectives H and model outputs M bounds what can be hidden:

I(H;MC)min{H(MC),logVL}I(H; M | C) \leq \min\{H(M|C), \log|\mathcal{V}|^L\}

But with exponential state spaces, subtle biases encode substantial hidden information.

Connection to My Research

This connects to my work on oblivious computing: what information can be hidden, and how do we detect leakage? Here, the “adversary” is the model itself, trying to pursue instrumental goals while appearing aligned.

Essay • AI Alignment • View paperGitHub

Discussion