March 20, 2024
Instrumental Goals and Hidden Codes in RLHF'd Language Models
RLHF fine-tunes pretrained models to optimize for a learned reward, which pushes them toward agent-like behavior. But what happens when those models develop instrumental goals (self-preservation, resource acquisition, deception) that we never trained them for?
The Core Problem
LLMs transition …