March 20, 2024
Instrumental Goals and Hidden Codes in RLHF'd Language Models
How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.