December 17, 2025
Alignment
December 17, 2025
The Alignment Problem
March 20, 2024
Instrumental Goals and Hidden Codes in RLHF'd Language Models
How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.
March 15, 2024