Alignment

December 17, 2025

Human Compatible

December 17, 2025

March 20, 2024

How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.

March 15, 2024