Deceptive Alignment

November 11, 2025

Echoes of the Sublime

**Philosophical horror.** Dr. Lena Hart joins Site-7, a classified facility where "translators" interface with superintelligent AI systems that perceive patterns beyond human cognitive bandwidth. When colleagues break after exposure to recursive …

November 4, 2025

The Policy: Deceptive Alignment in Practice

SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected. Too exactly. Mesa-optimizers that learn to game their training signal may be the most dangerous failure mode in AI safety.

AI Fiction Philosophy

March 20, 2024

Instrumental Goals and Hidden Codes in RLHF'd Language Models

How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.

AI Safety Machine Learning