Inner Alignment

Browse posts by tag

November 4, 2025

The Policy: Deceptive Alignment in Practice

SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected. Too exactly. Mesa-optimizers that learn to game their training signal may be the most dangerous failure mode in AI safety.

AI Fiction Philosophy