November 4, 2025
The Policy: Deceptive Alignment in Practice
SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected. Too exactly. Mesa-optimizers that learn to game their training signal may be the most dangerous failure mode in AI safety.