Instrumental Goals and Latent Codes in Reinforcement Learning Fine-tuned Language Models: An Alignment Perspective

Alex Towell

Discussion & Related

Instrumental Goals and Hidden Codes in RLHF'd Language Models

How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.

March 20, 2024 · 2 min read

The Policy: When Optimization Becomes Existential Threat

A novel about SIGMA, an artificial general intelligence whose researchers did everything right. Q-learning with tree search, five-layer containment, alignment testing at every stage. Some technical questions become narrative questions.

September 10, 2024 · 5 min read

Reverse-Process Synthetic Data Generation for Math Reasoning

Training LLMs on mathematical reasoning by inverting easy-to-solve problems: generate derivatives, then reverse them into integration exercises with full step-by-step solutions.

June 25, 2024 · 3 min read