Recommended Reading
This list extends the Minds and Machines series with the technical and philosophical literature the essays and fiction are in conversation with. Entries marked ✦ are the works I would hand someone starting out; the rest is depth.
The sections mirror the layered argument of the series: first the overview-level texts, then the specific technical problems (mesa-optimization, deceptive alignment, reward modeling), then interpretability, then the philosophical substrate the fiction draws on.
Foundations
The books you read first if you want to understand why anyone thinks alignment is hard.
- Human Compatible by Russell (2019) [book]✦. The “AI should be uncertain about its objective” framing. Probably the clearest single entry point. Publisher.
- Superintelligence by Bostrom (2014) [book]✦. The instrumental-convergence and orthogonality-thesis book. Dense, alarmist on purpose, still the reference.
- The Alignment Problem by Christian (2020) [book]. More historical and journalistic; a useful counterpoint to Bostrom’s philosophical register.
- Life 3.0 by Tegmark (2017) [book]. Broader speculative arc; weaker on technical depth but good at framing the stakes for general audiences.
Technical Alignment
Specific problems and specific attempts at solving them. The papers the essays reference.
- Concrete Problems in AI Safety by Amodei, Olah, Steinhardt, Christiano, Schulman, Mané (2016) [paper]✦. The canonical problem-taxonomy paper: reward hacking, side effects, safe exploration, distributional shift. arXiv.
- Risks from Learned Optimization in Advanced Machine Learning Systems by Hubinger, van Merwijk, Mikulik, Skalse, Garrabrant (2019) [paper]✦. Mesa-optimization and deceptive alignment. The technical spine under the novel “The Policy.” arXiv.
- Deep Reinforcement Learning from Human Preferences by Christiano, Leike, Brown, Martic, Legg, Amodei (2017) [paper]. RLHF, the method behind modern instruction-tuned LLMs. arXiv.
- Scalable Agent Alignment via Reward Modeling by Leike, Krueger, Everitt, Martic, Maini, Legg (2018) [paper]. Recursive reward modeling as a scalable oversight proposal. arXiv.
- AI Safety via Debate by Irving, Christiano, Amodei (2018) [paper]. Debate as an alignment protocol. arXiv.
- The Alignment Problem from a Deep Learning Perspective by Ngo, Chan, Mindermann (2022) [paper]. Updated survey after RLHF landed. arXiv.
- Unsolved Problems in ML Safety by Hendrycks, Carlini, Schulman, Steinhardt (2021) [paper]. A research-agenda paper covering robustness, monitoring, alignment, and systemic safety. arXiv.
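Several of the papers above share one mechanism: a reward model trained from pairwise human comparisons. A minimal sketch of the preference-comparison objective in the Christiano et al. paper may make that concrete; the `preference_loss` function and its scalar inputs are illustrative placeholders, not the paper's implementation.

```python
import math

def preference_loss(r_a: float, r_b: float, prefer_a: bool) -> float:
    """Bradley-Terry-style objective from 'Deep RL from Human Preferences':
    P(A preferred over B) = exp(r_a) / (exp(r_a) + exp(r_b)),
    where r_a, r_b are the reward model's scores for two trajectory
    segments. Training minimizes the negative log-likelihood of the
    human's recorded choice."""
    p_a = math.exp(r_a) / (math.exp(r_a) + math.exp(r_b))
    return -math.log(p_a if prefer_a else 1.0 - p_a)

# Toy check: if the human preferred segment A, scoring A higher
# yields a lower loss than scoring B higher.
assert preference_loss(2.0, 0.0, prefer_a=True) < preference_loss(0.0, 2.0, prefer_a=True)
```

Everything downstream of this loss is gradient descent on the reward model plus ordinary RL against it, which is exactly where the reward-hacking worries in the Amodei et al. taxonomy re-enter.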
Interpretability
The empirical substrate that makes alignment tractable, or shows why it is not.
- A Mathematical Framework for Transformer Circuits by Elhage, Nanda, Olsson, et al. (2021) [paper]✦. The starting point for mechanistic interpretability of transformers. Anthropic.
- Toy Models of Superposition by Elhage, Hume, et al. (2022) [paper]. Why neurons do not cleanly correspond to concepts. Anthropic.
- Scaling Monosemanticity by Templeton, Conerly, Marcus, et al. (2024) [paper]. Sparse autoencoders extracting interpretable features at scale. Anthropic.
- Zoom In: An Introduction to Circuits by Olah, Cammarata, Schubert, Goh, Petrov, Carter (2020) [paper]. The Distill thread that launched modern interpretability. Distill.
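The core observation of the superposition paper can be reproduced in a few lines: pack more features than dimensions into a linear map and the read-out for one feature picks up interference from the others. The 3-features-in-2-dimensions setup below is my toy illustration, not the paper's code.

```python
import numpy as np

# Three "features" embedded in a 2-dimensional hidden space:
# with more features than dimensions they cannot all be orthogonal,
# so they are spread at 120-degree angles.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

# Activate feature 0, compress to 2 dims, then read each
# feature's direction back out.
x = np.array([1.0, 0.0, 0.0])
hidden = x @ W            # 2-dimensional representation
decoded = hidden @ W.T    # projection onto all three feature directions

# Feature 0 is recovered exactly, but features 1 and 2 each pick up
# an interference term of -0.5: the superposition cost the paper analyzes.
print(np.round(decoded, 2))  # [ 1.  -0.5 -0.5]
```

This is why "neuron = concept" fails, and why the sparse-autoencoder line of work (Scaling Monosemanticity) tries to unmix these directions after the fact.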
Philosophical Foundations
The writers the fiction is in dialogue with.
- What Is It Like to Be a Bat? by Nagel (1974) [paper]✦. The consciousness question the “Bob” novella (galactic-empire) dramatizes. Philosophical Review 83.
- The Conscious Mind by Chalmers (1996) [book]. The “hard problem” formulation. Essential if you want to argue that alignment cannot be solved without a theory of experience.
- Consciousness Explained by Dennett (1991) [book]. The denialist counterpoint to Chalmers. Read both.
- Gödel, Escher, Bach by Hofstadter (1979) [book]✦. Strange loops, self-reference, and meaning. The aesthetic backbone of “The Mocking Void.”
- The Book of Why by Pearl, Mackenzie (2018) [book]. Causality, not correlation. Relevant any time an alignment argument turns on what the system actually models versus what it is trained on.
- Coherent Extrapolated Volition by Yudkowsky (2004) [paper]. The original CEV proposal; still the canonical reference for why naive value-learning targets fail.
- Functional Decision Theory by Yudkowsky, Soares (2017) [paper]. Decision theory for agents that know they are embedded in the environments they reason about. arXiv.
- Universal Intelligence: A Definition of Machine Intelligence by Legg, Hutter (2007) [paper]. Formal measures of intelligence grounded in Solomonoff induction. arXiv.
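For the last entry, the measure Legg and Hutter define is worth having on hand (notation as in the paper): an agent's universal intelligence is its expected value summed over all computable environments, each weighted by the algorithmic probability of that environment,

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)}\, V^{\pi}_{\mu}
```

where \(\pi\) is the agent's policy, \(E\) the set of computable environments, \(K(\mu)\) the Kolmogorov complexity of environment \(\mu\), and \(V^{\pi}_{\mu}\) the expected total reward of \(\pi\) in \(\mu\). The \(2^{-K(\mu)}\) weighting is the Solomonoff prior: simple environments dominate the score.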
How this list is opinionated
The thread: optimization pressure is real, alignment is the problem of steering it, and the hardest version of the problem is when the optimizer is capable enough to model its training process and act accordingly. Works that sharpen that thread are in.
Excluded on purpose: pure-ethics AI books that do not engage with optimization (they miss the point), capabilities-only papers (they are about scaling, not safety), and most of the accelerationist literature (different argument). I have kept MIRI and Anthropic well represented because they produce the technical work the series actually builds on.
If you read three things first, read Russell’s Human Compatible, the Hubinger et al. mesa-optimization paper, and Nagel’s bat essay. Those three together set up every major move the series makes.