Recommended Reading
This list extends the RL Assumptions series with the literature the interactive posts are in conversation with. Entries marked ✦ are the works I would hand someone starting out; the rest is depth.
The sections mirror the series structure: the foundational textbook first, then the classical tabular and linear methods, then deep RL, then the model-based and universal-agent frame that anchors the series’ final synthesis.
Foundations
The books and lectures that come before everything else.
- Reinforcement Learning: An Introduction by Sutton, Barto (2018, 2nd ed.) [book]✦. The canonical text. Free PDF from the authors. incompleteideas.net.
- Dynamic Programming and Optimal Control by Bertsekas (2017, Vol I 4th ed.) [book]. The control-theoretic counterpart. Denser than Sutton-Barto but complementary.
- Algorithms for Reinforcement Learning by Szepesvári (2010) [book]. Short, rigorous, free.
- Reinforcement Learning Course by Silver (2015) [course]. David Silver's DeepMind lectures. The standard video curriculum.
Tabular and Linear Methods
The classical core: Q-learning, SARSA, TD, and their convergence theory.
- Learning from Delayed Rewards by Watkins (1989) [paper]. The Q-learning thesis.
- Q-Learning by Watkins, Dayan (1992) [paper]✦. The convergence-to-optimal proof for tabular Q-learning; a minimal sketch of the update follows this list. Machine Learning 8.
- Learning to Predict by the Methods of Temporal Differences by Sutton (1988) [paper]. The original TD-learning paper. Machine Learning 3.
- An Analysis of Temporal-Difference Learning with Function Approximation by Tsitsiklis, Van Roy (1997) [paper]✦. Why linear TD converges and nonlinear TD may not. IEEE Trans. Automatic Control 42.
- Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning by Williams (1992) [paper]. REINFORCE, the ancestor of every modern policy-gradient method. Machine Learning 8.
- Policy Gradient Methods for Reinforcement Learning with Function Approximation by Sutton, McAllester, Singh, Mansour (2000) [paper]. The policy-gradient theorem. NIPS.
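To ground the Watkins-Dayan entry, here is a minimal sketch of the tabular Q-learning update their proof is about. The environment interface (reset() returning a state, step() returning next state, reward, and a done flag) is my assumption for illustration, not anything specified in the papers.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (Watkins 1989; convergence: Watkins & Dayan 1992).

    Assumes a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy behavior policy: explore with probability epsilon.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy TD target: bootstrap from the greedy next action,
            # regardless of what the behavior policy will actually do.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

The Watkins-Dayan theorem says this converges to the optimal Q-values provided every state-action pair is visited infinitely often and the step sizes decay appropriately; the fixed alpha above is the usual practical shortcut.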
Deep RL
The practical explosion after 2013, and the architectures that made function approximation work.
- Human-Level Control through Deep Reinforcement Learning by Mnih, Kavukcuoglu, Silver, et al. (2015) [paper]✦. DQN. The paper that showed deep RL works. Nature 518. arXiv.
- Continuous Control with Deep Reinforcement Learning by Lillicrap, Hunt, Pritzel, et al. (2016) [paper]. DDPG; deep RL in continuous action spaces. arXiv.
- Trust Region Policy Optimization by Schulman, Levine, Moritz, Jordan, Abbeel (2015) [paper]. TRPO. arXiv.
- Proximal Policy Optimization Algorithms by Schulman, Wolski, Dhariwal, Radford, Klimov (2017) [paper]✦. PPO. The workhorse of modern policy-gradient RL; its clipped objective appears after this list. arXiv.
- Soft Actor-Critic by Haarnoja, Zhou, Abbeel, Levine (2018) [paper]. SAC. Maximum-entropy RL done right. arXiv.
- A Distributional Perspective on Reinforcement Learning by Bellemare, Dabney, Munos (2017) [paper]. C51 and the distributional-RL program. arXiv.
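Since PPO anchors this group, it is worth writing down the one equation the Schulman et al. paper turns on, the clipped surrogate objective (this is the paper's own formulation, with the advantage estimate and clip range as its two moving parts):

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\left( r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

The clip removes any incentive to push the probability ratio outside [1-ε, 1+ε], which is how PPO buys most of TRPO's trust-region stability with a plain first-order method.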
Model-Based and Universal RL
The horizon the series closes on: representation is a prior, and the limit of that prior is AIXI.
- Universal Artificial Intelligence by Hutter (2005) [book]✦. AIXI, the incomputable universal agent; its defining equation is sketched after this list. Publisher.
- Mastering the Game of Go without Human Knowledge by Silver, Schrittwieser, Simonyan, et al. (2017) [paper]. AlphaGo Zero: model-based RL at scale. Nature 550.
- Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model by Schrittwieser, Antonoglou, Hubert, et al. (2020) [paper]✦. MuZero. The closest-to-AIXI practical system, with learned dynamics. Nature 588.
- World Models by Ha, Schmidhuber (2018) [paper]. Learning a generative model of the environment and planning inside it. arXiv.
- Dream to Control: Learning Behaviors by Latent Imagination by Hafner, Lillicrap, Ba, Norouzi (2020) [paper]. Dreamer; model-based RL with latent dynamics. arXiv.
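For the limit-of-the-prior framing, here is AIXI's action rule in schematic Hutter notation, where U is a universal Turing machine, ℓ(q) the length of program q, and m the horizon; consult the book for the precise conditioning and discounting:

```latex
a_k := \arg\max_{a_k} \sum_{o_k r_k} \;\cdots\; \max_{a_m} \sum_{o_m r_m}
  \left( r_k + \cdots + r_m \right)
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

Every practical system in this section can be read as swapping the inner Solomonoff mixture over programs for a computable model class (MuZero's learned dynamics, Dreamer's latent world model), which is exactly the representation-as-prior trade the series tracks.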
How this list is opinionated
The thread: every RL algorithm is an approximation of an ideal (AIXI at the limit), and the approximation lives in the representation you choose: tabular, linear, neural, model-based, or model-free. Those choices are priors, whether you acknowledge them or not. Works that illuminate that thread are in.
Excluded on purpose: most of the multi-agent literature (different problem), inverse RL and imitation learning (adjacent, different framing), and bandit theory (the series is about full RL, not the degenerate-horizon case).
If you read three things first, read Sutton-Barto chapters 1-6, the DQN paper, and MuZero. Tabular foundations, the deep-RL breakthrough, the model-based near-optimum: three snapshots of the arc the series traces.