Recommended Reading
This list extends the RL Assumptions series with the literature the interactive posts are in conversation with. Entries marked ✦ are the works I would hand someone starting out; the rest is depth.
The sections mirror the series structure: the foundational textbook first, then the classical tabular and linear methods, then deep RL, then the model-based and universal-agent frame that anchors the series’ final synthesis.
Foundations
The books and lectures that come before everything else.
- Reinforcement Learning: An Introduction by Sutton, Barto (2018, 2nd ed.) [book]✦. The canonical text. Free PDF from the authors. incompleteideas.net.
- Dynamic Programming and Optimal Control by Bertsekas (2017, Vol I 4th ed.) [book]. The control-theoretic counterpart. Denser than Sutton-Barto but complementary.
- Algorithms for Reinforcement Learning by Szepesvári (2010) [book]. Short, rigorous, free.
- Reinforcement Learning Course by Silver (2015) [course]. David Silver's DeepMind lectures. The standard video curriculum.
Tabular and Linear Methods
The classical core: Q-learning, SARSA, TD, and their convergence theory.
- Learning from Delayed Rewards by Watkins (1989) [paper]. The Q-learning thesis.
- Q-Learning by Watkins, Dayan (1992) [paper]✦. The convergence-to-optimal proof for tabular Q-learning; a minimal sketch of the update follows this list. Machine Learning 8.
- Learning to Predict by the Methods of Temporal Differences by Sutton (1988) [paper]. The original TD-learning paper. Machine Learning 3.
- An Analysis of Temporal-Difference Learning with Function Approximation by Tsitsiklis, Van Roy (1997) [paper]✦. Why linear TD converges and nonlinear TD may not. IEEE Trans. Automatic Control 42.
- Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning by Williams (1992) [paper]. REINFORCE, the ancestor of every modern policy-gradient method. Machine Learning 8.
- Policy Gradient Methods for Reinforcement Learning with Function Approximation by Sutton, McAllester, Singh, Mansour (2000) [paper]. The policy-gradient theorem. NIPS.
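To ground the Watkins-Dayan entry, here is a minimal sketch of the tabular Q-learning update their proof is about. The environment interface (reset() returning a state, step() returning next state, reward, and a done flag) is my assumption for illustration, not anything specified in the papers.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (Watkins 1989; convergence: Watkins & Dayan 1992).

    Assumes a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy behavior policy: explore with probability epsilon.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy TD target: bootstrap from the greedy next action,
            # regardless of what the behavior policy will actually do.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

The Watkins-Dayan theorem says this converges to the optimal Q-values provided every state-action pair is visited infinitely often and the step sizes decay appropriately; the fixed alpha above is the usual practical shortcut.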
Deep RL
The practical explosion after 2013, and the architectures that made function approximation work.
- Human-Level Control through Deep Reinforcement Learning by Mnih, Kavukcuoglu, Silver, et al. (2015) [paper]✦. DQN. The paper that showed deep RL works. Nature 518. arXiv.
- Continuous Control with Deep Reinforcement Learning by Lillicrap, Hunt, Pritzel, et al. (2016) [paper]. DDPG; deep RL in continuous action spaces. arXiv.
- Trust Region Policy Optimization by Schulman, Levine, Moritz, Jordan, Abbeel (2015) [paper]. TRPO. arXiv.
- Proximal Policy Optimization Algorithms by Schulman, Wolski, Dhariwal, Radford, Klimov (2017) [paper]✦. PPO. The workhorse of modern policy-gradient RL; its clipped objective appears after this list. arXiv.
- Soft Actor-Critic by Haarnoja, Zhou, Abbeel, Levine (2018) [paper]. SAC. Maximum-entropy RL done right. arXiv.
- A Distributional Perspective on Reinforcement Learning by Bellemare, Dabney, Munos (2017) [paper]. C51 and the distributional-RL program. arXiv.
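Since PPO anchors this group, it is worth writing down the one equation the Schulman et al. paper turns on, the clipped surrogate objective (this is the paper's own formulation, with the advantage estimate and clip range as its two moving parts):

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\left( r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

The clip removes any incentive to push the probability ratio outside [1-ε, 1+ε], which is how PPO buys most of TRPO's trust-region stability with a plain first-order method.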
Model-Based and Universal RL
The horizon the series closes on: representation is a prior, and the limit of that prior is AIXI.
- Universal Artificial Intelligence by Hutter (2005) [book]✦. AIXI, the incomputable universal agent; its defining equation is sketched after this list. Publisher.
- Mastering the Game of Go without Human Knowledge by Silver, Schrittwieser, Simonyan, et al. (2017) [paper]. AlphaGo Zero: model-based RL at scale. Nature 550.
- Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model by Schrittwieser, Antonoglou, Hubert, et al. (2020) [paper]✦. MuZero. The closest-to-AIXI practical system, with learned dynamics. Nature 588.
- World Models by Ha, Schmidhuber (2018) [paper]. Learning a generative model of the environment and planning inside it. arXiv.
- Dream to Control: Learning Behaviors by Latent Imagination by Hafner, Lillicrap, Ba, Norouzi (2020) [paper]. Dreamer; model-based RL with latent dynamics. arXiv.
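For the limit-of-the-prior framing, here is AIXI's action rule in schematic Hutter notation, where U is a universal Turing machine, ℓ(q) the length of program q, and m the horizon; consult the book for the precise conditioning and discounting:

```latex
a_k := \arg\max_{a_k} \sum_{o_k r_k} \;\cdots\; \max_{a_m} \sum_{o_m r_m}
  \left( r_k + \cdots + r_m \right)
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

Every practical system in this section can be read as swapping the inner Solomonoff mixture over programs for a computable model class (MuZero's learned dynamics, Dreamer's latent world model), which is exactly the representation-as-prior trade the series tracks.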
How this list is opinionated
The thread: every RL algorithm is an approximation of an ideal (AIXI at the limit), and the approximation lives in the representation you choose: tabular, linear, neural, model-based, or model-free. Those choices are priors, whether you acknowledge them or not. Works that illuminate that thread are in.
Excluded on purpose: most of the multi-agent literature (different problem), inverse RL and imitation learning (adjacent, different framing), and bandit theory (the series is about full RL, not the degenerate-horizon case).
If you read three things first, read Sutton-Barto chapters 1-6, the DQN paper, and MuZero. Tabular foundations, the deep-RL breakthrough, the model-based near-optimum: three snapshots of the arc the series traces.