Reinforcement Learning Is Cross-Entropy, Reweighted by Reward

Everything in the series so far had a label for every example. You knew the right answer and measured how far off the prediction was. Reinforcement learning changes the signal. You no longer get told the right action at each step. You act over a whole trajectory, and at the end you get a single number, the reward. All of the learning has to be backed out of that one number spread over many decisions.

That sounds like a different subject. It is mostly the same one, with the training signal reshaped.

A policy is a classifier over actions

A policy is just a softmax over what to do next, the same output head as a classifier. If you knew the correct action at each step, you would train it with cross-entropy, whose gradient with respect to the logits is the $\hat{p} - y$ from the first posts. You do not know the correct action. That is the entire difficulty, and REINFORCE’s answer to it is almost cheeky: use the action you actually took as if it were the label, and weight its gradient by the return you got. Trajectories that went well push the policy toward the actions they took, in proportion to the return; trajectories with a negative return push away. It is cross-entropy, reweighted by reward. The same gradient, scaled by how things turned out.
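To make the reweighting concrete, here is a minimal NumPy sketch for a single step. The function names and the example numbers are illustrative assumptions, not code from the book; the only claim is the identity described above, that the REINFORCE loss gradient is the cross-entropy gradient for the taken action multiplied by the return.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy_grad(logits, label):
    """Gradient of -log p[label] with respect to the logits: p_hat - y."""
    p = softmax(logits)
    y = np.zeros_like(p)
    y[label] = 1.0
    return p - y

def reinforce_grad(logits, action, ret):
    """Gradient of the REINFORCE loss -ret * log p[action] for one step.

    The taken action stands in for the label, and the return scales it.
    """
    return ret * cross_entropy_grad(logits, action)

logits = np.array([0.5, -1.0, 2.0])   # hypothetical policy logits
g_ce = cross_entropy_grad(logits, 1)  # as if action 1 were the label
g_good = reinforce_grad(logits, 1, ret=3.0)   # trajectory went well
g_bad = reinforce_grad(logits, 1, ret=-2.0)   # trajectory went badly

print("cross-entropy grad:", g_ce)
print("return +3 grad:    ", g_good)
print("return -2 grad:    ", g_bad)
```

These are loss gradients, so a descent step moves against them: with a positive return the logit of the taken action goes up, and with a negative return it goes down.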

That reframing is why the whole toolkit carries over. The body is still an MLP. The head is still a softmax. Only the training signal changed shape, from a per-example target to a scalar over a sequence.

Credit assignment is the hard part

The reward arrives at the end, but a trajectory is a chain of decisions, and the obvious question is which of them earned the reward. REINFORCE’s answer, scale the gradient for every action in the trajectory by the total return, is unbiased and very noisy. Most of the apparatus of modern RL (baselines, value functions, advantage estimates) exists to cut that noise while adding as little bias as possible: a baseline adds none, while a learned value function can trade a little bias for a large drop in variance. The honest one-line summary of the field is: the gradient is easy, the variance is the problem.
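The variance claim can be checked exactly in the smallest possible case. The sketch below is an assumption-laden toy: a one-step problem with three actions and fixed rewards (the logits and rewards are made up), where the mean and total variance of the score-function gradient can be computed by enumerating actions rather than sampling. It compares no baseline against a baseline equal to the expected reward.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def grad_stats(logits, rewards, baseline):
    """Exact mean and total variance of the one-step policy-gradient estimate.

    The estimate for a sampled action a is (r_a - baseline) * (onehot(a) - p).
    """
    p = softmax(logits)
    k = len(p)
    # Row a holds the estimate obtained when action a is sampled.
    samples = (rewards - baseline)[:, None] * (np.eye(k) - p)
    mean = p @ samples                              # expectation over actions
    second_moment = p @ (samples ** 2).sum(axis=1)  # E[||g||^2]
    variance = second_moment - mean @ mean          # trace of the covariance
    return mean, variance

logits = np.array([0.2, -0.1, 0.4])      # hypothetical policy logits
rewards = np.array([10.0, 11.0, 12.0])   # hypothetical per-action rewards
baseline = softmax(logits) @ rewards     # expected reward under the policy

mean_raw, var_raw = grad_stats(logits, rewards, 0.0)
mean_b, var_b = grad_stats(logits, rewards, baseline)

print("means agree:", np.allclose(mean_raw, mean_b))
print("variance without baseline:", var_raw)
print("variance with baseline:   ", var_b)
```

The means match because the baseline term multiplies the expected score, which is zero. The variance drops because the large common offset in the rewards no longer scales every sample.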

It learns from almost nothing

The demonstration is REINFORCE on a five-by-five gridworld with a single sparse reward at the goal. Early on the agent wanders. By the end it takes a near-optimal path, the return climbing close to the best achievable. It is a small thing, but it learns to navigate from nothing but a number delivered at the end of each attempt, which is worth pausing on.
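A demonstration along these lines fits in a few dozen lines. What follows is a sketch under stated assumptions, not the book's implementation: a tabular softmax policy (one row of logits per cell) instead of an MLP, a start in the top-left corner and goal in the bottom-right, a reward of 1 only on reaching the goal, and a discount `gamma` so that shorter paths earn a larger return. The hyperparameters are illustrative guesses.

```python
import numpy as np

SIZE = 5
GOAL = (SIZE - 1, SIZE - 1)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def run_episode(theta, rng, max_steps):
    """Roll out the current policy from the top-left corner.

    Returns the visited (state, action, action_probs) steps and the
    sparse terminal reward: 1.0 if the goal was reached, else 0.0.
    """
    r, c = 0, 0
    steps = []
    for _ in range(max_steps):
        s = r * SIZE + c
        p = softmax(theta[s])
        a = rng.choice(4, p=p)
        steps.append((s, a, p))
        dr, dc = MOVES[a]
        # Moving into a wall leaves the agent where it is.
        r = min(max(r + dr, 0), SIZE - 1)
        c = min(max(c + dc, 0), SIZE - 1)
        if (r, c) == GOAL:
            return steps, 1.0
    return steps, 0.0

def train(episodes=2000, lr=0.2, gamma=0.9, max_steps=80, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((SIZE * SIZE, 4))  # one row of logits per grid cell
    returns = []
    for _ in range(episodes):
        steps, reward = run_episode(theta, rng, max_steps)
        T = len(steps)
        returns.append(reward * gamma ** (T - 1))  # return from the start
        if reward == 0.0:
            continue  # a zero return gives a zero REINFORCE update
        for t, (s, a, p) in enumerate(steps):
            g = reward * gamma ** (T - 1 - t)  # discounted return from step t
            onehot = np.zeros(4)
            onehot[a] = 1.0
            # Ascent step on g * log pi(a|s): the negated cross-entropy
            # gradient (onehot - p), scaled by the return.
            theta[s] += lr * g * (onehot - p)
    return theta, returns

theta, returns = train()
print("mean return, first 200 episodes:", np.mean(returns[:200]))
print("mean return, last 200 episodes: ", np.mean(returns[-200:]))
print("best achievable return:         ", 0.9 ** 7)  # 8-step shortest path
```

Each step is weighted by the return from that step onward rather than by the full-trajectory return, a common low-variance variant of the same estimator. Without the discount, every successful trajectory would earn the same return and nothing would favor short paths.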

Where the theory tops out

The theoretical optimum has a name: AIXI. Bolt together Solomonoff induction (the optimal predictor, which is incomputable), Bayesian decision theory, and reward maximization, and you get the agent that maximizes expected reward under a universal prior over all computable environments. It is also incomputable, which is the same lesson that runs under all of learning. The ideal is out of reach. Every practical method is an approximation. And the assumptions baked into the approximation, the inductive biases, are what decide what it actually learns. In RL those assumptions are the reward shaping, the policy architecture, and the exploration strategy. All priors, all bets.

That closes the loop the series opened with. There are three places you encode assumptions about a problem: the architecture (which functions are reachable), the output head (what the outputs mean), and the training signal (what counts as success). Supervised learning fixes the third as a per-example label. Reinforcement learning keeps the first two and rewrites the third into a reward over trajectories. The inductive-bias lens does not change. Only where you spend the bet does.

The REINFORCE derivation as reweighted cross-entropy, the gridworld agent, and the AIXI framing are in Chapter 8 of the book, Inductive Biases in Neural Networks.
