June 9, 2026
Reinforcement Learning Is Cross-Entropy, Reweighted by Reward
When the only signal is a number at the end of a trajectory: how REINFORCE turns out to be the same gradient as cross-entropy classification, scaled by return, and where the theory tops out.
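
The headline claim can be checked numerically before any theory. Below is a minimal sketch, assuming a softmax policy over three actions and a single sampled action with return `ret`. The function names (`reinforce_loss`, `numeric_grad`) and all numbers are illustrative assumptions, not taken from any library. It compares a finite-difference gradient of the single-sample REINFORCE surrogate `-ret * log pi(action)` against the analytic cross-entropy gradient multiplied by `ret`.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, label):
    # Analytic gradient of -log softmax(logits)[label] w.r.t. the logits:
    # softmax(logits) - onehot(label).
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

def reinforce_loss(logits, action, ret):
    # Single-sample REINFORCE surrogate: -R * log pi(action).
    return -ret * math.log(softmax(logits)[action])

def numeric_grad(f, logits, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Illustrative values (assumptions): three logits, sampled action 1, return 3.0.
logits = [0.2, -1.0, 0.5]
action, ret = 1, 3.0

ce_grad = cross_entropy_grad(logits, action)
rl_grad = numeric_grad(lambda z: reinforce_loss(z, action, ret), logits)
scaled_ce_grad = [ret * g for g in ce_grad]

# Largest coordinate-wise gap between the two gradients.
max_gap = max(abs(a - b) for a, b in zip(rl_grad, scaled_ce_grad))
print("REINFORCE grad:      ", rl_grad)
print("return * CE grad:    ", scaled_ce_grad)
print("match within 1e-6:   ", max_gap < 1e-6)
```

The sampled action plays the role of the class label, and the return plays the role of a per-example weight. This only covers the single-sample, no-baseline case; it says nothing about variance or about where the action came from.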