The last post showed that hand-picked features can turn a hopeless tabular agent into one that generalizes across an entire grid after a few thousand episodes. Distance to the nearest item. Normalized coordinates. A handful of scalars, chosen by someone who’d looked at the problem and thought hard about what mattered. That works — if you pick the right ones. If you know the domain. If the relevant structure is the kind of thing a human can articulate.
But what if you don’t? What if the problem is complex enough that you can’t say in advance which combinations of raw inputs will matter? What if the right features aren’t “distance to item” but something subtle — a pattern in how walls cluster, a spatial relationship between the exit and the player’s trajectory, a nonlinear interaction the linear agent can’t see by construction? Neural networks learn their own features. But they don’t learn from nothing. Before a single gradient step is taken, you’ve already made a decision that shapes everything that follows: you chose the architecture. And the architecture decides what kind of features can be learned at all.
Neural function approximation
A neural network is just a parameterized function. It takes the raw state representation as input — the same feature vector the linear agent uses — and outputs Q-values for each action. The difference is what happens in between. Where the linear agent computes a weighted sum of its inputs directly, a neural network passes the inputs through one or more hidden layers first. Each hidden layer applies a linear transformation followed by a nonlinearity, producing an intermediate representation. The output layer then reads off Q-values from that representation.
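Concretely, that forward pass is a few lines of linear algebra. Here is a minimal NumPy sketch, where the layer sizes and the random initialization are illustrative assumptions, not the demo's actual code:

```python
import numpy as np

def q_values(state, W1, b1, W2, b2):
    """Two-layer MLP: linear -> ReLU -> linear, one Q-value per action."""
    hidden = np.maximum(0.0, state @ W1 + b1)  # learned intermediate representation
    return hidden @ W2 + b2                    # output layer reads off Q-values

# Illustrative shapes: 6 input features, 20 hidden units, 4 actions.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (6, 20)); b1 = np.zeros(20)
W2 = rng.normal(0, 0.1, (20, 4)); b2 = np.zeros(4)

state = rng.normal(size=6)           # same feature vector the linear agent sees
q = q_values(state, W1, b1, W2, b2)  # shape (4,): one Q-value per action
```

Delete the hidden layer and the `np.maximum` and you are back to the linear agent: the nonlinearity in between is the entire difference.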
The hidden layer is where the magic happens, and it’s worth being precise about what “magic” means here. Each hidden unit computes a nonlinear function of the inputs. During training, the weights to each unit are adjusted so that the unit’s activation becomes useful for predicting value. Units don’t learn features we hand-specified — they learn features that minimize prediction error. They discover what to pay attention to. That’s powerful when the right features aren’t obvious, and it’s also the source of the architecture’s assumptions, as we’ll see.
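That weight adjustment is ordinary gradient descent on the TD error. A sketch of one semi-gradient Q-learning update for a two-layer ReLU network, assuming NumPy and made-up shapes, learning rate, and discount; the demo's training loop may differ in these details:

```python
import numpy as np

def td_update(state, action, reward, next_state, done,
              W1, b1, W2, b2, gamma=0.99, lr=0.05):
    """One semi-gradient Q-learning step on a two-layer ReLU MLP (in place)."""
    # Forward pass, keeping intermediates for the backward pass.
    pre = state @ W1 + b1
    hidden = np.maximum(0.0, pre)
    q = hidden @ W2 + b2

    # TD target, bootstrapped from the next state's best Q-value and
    # treated as a constant: that is the "semi" in semi-gradient.
    next_q = np.maximum(0.0, next_state @ W1 + b1) @ W2 + b2
    target = reward + (0.0 if done else gamma * np.max(next_q))
    delta = q[action] - target  # prediction error for the taken action

    # Backward pass: only the chosen action's Q-value contributes.
    grad_hidden = delta * W2[:, action] * (pre > 0)  # ReLU gate
    W2[:, action] -= lr * delta * hidden
    b2[action]    -= lr * delta
    W1 -= lr * np.outer(state, grad_hidden)
    b1 -= lr * grad_hidden
    return delta

# Repeatedly fitting one transition should shrink the TD error:
rng = np.random.default_rng(1)
W1 = rng.normal(0, 0.1, (6, 20)); b1 = np.zeros(20)
W2 = rng.normal(0, 0.1, (20, 4)); b2 = np.zeros(4)
s, s2 = rng.normal(size=6), rng.normal(size=6)
errors = [abs(td_update(s, 2, 1.0, s2, True, W1, b1, W2, b2))
          for _ in range(50)]
```

Note that the gradient flows into `W1` only through hidden units whose pre-activations are positive: a unit that is "off" for this state gets no update, which is how different units come to specialize on different features.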
This learned generalization comes at a cost. The network needs experience to discover useful features. Early in training, its hidden units are computing essentially random functions of the input — nothing meaningful. Only after enough gradient updates do the features crystallize into something useful. The linear agent, given good hand-crafted features, starts generalizing from episode one. The neural agent has to earn its generalization from scratch. This is the sample efficiency trade-off at the heart of representation learning.
The MLP in this demo
The neural agent running in this post uses a two-layer MLP: the simplest possible deep network. The input is the same feature vector used by the linear agent. The first layer (the hidden layer) multiplies the input by a weight matrix, adds a bias, and applies ReLU — zeroing out any negative values, keeping the rest. The second layer reads the hidden activations and produces one Q-value per action. Nothing else. No convolutions, no attention, no recurrence.
The hidden layer has 20 neurons. Each neuron learns one feature — some nonlinear combination of the inputs that ends up being useful for predicting value. With 20 learned features feeding into 4 action outputs, the network has roughly 200 parameters in total. That’s still a tiny model. But those 200 parameters are doing something qualitatively different from the linear agent’s handful of weights: they’re discovering, through gradient descent, which nonlinear patterns in the state predict future reward. The features aren’t fixed; they’re learned. The architecture just determines the space of features that can be learned — which brings us to the real point.
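The parameter count is easy to check by hand. Assuming, say, 6 input features (the input is described only as a handful of scalars, so the exact size here is an assumption), the arithmetic works out to a couple hundred weights and biases:

```python
d_in, d_hidden, d_out = 6, 20, 4  # input size is an illustrative assumption

hidden_params = d_in * d_hidden + d_hidden  # weights + biases: 120 + 20 = 140
output_params = d_hidden * d_out + d_out    # weights + biases:  80 +  4 =  84
total = hidden_params + output_params       # 224: "roughly 200 parameters"
```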
Three agents, one maze
Watch the three agents train side-by-side on the same maze. The tabular agent memorizes individual cells — its value heatmap fills in slowly, state by state, with no structure to it. The linear agent generalizes across distance features — its heatmap shows smooth gradients from the moment it starts updating, because updating the weight for “distance to item” immediately affects every state. The neural agent learns spatial patterns the others can’t see. Use sparse walls to give the neural agent space to show its advantage.
The key difference shows up in generalization. After a few hundred episodes, restart the agents and change the grid size. The tabular agent forgets everything — a new grid is an entirely new problem. The linear agent retains its weights and keeps generalizing from its features, which are grid-size independent. The neural agent also retains its weights, and may even have learned something about the spatial structure of the problem that transfers.
Train once, play anywhere
Here’s a concrete way to probe what each agent actually learned. Hit Play to train all three agents for a few hundred episodes on the current layout. Then hit New Layout to regenerate the grid with a completely different arrangement of walls, items, and the exit — the agents keep their trained weights but face a maze they’ve never seen. Now hit Watch to see each agent navigate the new layout using pure exploitation: no random exploration, just the policy they learned.
What you’ll see is the generalization gap made visible. The tabular agent will likely stumble — the new positions don’t appear in its Q-table, so it has no learned values for the states it actually encounters, and falls back to arbitrary choices. The linear agent will often transfer reasonably well, because its features (“distance to nearest item,” “normalized x position”) are layout-independent: the weights it learned about what those features predict still apply on the new grid. The neural agent sits somewhere in between — it has learned nonlinear combinations of those same features, and some of those combinations will transfer while others won’t, depending on how well its hidden-layer representations happened to capture genuinely layout-invariant structure.
This is what sample efficiency looks like in practice. The tabular agent memorized a map; New Layout throws that map away. The linear agent learned principles; New Layout tests whether those principles generalize. The neural agent learned something in between — richer than explicit principles, but not as brittle as memorization. How well it transfers depends entirely on which features the hidden layer discovered, which in turn depends on the architecture’s inductive bias. The architecture determined what could be learned, and training determined what was.
Architecture = Prior = Inductive Bias
Here is the claim, stated plainly: the architecture is the Bayesian prior over value functions. Not a metaphor. Not an analogy. When you choose a two-layer MLP, you are saying “I believe the value function can be expressed as smooth, nonlinear combinations of the input features.” You are ruling out value functions that can’t be expressed that way. Every architectural decision you make narrows the hypothesis space — the set of value functions the agent can even represent — and that narrowing is exactly what a prior does.
Consider what different architectures are saying. A fully-connected MLP says “I believe any input feature could interact with any other input feature in determining value — and those interactions might be nonlinear.” A CNN says “I believe spatial locality matters — nearby cells have related values, and the same patterns can appear anywhere on the grid.” A transformer says “I believe attention patterns matter — some subset of the input should attend to some other subset, and that relationship is dynamic.” A recurrent network says “I believe the history of states matters, not just the current state.” These aren’t just implementation choices. They’re commitments. They’re bets about the structure of the problem.
In Bayesian terms: you choose a prior, then learning updates it. An algorithm with a strong, well-matched prior can learn from very little data. An algorithm with a weak or mismatched prior needs to see much more. Tabular Q-learning has an almost degenerate prior — it says “every state is completely independent of every other state,” which is a very strong prior in the other direction: a prior that prevents any generalization at all. Linear approximation has a prior that says “value is a linear function of these specific features.” The MLP prior is weaker — it allows nonlinear combinations — but still constrains the hypothesis space in ways that make learning possible. The Solomonoff prior would place weight on every computable value function, weighted by simplicity. That’s the ideal no practical system can compute. We’re all approximating it with our architectural choices.
What the net can’t learn
Our tiny MLP can discover nonlinear patterns the linear agent can’t represent. Given enough training data, it can learn that a particular configuration of walls and items near the player’s position calls for a specific action, even if that pattern can’t be expressed as a linear combination of distances. That’s real. But the MLP is also blind in a way a convolutional network wouldn’t be. It treats position (3, 4) and position (3, 5) as equally unrelated dimensions. It doesn’t know they’re adjacent. It doesn’t know that adjacent cells tend to have similar values. It can learn that they do — but only if it sees enough data from both positions to draw the connection. A CNN would bake that assumption in from the start: spatial locality is the prior, not something to be learned.
On a structured maze, a CNN-like architecture would notice that the grid has spatial regularity — that the value of a cell is related to the values of its neighbors, that walls form connected structures, that the distance to the exit changes smoothly across the grid. Our MLP can model any of this in principle, but it has to spend parameters and training data discovering regularities that a CNN gets for free from its architecture. The inductive bias of the CNN matches the structure of the problem. The MLP’s inductive bias is weaker. This is the no-free-lunch theorem made concrete: a weaker prior requires more data, and a prior that matches the problem’s structure is worth more than any algorithm improvement.
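The “gets it for free” claim can be made quantitative: a convolutional layer reuses one small kernel across the whole grid, while a fully-connected layer pays a separate weight for every (cell, unit) pair. A sketch of the accounting, using an illustrative 10x10 grid that is not necessarily the demo’s size:

```python
import numpy as np

grid = 10  # illustrative 10x10 maze, flattened to 100 inputs for the MLP

# Fully-connected layer: every cell connects to every hidden unit.
# Nothing in this weight matrix knows that cell (3,4) neighbors (3,5).
mlp_params = grid * grid * 20 + 20   # 100 inputs -> 20 hidden units: 2020

# Convolutional layer: one 3x3 kernel per feature map, slid over the grid.
# Spatial locality and translation invariance are baked in, not learned.
conv_params = 3 * 3 * 20 + 20        # same 20 feature maps: 200

# And the conv layer applies the same local pattern everywhere. E.g. a
# neighbor-averaging kernel, slid over a toy value grid:
kernel = np.full((3, 3), 1 / 9.0)
values = np.arange(grid * grid, dtype=float).reshape(grid, grid)
smoothed = np.zeros((grid - 2, grid - 2))
for i in range(grid - 2):
    for j in range(grid - 2):
        smoothed[i, j] = np.sum(values[i:i+3, j:j+3] * kernel)
```

Ten times fewer parameters for the same number of feature maps, and every one of the conv layer’s features is automatically reusable at every position on the grid. The MLP would have to relearn “wall on my left” separately for each cell.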
What comes next
Three posts in, the pattern is clear. Tabular: no assumptions about state similarity, and you pay for it in sample complexity. Linear: explicit assumptions baked into hand-crafted features, efficient when you’re right and limited when you’re wrong. Neural: assumptions baked into architecture, flexible within that architecture but blind outside it. Every algorithm is making bets. Every representation is a prior. The final post pulls this thread to its end: what would it look like to make the ideal assumptions — and what does that tell us about everything we’ve been building? Next: What You Assume vs. What You Compute.