In the last post, we watched a tabular Q-learning agent struggle with a modest 8x8 grid. The problem wasn’t the algorithm. It was the representation: one entry per state, no generalization, every cell of the grid a stranger to every other. Learn something about position (3, 4)? Congratulations — you know nothing about position (3, 5).
That failure points toward a question worth sitting with: what if states could share knowledge? What if “the nearest item is 3 steps to the right” meant something consistent, regardless of whether you’re standing in the top-left corner or the bottom-right? That would let learning from one situation carry over to others. Instead of a table with a hundred thousand entries, you’d have a handful of weights capturing something real about the problem. That’s the idea behind linear function approximation — and the catch is sharper than it first appears.
A weighted sum instead of a lookup
Linear function approximation replaces the table with an equation. Instead of storing Q(s, a) for every (state, action) pair, you compute it:
Q(s, a) = w1*f1(s) + w2*f2(s) + ... + wn*fn(s)
The features f1(s), f2(s), … are functions you write. They map any state to a number — normalized distance to the nearest item, signed direction to the exit, fraction of items still uncollected. The weights w1, w2, … are what the algorithm learns, typically one weight vector per action. With 7 features and 4 actions, that's 28 parameters (a few more if you add a per-action bias term) instead of hundreds of thousands of Q-table entries. The weights are the same everywhere; only the feature values change as the state changes.
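In code, the whole value function collapses to a dot product. A minimal sketch, assuming NumPy and the 7-feature, 4-action sizes above (the names here are illustrative, not from any particular library):

```python
import numpy as np

N_FEATURES, N_ACTIONS = 7, 4  # illustrative sizes from the text

# One weight vector per action; these weights are all the agent learns.
weights = np.zeros((N_ACTIONS, N_FEATURES))

def q_values(features):
    """Q(s, a) for every action at once: a dot product, not a table lookup.

    `features` is the vector [f1(s), ..., fn(s)] computed from the state.
    """
    return weights @ features  # shape: (N_ACTIONS,)
```

Note that the state never appears directly: the weights only ever see the feature vector, which is exactly why learning transfers between states that happen to look alike.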
The update rule is a gradient step on the weights, not a lookup into a table. When the agent visits a state and gets a reward, it nudges every weight slightly — proportionally to how much that feature influenced the Q-value estimate. Because the weights are shared across all states, learning in one corner of the grid immediately affects predictions everywhere. That’s the generalization the tabular approach was missing.
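That update can be written in a few lines. A sketch of the semi-gradient Q-learning step, continuing the setup above (hyperparameter values are placeholders):

```python
import numpy as np

alpha, gamma = 0.01, 0.99  # illustrative learning rate and discount
N_FEATURES, N_ACTIONS = 7, 4
weights = np.zeros((N_ACTIONS, N_FEATURES))

def td_update(feats, action, reward, next_feats, done):
    """One Q-learning step: nudge the taken action's weights in
    proportion to the TD error times each feature's value."""
    q_sa = weights[action] @ feats
    target = reward if done else reward + gamma * np.max(weights @ next_feats)
    td_error = target - q_sa
    # For a linear model, the gradient of Q w.r.t. the weights is just feats,
    # so features that contributed more get adjusted more.
    weights[action] += alpha * td_error * feats
    return td_error
```

Because `weights[action]` is shared by every state, a single update in one cell shifts the predictions for every state whose features point the same way.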
What you’re learning, then, isn’t “state X is worth Y.” You’re learning “proximity to items tends to pull value up, and distance to the exit pulls it down.” Those are claims about the structure of the problem — not observations about specific cells.
Try it: feature explorer
The widget below runs linear Q-learning with a set of hand-crafted features. Toggle them on and off, then hit Play and watch how the agent’s learned value map responds.
Remove “Nearest Item Distance” and the agent loses its main signal for navigating toward collectibles — it can still stumble onto them by accident, but purposeful approach disappears. Add “Wall Density” and the agent starts learning to avoid tight corridors where it gets pinned. Every feature you enable is a hypothesis about what matters. Every feature you disable is a bet that it doesn’t.
Notice the heatmap as training progresses. With a good feature set, value gradients form quickly — the agent builds a rough map of the grid within a few hundred episodes. Strip it back to two or three features and the map goes blurry. The weights are still updating; they just have less to work with.
The features are the assumptions
When you add “distance to nearest item” as a feature, you’re committing to two claims. First: proximity to items is relevant for decision-making. Second: it’s relevant in the same way, everywhere on the grid. A weight of -0.4 on that feature means the same thing whether the agent is in cell (1, 1) or cell (15, 15). The weight can’t know where it is — only what the feature value is.
That’s a real assumption. In this gridworld, it happens to be a good one: items are worth collecting regardless of grid location, and being closer to one is almost always better than being further away. But you could imagine problems where it breaks down. An item behind a locked door is physically close but strategically irrelevant. Raw Manhattan distance has no way to express that. Your feature would encode the wrong thing, and no amount of training corrects a feature that’s asking the wrong question.
This is why feature engineering is never just a technical step. Every feature you include encodes a belief about what the agent should care about. Every feature you exclude is a claim that it doesn’t matter. Get those beliefs right and linear approximation is remarkably efficient — far better than any tabular approach on the same problem. Get them wrong and you can train forever without converging on anything useful.
Same problem, different representation
Here’s the comparison that makes the difference concrete. Both agents use Q-learning. Same reward structure, same environment, same update rule. The only difference is representation: one keeps a lookup table, one computes a weighted sum of features. Watch what happens on a 16x16 grid with 4 items to collect.
The linear agent converges in seconds — its value map sharpens within a few hundred episodes, and average reward climbs. The tabular agent lags far behind. On a 16x16 grid, the state space (position x item-collection bitmask) is already 256 x 2^4 = 4,096 states, and it doubles with every additional item. The table has to visit each of those states — most of them many times — before its estimates mean anything, and it learns nothing about a state it hasn't seen. The linear agent never had that problem. It generalized from day one.
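The back-of-the-envelope arithmetic is worth doing once, using the sizes from this post (16x16 grid, 4 items, 7 features, 4 actions):

```python
# Parameter count: tabular lookup vs. linear approximation.
grid, items, features, actions = 16, 4, 7, 4

states = grid * grid * 2**items     # position x item-collection bitmask
tabular_entries = states * actions  # one Q-value per (state, action) pair
linear_params = features * actions  # one weight vector per action

print(tabular_entries)  # 16384
print(linear_params)    # 28
```

And the gap widens exponentially: add a fifth item and the table doubles to 32,768 entries while the linear agent's parameter count doesn't move at all.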
No free lunch
There’s a flip side. The linear agent’s efficiency comes entirely from the assumption that value is well-approximated as a linear function of the chosen features. In this gridworld, that assumption holds up reasonably well. But the moment you move to a problem with complex interactions — where the right action depends on the combination of features in a way that no linear function can capture — the linear approximation breaks down. It won’t diverge dramatically; it’ll just asymptote at mediocre performance and stay there, because the model can’t express what the problem actually requires.
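The minimal version of this failure is an XOR-shaped value function. A toy sketch: even the best possible linear fit (computed in closed form here, so training time isn't the issue) flattens an exclusive-or pattern into its average:

```python
import numpy as np

# A value pattern with an interaction no linear function can express:
# value is high only when exactly one of two binary features is on.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR pattern

# Best linear fit (with a bias column) via least squares.
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ w

print(np.round(pred, 2))  # every prediction stuck at 0.5 -- the flat average
```

No learning rate schedule or extra episodes fixes this; the model class simply doesn't contain the answer. The only remedies are better features (e.g. adding the product f1*f2 as a new feature) or a nonlinear model.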
There’s no universally right feature set. There’s no universally right representation. The No Free Lunch theorem makes this precise: averaged across all possible problems, every approach performs the same. What matters is whether your inductive bias — the structure baked into your representation — matches the structure of the problem you’re actually solving. Match the two, and a simple linear model with good features crushes a tabular agent with a hundred times the compute. Mismatch them, and the features become a liability: the agent will learn a confident, consistent, and wrong set of weights.
What comes next
The obvious question is: what if you don’t know which features matter? Feature engineering requires domain knowledge. It requires looking at the problem and making educated guesses about what structure is there to exploit. What if the representation could figure that out on its own — learning not just the weights, but the features themselves? That’s where neural networks enter. The architecture of the network is itself a set of assumptions, but they’re different assumptions: about locality, about hierarchy, about what kinds of patterns are worth detecting. That’s next.