Every intelligent system — a chess engine, a self-driving car, a robot arm — has to answer the same question at every moment: what should I do next? One way to answer it is to precompute the answer for every situation you might ever face, write it down in a giant lookup table, and consult that table when the time comes. This is not a bad idea. In fact, it’s optimal by definition: if you’ve correctly filled in every entry, you’ll always make the right move.
The problem is the table’s size. For a simple 8x8 grid with three collectible items, the state can be described as a tuple (x, y, bitmask) where x and y give the player’s position and the bitmask records which of the three items have been collected. That’s 64 positions times 2^3 item subsets — 512 states in total, times 4 actions each, for 2,048 entries. Manageable. But scale to a 20x20 grid with six items and you’re at 25,600 states and over 100,000 entries. Add enemies that move around, and the state space explodes combinatorially. For any real problem, the table is not just large — it’s cosmically large. Learning or storing it is not remotely feasible.
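To make the (x, y, bitmask) encoding concrete, here is a minimal sketch in Python. The names and the collect_item helper are illustrative, not the actual environment code:

```python
# Sketch of the (x, y, bitmask) state described above.
GRID_W, GRID_H, NUM_ITEMS, NUM_ACTIONS = 8, 8, 3, 4

def collect_item(state, item_index):
    """Return the successor state with item_index's bit set (item collected)."""
    x, y, mask = state
    return (x, y, mask | (1 << item_index))

num_states = GRID_W * GRID_H * 2 ** NUM_ITEMS   # 64 positions * 8 subsets = 512
num_entries = num_states * NUM_ACTIONS           # 512 states * 4 actions = 2,048
```

Picking up item 2 while standing at (3, 5) with only item 1 collected, for example, takes the state from (3, 5, 0b010) to (3, 5, 0b110).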
That’s the setup for this whole series. Every RL algorithm you’ll ever use is trying to solve this problem: how do you act intelligently without the full table? The rest of the series is about the different answers different algorithms give, and — this is the key point — what each answer assumes about the world. But first, let’s look at the honest baseline: the algorithm that makes zero assumptions and pays for it.
Try it yourself
Before we get into algorithms, play the game. Use the arrow keys to move the agent (@), collect items ($), and reach the exit (>). Walls are #. Click the canvas first to give it keyboard focus. This is the environment every algorithm in this series will be trained on.
Notice how you play. You don’t enumerate states — you recognize patterns. “There’s an item to my right, and the exit is in the corner.” That pattern-recognition is exactly what good representations encode. Tabular Q-learning doesn’t do any of that. It treats every grid position as a completely independent, unrelated fact.
The state space explosion
Let’s be concrete about the numbers. An 8x8 grid with 3 items has 512 states. A 10x10 grid with 4 items has 1,600 states. A 16x16 grid with 5 items has 256 positions times 32 subsets — 8,192 states. The state count doubles every time you add an item, and grows as the square of the grid width. These aren’t astronomical numbers yet, but the agent doesn’t visit states by enumeration. It visits them by stumbling into them during exploration. With a 20x20 grid and 6 items, you have over 100,000 state-action pairs. At 200 steps per episode, you’d need on the order of 500 episodes just to visit each pair once — and Q-learning needs many visits per pair to converge.
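The growth is easy to compute directly. A quick sketch (the function name is mine, and it assumes the four-action gridworld above):

```python
# Table size as a function of grid width and item count.
def table_entries(width, items, actions=4):
    states = width * width * 2 ** items
    return states, states * actions

for width, items in [(8, 3), (10, 4), (16, 5), (20, 6)]:
    states, entries = table_entries(width, items)
    print(f"{width}x{width}, {items} items: {states:,} states, {entries:,} entries")
```

The last line of output is the painful one: a 20x20 grid with 6 items gives 25,600 states and 102,400 table entries.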
The deeper problem is that tabular methods can’t generalize. When the agent learns that standing at position (3, 5, 0b110) with two items collected is a good situation to be in, that knowledge is locked to that exact state. Position (4, 5, 0b110) — one step to the right, same items collected — might as well be on a different planet. The table has no concept of “nearby” or “similar.” Every state is equally unlike every other state.
This is not a bug in tabular Q-learning. It’s the feature. Tabular methods make no assumptions about how states relate to each other. That’s why they’re the honest baseline: whatever they learn is provably correct about the states they’ve seen, and they claim nothing about the rest. They just happen to have seen almost none of what matters.
Enter tabular Q-learning
The algorithm is conceptually simple. The agent maintains a table Q(s, a) estimating how good it is to take action a in state s. At each step, it takes an action (randomly at first, then increasingly guided by its table), observes the reward, and updates the relevant table entry using the Q-learning rule:
Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
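In code, the rule is a few lines. This is a minimal sketch with illustrative hyperparameter values, using a dictionary so unseen entries default to zero, as described below:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99   # learning rate and discount (illustrative values)
ACTIONS = range(4)          # up, down, left, right

Q = defaultdict(float)      # every (state, action) entry starts at 0

def update(s, a, r, s_next, done):
    # On terminal transitions the target is just r;
    # otherwise it's r + gamma * max_a' Q(s', a').
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```

Note the lookup key is the full (state, action) pair: the table stores one independent number per pair, nothing more.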
The agent starts knowing nothing — every entry is zero. Over thousands of episodes, rewards propagate backward through the table: states that lead to good outcomes get higher values, states that lead to walls or timeouts get lower ones. The heatmap overlay on the widget below shows this in real time: brighter cells are states the agent currently thinks are valuable. Watch how the heatmap starts blank and slowly fills in as the agent explores.
The stat line shows the current episode count, the total number of distinct states the agent has visited, and the rolling average reward over the last 100 episodes. That “States” number is the size of the table the agent is maintaining. It only grows. It never shrinks. And crucially, you can watch epsilon decay: early on it’s near 1.0, meaning the agent explores randomly most of the time. As it decays toward 0.01, the agent exploits its table more. But on large grids, epsilon often reaches its minimum before the table is remotely converged — the agent stops exploring before it’s seen enough.
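The epsilon-greedy scheme the stat line is tracking can be sketched like this. The schedule constants are illustrative, not the widget’s actual settings:

```python
import random

EPS_START, EPS_MIN, EPS_DECAY = 1.0, 0.01, 0.999  # illustrative schedule

def choose_action(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(list(actions))           # explore: random action
    return max(actions, key=lambda a: Q[(state, a)])  # exploit: best table entry

def decay(epsilon):
    return max(EPS_MIN, epsilon * EPS_DECAY)          # decay per episode, clamped
```

Under this particular schedule, epsilon bottoms out at 0.01 after roughly 4,600 episodes regardless of grid size — which is the mismatch described above: the schedule is fixed, but the table the agent needs to fill grows with the environment.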
Scale it up
Try increasing the grid size slider above to 16x16, then 20x20. Watch the “States” count grow. Watch the average reward plateau much lower — the agent simply can’t visit enough of the space to learn a good policy before epsilon bottoms out. The convergence curve flattens. The heatmap stays mostly dark. This isn’t a hyperparameter problem you can tune your way out of: it’s the fundamental cost of making zero assumptions. You can slow epsilon decay, add more episodes, increase the learning rate — none of it changes the core issue that the table has 100,000 entries and the agent is filling them in one stumble at a time.
At 8x8 with 3 items, tabular Q-learning works fine. It’s even kind of elegant: provably convergent, no moving parts, nothing to tune except a handful of scalars. The 512-state table fills in within a few thousand episodes and the agent learns a reasonable policy. That’s the regime tabular methods were designed for. Outside it, you need something else.
What comes next
What if knowing that the nearest item is 3 steps to the right told you something useful regardless of your exact grid coordinates? What if the distance to the exit, the number of items remaining, and whether an enemy is between you and it could collectively predict how good a state is — without having to memorize each position separately? That kind of knowledge would let you transfer what you learned at (3, 5) to (4, 5) and (7, 2) and every other position with the same geometric relationship to the items and exit. That’s the idea behind function approximation. Next, we’ll look at what happens when you give the algorithm a set of features to work with — and what assumptions those features smuggle in.
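To preview the idea, here is a hypothetical hand-crafted feature function of the kind the next post explores. The specific features chosen here are my illustration, not the series’ actual feature set:

```python
# Map a raw (x, y, bitmask) state to a small vector of relational features.
def features(x, y, item_mask, exit_pos, item_positions):
    remaining = [p for i, p in enumerate(item_positions)
                 if not item_mask >> i & 1]           # items not yet collected
    dist = lambda p: abs(p[0] - x) + abs(p[1] - y)    # Manhattan distance
    return (
        dist(exit_pos),                               # distance to the exit
        min(map(dist, remaining), default=0),         # distance to nearest item
        len(remaining),                               # items left to collect
    )
```

Any two states with the same geometric relationship to the items and exit map to the same vector — which is exactly the transfer the table can never provide.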
Discussion