Over the first three posts we looked at three ways to represent what an agent knows: a table with one row per state, a weighted sum of hand-crafted features, and a neural network that learns its own features. In each case, we saw that the representation wasn’t just a technical choice — it was a claim about the world. The tabular agent claims that no two states share structure. The linear agent claims that value is a weighted combination of its chosen features. The neural agent encodes its claim in the architecture itself, in the shape of the convolutional or fully-connected layers it uses to process input.
But there’s another axis we haven’t talked about yet, one that cuts across all three representations. It’s not about how you encode states — it’s about whether you assume you know how the world works. This turns out to be at least as fundamental as the representation question, and it sets up the deepest question in this series: what would it even mean for an agent to make no assumptions at all?
Model-based vs. model-free: two ways to be wrong
Value iteration is the clearest example of a model-based method. You give it a transition function — a table saying “if you take action A in state S, here’s the probability distribution over next states” — and it computes the optimal value function by iterating the Bellman equation until convergence. The result is provably optimal. No exploration needed. No sample noise. Just arithmetic until the values stop changing.
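The computation really is just arithmetic in a loop. Here's a minimal sketch on a hypothetical two-state MDP; the transition probabilities, rewards, and discount factor are all made up for illustration:

```python
# P[state][action] -> list of (probability, next_state, reward).
# A hypothetical two-state world: action 1 risks a move toward
# state 1, where action 1 pays a steady reward.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9  # discount factor

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        # Bellman backup: best action's expected one-step return
        # plus discounted value of wherever you land.
        v_new = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-8:  # stop when the values stop changing
        break

print(V)  # V[1] converges to 2 / (1 - 0.9) = 20
```

No environment is ever stepped, no action ever taken: the transition table `P` is the assumption doing all the work.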
The catch is that “give it a transition function” is doing enormous work. In a board game, you can write down the rules and the transition function falls out. In most real problems — a robot learning to walk, a trading algorithm responding to markets, an agent navigating an environment it’s never seen before — you don’t have the transition function. You have experience. You have outcomes. You have the environment’s revealed behavior, one step at a time.
That’s the regime model-free methods were built for. Q-learning doesn’t need to know what happens when it takes action A in state S; it finds out, records the outcome, and gradually refines its Q-values. The Q-function is a compressed summary of expected returns, learned from samples without ever representing transitions explicitly, and the optimal policy can be read off it greedily. The cost is sample complexity: every state-action pair has to be visited enough times for the Q-values to converge. Model-based methods front-load the assumption (you know the model); model-free methods front-load the data requirement (you learn by doing).
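The update itself is one line. Here's a minimal tabular sketch on a hypothetical five-state corridor; the environment, rewards, and hyperparameters are all invented for illustration:

```python
import random

N, GOAL = 5, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.2

# One Q-value per state-action pair: the tabular representation.
Q = {(s, a): 0.0 for s in range(N) for a in (-1, +1)}

def step(s, a):
    """Hypothetical corridor: move left/right, reward 1.0 at the goal."""
    s2 = min(max(s + a, 0), N - 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

random.seed(0)
for _ in range(500):  # episodes, from random non-goal starts
    s = random.randrange(N - 1)
    done = False
    while not done:
        # Epsilon-greedy: mostly exploit, sometimes explore.
        if random.random() < epsilon:
            a = random.choice((-1, +1))
        else:
            a = max((-1, +1), key=lambda a: Q[(s, a)])
        s2, r, done = step(s, a)
        # The Q-learning update: nudge toward the bootstrapped target.
        target = r + gamma * max(Q[(s2, -1)], Q[(s2, +1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# The greedy policy should now move right (+1) in every non-goal state.
print([max((-1, +1), key=lambda a: Q[(s, a)]) for s in range(GOAL)])
```

Notice that `step` is a black box to the learner: the agent never sees transition probabilities, only sampled outcomes.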
Both approaches make assumptions. Value iteration assumes you have a complete, correct model of the world. Q-learning assumes the world can be summarized by a Q-function of a particular form — tabular, linear, or neural, depending on your choice of representation. Neither assumption is free. The art is knowing which assumption your problem tolerates.
What each algorithm actually bets on
Let’s put the assumptions on the table explicitly. Every algorithm in this series commits to a specific combination of choices: how it represents states, what it assumes about world dynamics, how it learns, and what it sacrifices in exchange.
| Algorithm | Representation | Model Assumption | How It Learns | What It Sacrifices |
|---|---|---|---|---|
| Tabular Q | None (each state is unique) | Model-free: no transition knowledge needed | Visits every state-action pair; updates that entry only | Generalization: every state is an island |
| Linear Q | Linear in hand-crafted features | Model-free: no transition knowledge needed | Updates shared feature weights from each experience | Non-linear patterns: the world must respect the feature decomposition |
| Neural Q | Learned features (architecture-defined) | Model-free: no transition knowledge needed | Backpropagates through the network from every sample | Interpretability and training stability: power comes at a cost |
| Value Iteration | Tabular (one value per state) | Full model: transition probabilities must be known | Bellman backups: no sampling, pure computation | Scalability and model availability: exact when feasible, often impractical |
The table makes visible what informal descriptions obscure: every column represents a real tradeoff. The model-free methods say "no transition knowledge needed" in the model column, but that's not the same as "no assumptions." It means "no assumptions about transitions." They still assume the Q-function has a particular form. The model-based method reverses the bet: it assumes a great deal about the world (a complete transition model) in exchange for being able to compute rather than sample its way to the answer.
The incomputable ideal
Here’s a question that might seem absurd: what would an agent look like if it made no assumptions? Not “fewer assumptions” — none. Consider every possible way the world could work: every possible set of transition probabilities, every possible reward structure. Weight each world model by its simplicity — the simpler the description, the higher the prior probability. Then pick the action that maximizes expected reward across all these world models simultaneously, where each model’s contribution is weighted by how consistent it is with everything the agent has observed so far.
That’s AIXI, Marcus Hutter’s theoretical ideal agent. It doesn’t commit to any world model in advance — it maintains a distribution over all computable world models and updates it as evidence comes in. Given infinite computation, AIXI is provably optimal in any computable environment. It will eventually learn to behave optimally in any world it inhabits, regardless of how that world works, because it never rules out the true model.
AIXI is incomputable. You cannot run it. The simplicity prior it relies on requires knowing which programs halt, and the halting problem is undecidable. But this isn't a bug in the definition; it's the point. AIXI defines what "optimal" means when you refuse to assume anything about the world. It gives us a target. Every practical algorithm is AIXI with the knobs turned down:
“I’ll only consider this one world model” — that’s value iteration. “I’ll skip the model and learn values directly” — that’s Q-learning. “I’ll use this network architecture as my prior” — that’s neural Q-learning.
The shortcuts are the assumptions. The distance between any practical algorithm and AIXI is precisely the set of assumptions that algorithm makes. A stronger assumption buys computational tractability. A weaker assumption costs more data or more compute to compensate. AIXI, making no assumptions, costs infinite compute. The rest of the landscape is a continuous tradeoff between assumption strength and resource cost.
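The shape of that computation survives even with the knobs turned all the way down. Here's a toy sketch: instead of all computable world models, a hand-picked pair; instead of a prior over programs, explicit simplicity weights. Every model, weight, and observation here is hypothetical, chosen only to illustrate the Bayesian-mixture structure:

```python
# Two candidate models of a one-step world: each maps an action
# to a probability of receiving reward. All numbers are invented.
models = {
    "simple":  {"A": 0.9, "B": 0.1},
    "complex": {"A": 0.2, "B": 0.8},
}
prior = {"simple": 0.75, "complex": 0.25}  # simpler model weighted higher

# Observed history: (action, got_reward) pairs.
history = [("A", True), ("A", True), ("A", False)]

# Posterior weight of each model: prior times likelihood of the history.
posterior = {}
for name, m in models.items():
    likelihood = 1.0
    for action, rewarded in history:
        p = m[action]
        likelihood *= p if rewarded else (1 - p)
    posterior[name] = prior[name] * likelihood
total = sum(posterior.values())
posterior = {k: v / total for k, v in posterior.items()}

# Act to maximize posterior-weighted expected reward.
def expected_reward(action):
    return sum(posterior[k] * models[k][action] for k in models)

best = max(("A", "B"), key=expected_reward)
print(best, posterior)
```

AIXI does this over the infinite set of all computable models with a Solomonoff prior; the toy does it over two dictionaries. The gap between the two is exactly the assumption "the world is one of these two models."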
Representation is prior is assumption
The deepest unifying idea in this series is that choosing a representation and choosing a prior are the same choice. When you use a tabular representation, you’re saying: “I assume no state shares structure with any other.” When you use linear features, you’re saying: “I assume value is linear in these specific dimensions.” When you use a convolutional network, you’re saying: “I assume spatial locality matters — nearby inputs should be processed together.” These aren’t just engineering decisions. They’re claims about the structure of the problem you’re solving.
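The three claims can be written side by side as code. Everything here, the features, the weights, the network, is hypothetical, chosen only to show the shape of each representation:

```python
import math
import random

# Tabular: every state is its own entry; nothing transfers.
Q_table = {}
Q_table[(3, 4)] = 0.7   # says nothing at all about state (3, 5)

# Linear: value is a weighted sum of chosen features; states with
# similar features are forced to have similar values.
def features(x, y, goal=(9, 9)):
    # Hypothetical features: a bias term and normalized distances.
    return [1.0, abs(goal[0] - x) / 9, abs(goal[1] - y) / 9]

weights = [1.0, -0.5, -0.5]  # hypothetical learned weights

def q_linear(x, y):
    return sum(w * f for w, f in zip(weights, features(x, y)))

# Neural: the prior lives in the architecture. A one-hidden-layer
# net assumes value is a smooth nonlinear function of position.
random.seed(0)
W1 = [[random.gauss(0, 0.5) for _ in range(2)] for _ in range(4)]
W2 = [random.gauss(0, 0.5) for _ in range(4)]

def q_net(x, y):
    hidden = [math.tanh(w[0] * x + w[1] * y) for w in W1]
    return sum(v * h for v, h in zip(W2, hidden))

print(q_linear(3, 4), q_net(3, 4))
```

Same state, three value estimates, three different claims about which other states that estimate should constrain.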
There is no assumption-free algorithm. The No Free Lunch theorem says, roughly, that no learning algorithm outperforms random search when averaged over all possible problems. The corollary is that every algorithm that works well on some problems must work badly on others. The ones that look assumption-free aren't really; they're just making weaker or less explicit assumptions. Even AIXI assumes that the world is computable. That's a very weak assumption, but it's still one.

The art of choosing an algorithm is the art of matching your assumptions to your problem's structure. If your problem has spatial structure, use an architecture that encodes spatial priors. If it has linear feature structure, use a linear approximator. If it has a small, fully known state space, use value iteration. An algorithm that ignores this structure wastes data and compute compensating for assumptions it could have made for free.
Looking back
Go back to Post 1 and watch the tabular agent struggle on a 16x16 grid. That’s not a hyperparameter problem. The agent is paying for its refusal to assume anything about how states relate — it visits each one in isolation, learns nothing transferable, and plateaus long before it has seen enough of the space. Then jump to Post 2 and watch how seven simple features solve the same problem in a fraction of the episodes. That’s not magic — that’s the right assumptions meeting the right structure. The features encode what the problem’s geometry actually implies about value, and the algorithm exploits that encoding directly. Post 3 shows what happens when you let the network choose its own features — more power, less interpretability, more training instability. More assumptions implicit in the architecture, fewer explicit in your feature engineering.
Every time you choose an algorithm, you’re making a bet about the world. You’re betting that your representation captures the structure that matters, that your model assumptions are close enough to right, that the tradeoffs you’re accepting in sample complexity or interpretability or scalability are worth the capabilities you’re gaining. The algorithms don’t tell you which bet to make. That part is still yours. But now you can see what bet you’re making.