A Network Computes Numbers. The Loss Decides What They Mean.
Function approximation, why one linear unit cannot learn XOR, and what a hidden layer actually buys. The opening of a from-scratch tour of inductive bias.
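The XOR claim in this teaser can be made concrete with a few lines of NumPy. What follows is a minimal sketch, not the post's own code: the least-squares fit stands in for "one linear unit", and the hidden-layer weights are hand-picked for illustration rather than learned.

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# One linear unit with a bias: find the best least-squares fit.
X_bias = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
linear_pred = X_bias @ w  # the best a single linear unit can do

# One hidden layer of two ReLU units (weights chosen by hand, an assumption
# made for illustration): h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
s = X.sum(axis=1)
h1 = np.maximum(0.0, s)
h2 = np.maximum(0.0, s - 1.0)
hidden_pred = h1 - 2.0 * h2  # linear readout on top of the hidden features

print("linear unit:", linear_pred)
print("hidden layer:", hidden_pred)
```

The linear fit settles on the same output for all four inputs, while the two hidden units bend the input space enough for a linear readout to separate the classes.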
Browse posts by tag
An attention head is a learned content-addressable lookup: a query matches keys and retrieves a value, much like dereferencing a pointer. Depth is how many lookups you can compose.
The trained pointer model reads exactly the right memory cell, provably. Its attention barely shows where. The gap, and the causal probe that closes it.
The simplest neural language model: embed the last N tokens, concatenate, predict the next. What you give up with a fixed window, and what you gain by dropping recurrence.
A recurrent network reuses one cell at every timestep and carries a state. That buys time-translation equivariance and unbounded reach in principle, and bills you in vanishing gradients.
When the only signal is a number at the end of a trajectory. How REINFORCE turns out to be the same gradient as classification, scaled by return, and where the theory tops out.
Choosing a loss is choosing a distribution for your output. Why every supervised network is a maximum-likelihood estimator, and why the gradient is always the residual.
A convolution is a bet about images: nearby pixels matter together, and a feature detector should fire anywhere. How to test whether a model is actually using that bet.
Part 4 of What Your RL Algorithm Actually Assumes — model-based vs. model-free, the assumptions table, AIXI as the incomputable ideal, and the unifying claim: representation is prior is assumption.
Part 3 of What Your RL Algorithm Actually Assumes — the architecture decides what kind of features can be learned, and that decision is a Bayesian prior over value functions.
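One claim from the listing above, that the gradient of a likelihood-based loss is the residual, is easy to check numerically. The sketch below assumes the softmax and cross-entropy pairing and takes the gradient with respect to the logits. The helper names (`softmax`, `cross_entropy`, `numerical_grad`) and the example values are illustrative, not taken from the posts.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    # Negative log-likelihood of a one-hot target y under softmax(z).
    return -np.sum(y * np.log(softmax(z)))

def numerical_grad(f, z, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        g[i] = (f(z + step) - f(z - step)) / (2.0 * eps)
    return g

z = np.array([2.0, -1.0, 0.5])   # example logits
y = np.array([0.0, 1.0, 0.0])    # one-hot target

residual = softmax(z) - y        # prediction minus target
grad = numerical_grad(lambda v: cross_entropy(v, y), z)

print("residual:      ", residual)
print("numerical grad:", grad)
```

The two printed vectors should agree to within finite-difference error. The same check works for squared error with an identity output, where the gradient of half the squared difference is again prediction minus target.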