A Network Computes Numbers. The Loss Decides What They Mean.
Function approximation, why one linear unit cannot learn XOR, and what a hidden layer actually buys. The opening of a from-scratch tour of inductive bias.
A from-scratch tour of why each neural architecture works, built as a pure-Python library you can read. Eight bite-sized posts, and a full monograph.
Every useful neural network is a bundle of assumptions about the data. A model that assumes nothing learns nothing, so the real question is never “is it powerful enough” but “are its assumptions right.” Those assumptions have a name. They are inductive biases, and the right one is what lets a model learn more from less.
This series takes that idea apart by building the major architectures from scratch, in pure Python, and asking of each one: what does it assume, and how would you check whether the assumption is paying off?
Inductive bias enters at three places, and the series is organized around them: the architecture, the output head, and whether training can actually find the solution.
The posts are the on-ramp. The full treatment, with the hand-derived backward passes, the actual library code, the experiments, and the derivations the posts only gesture at, is collected in a single systematic monograph: Inductive Biases in Neural Networks. It is featured at the top of this page; read it online or download the PDF there.
This series is the build-it-from-scratch companion to The Learning Problem, which works the same territory from the side of theory.
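The opening claim, that one linear unit cannot learn XOR while a hidden layer can, is small enough to check by hand. What follows is a minimal sketch, not the library's code: the squared-error loss, the helper names (`fit_linear_unit`, `hidden_layer_xor`), and the hand-picked hidden-layer weights are all assumptions made for illustration.

```python
# XOR truth table: (inputs, target).
XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def linear_unit(w1, w2, b, x):
    """One linear unit: a weighted sum of the two inputs plus a bias."""
    return w1 * x[0] + w2 * x[1] + b


def fit_linear_unit(data, lr=0.1, steps=5000):
    """Full-batch gradient descent on half the mean squared error, from zero init.

    Hypothetical helper for this sketch; the loss and hyperparameters are assumptions.
    """
    w1 = w2 = b = 0.0
    n = len(data)
    for _ in range(steps):
        g1 = g2 = gb = 0.0
        for x, y in data:
            err = linear_unit(w1, w2, b, x) - y  # prediction minus target
            g1 += err * x[0] / n
            g2 += err * x[1] / n
            gb += err / n
        w1 -= lr * g1
        w2 -= lr * g2
        b -= lr * gb
    return w1, w2, b


def relu(z):
    return z if z > 0.0 else 0.0


def hidden_layer_xor(x):
    """Two ReLU hidden units with hand-set weights (not learned) that compute XOR."""
    h1 = relu(x[0] + x[1])        # counts how many inputs are on
    h2 = relu(x[0] + x[1] - 1.0)  # fires only when both inputs are on
    return h1 - 2.0 * h2          # subtract the "both on" case twice


w1, w2, b = fit_linear_unit(XOR)
linear_preds = [linear_unit(w1, w2, b, x) for x, _ in XOR]
hidden_preds = [hidden_layer_xor(x) for x, _ in XOR]

print("linear unit:", [round(p, 4) for p in linear_preds])
print("hidden layer:", hidden_preds)
```

Under these assumptions the best the linear unit can do is hedge: its least-squares optimum has both weights at zero and the bias at 0.5, so it predicts 0.5 for all four inputs. The two-unit hidden layer reproduces the truth table exactly, because the second unit isolates the one case a straight line cannot separate.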
A from-scratch monograph on why each neural architecture works, built on a pure-Python library you can read. One lens, inductive bias along three axes (the architecture, the output head, and whether training can actually find the solution), applied end to end: from why a single linear unit cannot learn XOR through to reverse-engineering a …
Function approximation, why one linear unit cannot learn XOR, and what a hidden layer actually buys. The opening of a from-scratch tour of inductive bias.
Choosing a loss is choosing a distribution for your output. Why every supervised network is a maximum-likelihood estimator, and why the gradient is always the residual.
A convolution is a bet about images: nearby pixels matter together, and a feature detector should fire anywhere. How to test whether a model is actually using that bet.
The simplest neural language model: embed the last N tokens, concatenate, predict the next. What you give up with a fixed window, and what you gain by dropping recurrence.
A recurrent network reuses one cell at every timestep and carries a state. That buys time-translation equivariance and unbounded reach in principle, and bills you in vanishing gradients.
An attention head is a learned content-addressable lookup: a query matches keys, retrieves a value, exactly like dereferencing a pointer. Depth is how many lookups you can compose.
The trained pointer model reads exactly the right memory cell, provably. Its attention barely shows where. The gap, and the causal probe that closes it.
When the only signal is a number at the end of a trajectory. How REINFORCE turns out to be the same gradient as classification, scaled by return, and where the theory tops out.
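Two of the claims above, that the gradient is the residual and that REINFORCE is the classification gradient scaled by return, can be spot-checked numerically. This is a sketch under stated assumptions rather than the series' own code: it uses a softmax output head with cross-entropy loss, treats the sampled action as the class label, and the names `residual` and `numeric_grad`, the logits, and the return value are all made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under the softmax."""
    return -math.log(softmax(logits)[target])


def residual(logits, target):
    """Predicted probabilities minus the one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]


def numeric_grad(f, logits, eps=1e-6):
    """Central finite-difference gradient of f with respect to each logit."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grads.append((f(up) - f(down)) / (2.0 * eps))
    return grads


logits = [2.0, -1.0, 0.5]  # arbitrary illustrative values
target = 1                 # class label, or the sampled action in the RL reading

# Classification: the gradient of cross-entropy w.r.t. the logits is the residual.
analytic = residual(logits, target)
numeric = numeric_grad(lambda z: cross_entropy(z, target), logits)

# REINFORCE surrogate loss: -return * log pi(action). Its gradient should be
# the same residual, scaled by the return.
ret = 3.0  # hypothetical return observed at the end of the trajectory
reinforce_analytic = [ret * g for g in analytic]
reinforce_numeric = numeric_grad(lambda z: ret * cross_entropy(z, target), logits)

print("residual:         ", [round(g, 6) for g in analytic])
print("finite difference:", [round(g, 6) for g in numeric])
print("REINFORCE:        ", [round(g, 6) for g in reinforce_numeric])
```

The finite-difference check only confirms the identity for this one head and loss. The general statement covers other output distributions too, which the sketch does not attempt.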