Inductive Biases in Neural Networks

A from-scratch tour of why each neural architecture works, built as a pure-Python library you can read. Eight bite-sized posts, and a full monograph.

8 parts

Every useful neural network is a bundle of assumptions about the data. A model that assumes nothing learns nothing, so the real question is never “is it powerful enough” but “are its assumptions right.” Those assumptions have a name. They are inductive biases, and the right one is what lets a model learn more from less.

This series takes that idea apart by building the major architectures from scratch, in pure Python, and asking of each one: what does it assume, and how would you check whether the assumption is paying off.

The frame

Inductive bias enters at three places, and the series is organized around them:

  • The architecture: which functions are reachable and cheap. A single linear unit cannot represent XOR; an MLP can. A convolution assumes locality; an RNN assumes the same computation at every timestep; attention assumes nothing fixed and learns which positions matter.
  • The output head: what the outputs mean. The loss is a distribution assumption in disguise, and choosing it (Bernoulli, Categorical, Gaussian, Poisson) is a modeling decision in its own right.
  • Implementation realization: whether gradient descent can actually find the solution the architecture permits, and whether you can read that solution back out afterward. This one only shows up once you train real models, and it has teeth.
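The XOR claim in the first bullet can be checked by hand. What follows is a minimal sketch in pure Python, not code from the series' library: the names `linear_unit`, `tiny_mlp`, and `count_linear_solutions` are hypothetical, the hidden-layer weights are wired by hand rather than trained, and the grid search is an illustration, not a proof, that no single threshold unit fits all four XOR points.

```python
from itertools import product

# The four XOR input/target pairs.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def linear_unit(w1, w2, b, x):
    """Single threshold unit: outputs 1 when w . x + b > 0, else 0."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0


def relu(z):
    return max(0.0, z)


def tiny_mlp(x):
    """Hand-wired 2-2-1 ReLU network (illustrative weights, not trained)."""
    s = x[0] + x[1]
    h1 = relu(s)        # counts how many inputs are on
    h2 = relu(s - 1)    # nonzero only when both inputs are on
    return h1 - 2 * h2  # cancels the (1, 1) case back to zero


def count_linear_solutions(grid):
    """Count (w1, w2, b) triples on the grid that fit all four XOR points."""
    return sum(
        all(linear_unit(w1, w2, b, x) == y for x, y in XOR)
        for w1, w2, b in product(grid, repeat=3)
    )


grid = [i / 2 for i in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5
linear_hits = count_linear_solutions(grid)
mlp_outputs = [tiny_mlp(x) for x, _ in XOR]
print("linear units that fit XOR on the grid:", linear_hits)
print("hand-wired MLP outputs:", mlp_outputs)
```

The grid comes up empty for a reason that does not depend on the grid: the four constraints on `w1`, `w2`, and `b` contradict each other. The hidden layer fixes this by giving the output a feature, `h2`, that responds only to the input the linear unit gets wrong.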

The posts

  1. A network computes numbers; the loss decides what they mean. Function approximation, XOR, and what a hidden layer buys.
  2. The loss function is a distribution assumption. Why every supervised network is a maximum-likelihood estimator, and why, when the output activation matches the loss, the gradient at the output is always the residual.
  3. What a convolution assumes. Locality and translation equivariance, and how to test whether a model is using them.
  4. Bengio’s language model. The Markov assumption made architectural, and what dropping recurrence buys.
  5. Recurrence is weight sharing across time, and it costs you. Backprop through time and the vanishing gradient as the price of the prior.
  6. Attention is a learned pointer dereference. Content-addressable memory, and why depth is the number of lookups you can compose.
  7. Attention weight is not information flow. A model that provably reads the right memory cell while its attention refuses to show you where.
  8. Reinforcement learning is cross-entropy, reweighted by reward. When the only signal is a number at the end of a trajectory.
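The "gradient is the residual" claim in post 2 is easy to spot-check numerically. The sketch below is an assumption-laden illustration, not the series' own code: it pairs a sigmoid with the Bernoulli negative log-likelihood and an exponential with the Poisson one, and compares a finite-difference gradient with respect to the pre-activation `z` against prediction minus target. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bernoulli_nll(z, y):
    """Negative log-likelihood of y in {0, 1} under p = sigmoid(z)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def poisson_nll(z, y):
    """Negative log-likelihood of a count y under rate exp(z).

    The log(y!) term is dropped because it does not depend on z.
    """
    return math.exp(z) - y * z


def numeric_grad(f, z, y, eps=1e-6):
    """Central finite difference of f with respect to z."""
    return (f(z + eps, y) - f(z - eps, y)) / (2 * eps)


z = 0.7  # arbitrary pre-activation value chosen for the check

# Bernoulli head: the residual is sigmoid(z) - y.
bern_gap = abs(numeric_grad(bernoulli_nll, z, 1) - (sigmoid(z) - 1))

# Poisson head: the residual is exp(z) - y.
pois_gap = abs(numeric_grad(poisson_nll, z, 3) - (math.exp(z) - 3))

print("Bernoulli gap:", bern_gap)
print("Poisson gap:", pois_gap)
```

Both gaps should sit at the level of finite-difference noise. The identity relies on the output activation being the one that goes with the assumed distribution; pair a sigmoid with squared error, for instance, and an extra factor appears in the gradient.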

The book

The posts are the on-ramp. The full treatment, with the hand-derived backward passes, the actual library code, the experiments, and the derivations the posts only gesture at, is collected in a single systematic monograph: Inductive Biases in Neural Networks. It is featured at the top of this page; read it online or download the PDF there.

This series is the build-it-from-scratch companion to The Learning Problem, which works the same territory from the side of theory.

The Monograph

Inductive Biases in Neural Networks

A from-scratch monograph on why each neural architecture works, built on a pure-Python library you can read. One lens, inductive bias along three axes (the architecture, the output head, and whether training can actually find the solution), applied end to end: from why a single linear unit cannot learn XOR through to reverse-engineering a …

112 pages · book
