Inductive Biases in Neural Networks

A from-scratch tour of why each neural architecture works, built as a pure-Python library you can read. Eight bite-sized posts, and a full monograph.


Every useful neural network is a bundle of assumptions about the data. A model that assumes nothing learns nothing, so the real question is never “is it powerful enough” but “are its assumptions right.” Those assumptions have a name. They are inductive biases, and the right one is what lets a model learn more from less.

This series takes that idea apart by building the major architectures from scratch, in pure Python, and asking of each one: what does it assume, and how would you check whether the assumption is paying off.

The frame

Inductive bias enters at three places, and the series is organized around them:

The architecture: which functions are reachable and cheap. A single linear unit cannot represent XOR; an MLP can. A convolution assumes locality; an RNN assumes the same computation at every timestep; attention fixes no connectivity pattern in advance and learns which positions matter.
The output head: what the outputs mean. The loss is a distribution assumption in disguise, and choosing that distribution (Bernoulli, Categorical, Gaussian, Poisson) is a modeling decision in its own right.
Implementation realization: whether gradient descent can actually find the solution the architecture permits, and whether you can read that solution back out afterward. This one only shows up once you train real models, and it has teeth.

The posts

A network computes numbers; the loss decides what they mean. Function approximation, XOR, and what a hidden layer buys.
The loss function is a distribution assumption. Why every supervised network is a maximum-likelihood estimator, and why the gradient is always the residual.
What a convolution assumes. Locality and translation equivariance, and how to test whether a model is using them.
Bengio’s language model. The Markov assumption made architectural, and what dropping recurrence buys.
Recurrence is weight sharing across time, and it costs you. Backprop through time and the vanishing gradient as the price of the prior.
Attention is a learned pointer dereference. Content-addressable memory, and why depth is the number of lookups you can compose.
Attention weight is not information flow. A model that provably reads the right memory cell while its attention refuses to show you where.
Reinforcement learning is cross-entropy, reweighted by reward. When the only signal is a number at the end of a trajectory.

The book

The posts are the on-ramp. The full treatment, with the hand-derived backward passes, the actual library code, the experiments, and the derivations the posts only gesture at, is collected in a single systematic monograph: Inductive Biases in Neural Networks. It is featured at the top of this page; read it online or download the PDF there.

This series is the build-it-from-scratch companion to The Learning Problem, which works the same territory from the side of theory.

The Monograph

Inductive Biases in Neural Networks

A from-scratch monograph on why each neural architecture works, built on a pure-Python library you can read. One lens, inductive bias along three axes (the architecture, the output head, and whether training can actually find the solution), applied end to end: from why a single linear unit cannot learn XOR through to reverse-engineering a …

112 pages · book


Posts in this Series


1 of 8

A Network Computes Numbers. The Loss Decides What They Mean.

June 9, 2026 5 min read

Function approximation, why one linear unit cannot learn XOR, and what a hidden layer actually buys. The opening of a from-scratch tour of inductive bias.
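
The XOR claim is small enough to check by hand. The sketch below is an illustration under stated assumptions, not code from the series library: it uses hard threshold units, a coarse weight grid, and hand-picked hidden weights, and every name in it (`linear_unit`, `mlp`, `fits_xor`) is hypothetical. The grid search is evidence rather than proof; the comment carries the short algebraic argument.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(z):
    """Hard threshold activation (an assumption made for this sketch)."""
    return 1 if z > 0 else 0

def linear_unit(w1, w2, b):
    """A single linear threshold unit with the given weights and bias."""
    return lambda x1, x2: step(w1 * x1 + w2 * x2 + b)

def fits_xor(f):
    """True if f reproduces all four rows of the XOR truth table."""
    return all(f(x1, x2) == y for (x1, x2), y in XOR.items())

# Why no linear unit works: XOR needs b <= 0, w1 + b > 0, w2 + b > 0 and
# w1 + w2 + b <= 0. Adding the middle two gives w1 + w2 + 2b > 0, so
# w1 + w2 + b > -b >= 0, which contradicts the last constraint.
grid = [i / 2 for i in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5
linear_hits = [wb for wb in product(grid, repeat=3) if fits_xor(linear_unit(*wb))]

def mlp(x1, x2):
    """Two hidden threshold units with hand-set weights: OR and AND."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are on
    return step(h_or - h_and - 0.5)  # "OR but not AND" is XOR

print(len(linear_hits), fits_xor(mlp))
```

What the hidden layer buys here is a change of coordinates: in the (OR, AND) space the four points become linearly separable.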


2 of 8

The Loss Function Is a Distribution Assumption

June 9, 2026 4 min read

Choosing a loss is choosing a distribution for your output. Why every supervised network is a maximum-likelihood estimator, and why the gradient is always the residual.
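
The residual claim can be checked numerically. A minimal sketch, assuming each head is paired with its canonical link (sigmoid for Bernoulli, identity for a unit-variance Gaussian, exp for Poisson) and that the gradient is taken with respect to the pre-activation `z`; the names `HEADS` and `check` are made up for this example. With a non-canonical pairing the gradient picks up an extra factor, so this is a check of the canonical case only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each head: (mean as a function of z, negative log-likelihood in z and y).
# Terms that depend only on y are dropped; they do not affect the gradient.
HEADS = {
    "bernoulli": (
        sigmoid,
        lambda z, y: -(y * math.log(sigmoid(z)) + (1 - y) * math.log(1 - sigmoid(z))),
    ),
    "gaussian": (lambda z: z, lambda z, y: 0.5 * (z - y) ** 2),
    "poisson": (math.exp, lambda z, y: math.exp(z) - y * z),
}

def numeric_grad(f, z, eps=1e-6):
    """Central finite difference of a scalar function."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

def check(name, z, y):
    """Return (prediction minus target, numerical d NLL / d z)."""
    mean, nll = HEADS[name]
    residual = mean(z) - y
    grad = numeric_grad(lambda t: nll(t, y), z)
    return residual, grad

for name, y in [("bernoulli", 1), ("gaussian", 2.0), ("poisson", 3)]:
    residual, grad = check(name, 0.3, y)
    print(f"{name:10s} residual={residual:+.6f} grad={grad:+.6f}")
```

The two columns agree to finite-difference precision for all three heads, which is the pattern the post's title refers to.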

neural-networks inductive-bias maximum-likelihood generalized-linear-models +1


3 of 8

What a Convolution Assumes

June 9, 2026 3 min read

A convolution is a bet about images: nearby pixels matter together, and a feature detector should fire anywhere. How to test whether a model is actually using that bet.
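
One way to make "fire anywhere" testable: shift the input, and check that the output shifts by the same amount. The sketch below assumes a 1D signal, circular padding (so equivariance holds exactly rather than only away from the edges), and integer data; `conv1d_circular`, `shift`, and `untied_layer` are hypothetical helpers, not the series library's API.

```python
def conv1d_circular(x, kernel):
    """Cross-correlation with wrap-around padding: one kernel shared by all positions."""
    n = len(x)
    return [sum(kernel[j] * x[(i + j) % n] for j in range(len(kernel))) for i in range(n)]

def shift(x, s):
    """Circularly shift a list to the right by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s]

def untied_layer(x):
    """Contrast case: a different weight at each position, so no weight sharing."""
    return [(i + 1) * v for i, v in enumerate(x)]

def is_equivariant(layer, x, s):
    """True if shifting then applying equals applying then shifting."""
    return layer(shift(x, s)) == shift(layer(x), s)

x = [3, 1, 4, 1, 5, 9, 2, 6]
kernel = [1, -2, 1]

conv = lambda v: conv1d_circular(v, kernel)
print(all(is_equivariant(conv, x, s) for s in range(len(x))))  # shared kernel
print(is_equivariant(untied_layer, x, 1))                      # untied weights
```

The shared kernel passes for every shift; the untied layer fails at the first one. On a trained model the same comparison can be run on intermediate feature maps, with a tolerance instead of exact equality.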

neural-networks inductive-bias convolutional-networks translation-equivariance +1


4 of 8

Bengio's Language Model: the Markov Assumption Made Architectural

June 9, 2026 3 min read

The simplest neural language model: embed the last N tokens, concatenate, predict the next. What you give up with a fixed window, and what you gain by dropping recurrence.
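
A forward pass makes the fixed window concrete. This is a sketch under assumptions: the sizes are arbitrary, the weights are random rather than trained, a single tanh hidden layer sits between the concatenated embeddings and the logits, and the names (`E`, `W_h`, `W_o`, `next_token_probs`) are invented for illustration.

```python
import math
import random

random.seed(0)
VOCAB, DIM, N, HIDDEN = 10, 4, 3, 8  # hypothetical sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(VOCAB, DIM)         # embedding table, one row per token
W_h = rand_matrix(HIDDEN, N * DIM)  # concatenated window -> hidden
W_o = rand_matrix(VOCAB, HIDDEN)    # hidden -> logits over the vocabulary

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(t - m) for t in z]
    total = sum(e)
    return [t / total for t in e]

def next_token_probs(history):
    """Distribution over the next token given at least N previous token ids."""
    window = history[-N:]                      # anything older is discarded
    x = [v for tok in window for v in E[tok]]  # concatenate the N embeddings
    h = [math.tanh(a) for a in matvec(W_h, x)]
    return softmax(matvec(W_o, h))

p_a = next_token_probs([1, 2, 3, 4, 5])
p_b = next_token_probs([9, 9, 3, 4, 5])
print(p_a == p_b)  # same last N tokens, same prediction
```

The two histories differ only outside the window, so the model cannot tell them apart. That is the Markov assumption expressed as a slice.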

neural-networks inductive-bias language-models embeddings


5 of 8

Recurrence Is Weight Sharing Across Time, and It Costs You

June 9, 2026 3 min read

A recurrent network reuses one cell at every timestep and carries a state. That buys time-translation equivariance and unbounded reach in principle, and bills you in vanishing gradients.
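
The bill can be read off a one-unit network. The sketch below assumes a scalar tanh cell, h_t = tanh(w * h_{t-1} + u * x_t), with hand-picked constants; `state_sensitivity` is a hypothetical helper. It tracks |d h_t / d h_0|, the same product of per-step factors that backprop through time multiplies a late loss gradient by on its way to an early state.

```python
import math

def state_sensitivity(w, u, xs, h0=0.0):
    """Return |d h_t / d h_0| for t = 1..len(xs) in a scalar tanh RNN."""
    h, grad, out = h0, 1.0, []
    for x in xs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)  # chain rule: d h_t / d h_{t-1}
        out.append(abs(grad))
    return out

sens = state_sensitivity(w=0.9, u=1.0, xs=[0.5] * 50)
for t in (1, 10, 25, 50):
    print(t, sens[t - 1])
```

Because the tanh derivative is at most 1, every step scales the gradient by at most |w|, so with |w| < 1 the sensitivity after T steps is bounded by |w|^T. The reach is unbounded in principle and geometric decay in practice.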

neural-networks inductive-bias recurrent-networks backpropagation-through-time +1


6 of 8

Attention Is a Learned Pointer Dereference

June 9, 2026 3 min read

An attention head is a learned content-addressable lookup: a query is matched against keys and retrieves the corresponding value, much like dereferencing a pointer. Depth is how many lookups you can compose.
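
The lookup reading is easy to see with hand-set keys. A minimal sketch, assuming one-hot keys, a single query, and a `sharpness` multiplier on the scores standing in for what training would learn; `attend` and its parameters are hypothetical names, not the library's.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(t - m) for t in z]
    total = sum(e)
    return [t / total for t in e]

def attend(query, keys, values, sharpness=1.0):
    """Single-query dot-product attention. Returns (output, weights)."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights

keys = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # the "addresses"
values = [[10.0], [20.0], [30.0]]         # the "memory cells"
query = [1, 0, 0]                         # asks for address 0

sharp_out, sharp_w = attend(query, keys, values, sharpness=20.0)
soft_out, soft_w = attend(query, keys, values, sharpness=1.0)
print(sharp_out, soft_out)
```

With sharp scores the head returns the value stored at the matching key, to within a rounding error. With soft scores it returns a blend of all three cells, which is the sense in which attention is a differentiable relaxation of a dereference and not an exact one.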

neural-networks inductive-bias transformers attention +1


7 of 8

Attention Weight Is Not Information Flow

June 9, 2026 4 min read

The trained pointer model reads exactly the right memory cell, provably. Its attention barely shows where. The gap, and the causal probe that closes it.

neural-networks inductive-bias interpretability induction-heads +1


8 of 8

Reinforcement Learning Is Cross-Entropy, Reweighted by Reward

June 9, 2026 3 min read

When the only signal is a number at the end of a trajectory. How REINFORCE turns out to be the same gradient as classification, scaled by return, and where the theory tops out.
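
The identity can be verified on a single step. The sketch assumes a softmax policy over three actions, one sampled action, and a scalar return with no baseline; the function names are made up for this example. It compares the numerical gradient of the REINFORCE surrogate loss, -R * log pi(a), against the cross-entropy gradient with the sampled action treated as the label.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(t - m) for t in z]
    total = sum(e)
    return [t / total for t in e]

def cross_entropy_grad(logits, label):
    """Gradient of -log softmax(logits)[label] w.r.t. logits: p - onehot(label)."""
    p = softmax(logits)
    return [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]

def reinforce_loss(logits, action, ret):
    """Surrogate loss for one sampled action with return `ret`."""
    return -ret * math.log(softmax(logits)[action])

def numeric_grad(f, z, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

logits, action, ret = [0.2, -1.0, 0.5], 2, 3.0
rl_grad = numeric_grad(lambda z: reinforce_loss(z, action, ret), logits)
scaled_ce = [ret * g for g in cross_entropy_grad(logits, action)]
print(rl_grad)
print(scaled_ce)
```

The two vectors match to finite-difference precision: the sampled action plays the role of the label, and the return sets the size and sign of the update.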

neural-networks inductive-bias reinforcement-learning policy-gradient +1
