A convolution is two assumptions made structural. Locality: a unit looks at a small patch, not the whole image. Translation equivariance: the same patch detector runs at every position, so a thing learned in one corner is known in every corner. A small kernel bakes in the first and weight sharing bakes in the second, so both live in the architecture instead of being something you hope the model discovers.
The previous posts treated the architecture as a choice about which functions are reachable. The convolution is the first architecture in this series that commits to a real symmetry of the data, and it is worth being precise about what that commitment is and how you check whether it is paying off.
One kernel, reused everywhere
A convolutional layer takes one small kernel (one per output channel, in practice) and slides it across the image, applying the same weights at every position. That is the whole definition. An ordinary linear layer gives every output its own weights tied to specific input positions; the convolution gives every position the same weights. Translating the input just slides the output, up to edge effects at the border. The equivariance is a property of the operator, true at initialization and after training, not something the model has to learn.
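The claim that equivariance holds at initialization is easy to check numerically. Below is a minimal sketch, assuming a single-channel image, no padding, and a random untrained kernel; `conv2d_valid` is a name made up for this illustration, not a library function. It shifts the input and confirms the output shifts by the same amount, away from the border where `np.roll` wraps around.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over a 2-D image (cross-correlation, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Same weights at every position (i, j).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))
kernel = rng.normal(size=(3, 3))  # random weights, nothing trained

# Shift the image 2 rows down and 3 columns right.
shifted = np.roll(image, shift=(2, 3), axis=(0, 1))

out = conv2d_valid(image, kernel)
out_shifted = conv2d_valid(shifted, kernel)

# Ignore the rows and columns touched by np.roll's wrap-around:
# everywhere else, the output of the shifted image is the shifted output.
equivariant = np.allclose(out_shifted[2:, 3:], out[:-2, :-3])
print(equivariant)
```

Nothing here depends on the kernel values, which is the point: the property belongs to the operator, not to what was learned.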
There is a tidy consequence in the backward pass. Because one kernel weight is read at every position in the forward pass, the gradient for that weight is a sum over every position in the backward pass. Weight sharing in the forward direction and gradient accumulation in the backward direction are the same fact seen from two sides. This += over a shared axis is a motif that comes back later for recurrence (sharing across time) and attention.
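The accumulation can be written out directly. This is a rough sketch under the same simplifying assumptions (one channel, no padding, stride 1), with hypothetical function names; it is not the book's Conv2D code. The forward pass reads the kernel at every position, the backward pass adds one contribution per position into the same gradient array, and a finite-difference check confirms the two agree.

```python
import numpy as np

def conv2d_forward(image, kernel):
    """Forward pass: the same kernel is read at every output position."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv2d_kernel_grad(image, kernel_shape, grad_out):
    """Backward pass for the kernel: one += per position that read it."""
    kh, kw = kernel_shape
    grad_kernel = np.zeros(kernel_shape)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            grad_kernel += grad_out[i, j] * image[i:i + kh, j:j + kw]
    return grad_kernel

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
grad_out = rng.normal(size=(6, 6))  # stand-in for the upstream gradient

analytic = conv2d_kernel_grad(image, kernel.shape, grad_out)

# Finite-difference check on the scalar loss sum(out * grad_out).
eps = 1e-4
numeric = np.zeros_like(kernel)
for a in range(3):
    for b in range(3):
        bump = np.zeros_like(kernel)
        bump[a, b] = eps
        loss_plus = np.sum(conv2d_forward(image, kernel + bump) * grad_out)
        loss_minus = np.sum(conv2d_forward(image, kernel - bump) * grad_out)
        numeric[a, b] = (loss_plus - loss_minus) / (2 * eps)

grads_match = np.allclose(analytic, numeric, atol=1e-6)
print(grads_match)
```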
The immediate payoff is parameters. A small CNN matches a larger MLP on a digit-classification task with a fraction of the weights, because it does not relearn “an edge looks the same in the top-left as in the bottom-right” separately for every location. It learns the edge once.
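To make the scale of the saving concrete, here is a back-of-the-envelope count. The layer sizes are assumptions picked for illustration (a 28x28 grayscale input, one 256-unit hidden layer, two 3x3 convolutions followed by global average pooling); they are not the models behind the comparison above, and nothing here says these two networks reach the same accuracy.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution; independent of image size."""
    return c_in * c_out * k * k + c_out

# Hypothetical MLP: 784 -> 256 -> 10
mlp_total = dense_params(28 * 28, 256) + dense_params(256, 10)

# Hypothetical CNN: conv(1 -> 8), conv(8 -> 16), global average pool, 16 -> 10
cnn_total = conv_params(1, 8, 3) + conv_params(8, 16, 3) + dense_params(16, 10)

print(mlp_total, cnn_total)
```

The convolutional counts never mention the image height or width. That is weight sharing showing up in the arithmetic: the kernel is paid for once, however many positions it is applied to.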
How to tell the bias is doing work
Matching an MLP with fewer parameters is suggestive, not proof. Maybe the CNN is just getting by. The honest way to check whether the locality prior is actually load-bearing is to break it and watch what happens.
Take the pixels of each image and scramble them with one fixed random permutation, the same shuffle for every image, then retrain. To a human the images are now noise. To the MLP, almost nothing changes: its first layer is fully connected and has no notion of which pixels are neighbors, so a fixed relabeling of the inputs amounts to relabeling the columns of its first weight matrix. To the CNN, the locality assumption is now a lie. Neighbors in the scrambled image are not neighbors in the digit, so a small kernel sees an arbitrary handful of unrelated pixels. The CNN’s accuracy drops more than the MLP’s.
That gap is the measurement. The direction (the CNN is hurt more) confirms the prior is real; the size of the gap is how much the prior was contributing. This is what it looks like to measure an inductive bias rather than just assert one, and it generalizes: if removing a structural assumption costs you nothing, the model was not using it.
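The scrambling step itself is only a few lines. The sketch below uses random arrays as stand-in data and a hypothetical helper, `permute_pixels`; it is not the full experiment, which also needs training and evaluation. It draws one permutation, applies it to every image, and then shows why a dense layer is indifferent: permuting the columns of its weight matrix the same way reproduces the original outputs exactly, so the set of functions it can reach is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.normal(size=(5, 8, 8))  # stand-in for a batch of 8x8 images

# One fixed shuffle, drawn once and reused for every image.
perm = rng.permutation(8 * 8)

def permute_pixels(batch, perm):
    """Apply the same pixel permutation to every image in the batch."""
    flat = batch.reshape(len(batch), -1)
    return flat[:, perm].reshape(batch.shape)

scrambled = permute_pixels(images, perm)

# A dense layer on the original pixels...
W = rng.normal(size=(16, 64))
out_original = images.reshape(5, -1) @ W.T

# ...is matched on the scrambled pixels by permuting W's columns.
out_scrambled = scrambled.reshape(5, -1) @ W[:, perm].T

same_function = np.allclose(out_original, out_scrambled)
print(same_function)
```

No such relabeling rescues a 3x3 kernel: its weights are tied to a fixed neighborhood, and after the shuffle that neighborhood holds pixels that were scattered across the original image.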
One honest caveat. On the small pre-pooled digit set I use, the gap is modest, because the dataset’s own preprocessing already did most of the spatial work. On raw pixels the gap is large. The method is the lesson; the magnitude depends on how much structure survives into your inputs.
The actual Conv2D forward and backward (the += accumulation written out), the parameter-count comparison, and the full permuted-pixel experiment are in Chapter 3 of the book, Inductive Biases in Neural Networks.