
The Loss Function Is a Distribution Assumption

People treat the loss function as a tuning detail, something you pick from a menu and move on. It is not a detail. Choosing a loss is choosing a probability distribution for your output. Once you see that, a lot of scattered facts collapse into one.

The previous post left the network producing raw scores (logits) and handed the job of interpretation to the loss. This post is about what that interpretation actually commits you to.

Every supervised network is a maximum-likelihood estimator

The setup is always the same. The network emits logits. The inverse of a link function maps them to the mean of some distribution over the output, and the loss is the negative log likelihood of the data under that distribution. Minimizing the loss is maximizing likelihood. The loss is not arbitrary; it is whatever distribution you assumed, written as a penalty.

So the menu of losses is really a menu of assumptions:

  • Binary outcome, Bernoulli: logit link (inverse: sigmoid), binary cross-entropy.
  • One of $K$ classes, Categorical: multinomial logit link (inverse: softmax), cross-entropy.
  • A real number, Gaussian with constant variance: identity link, mean squared error.
  • A count, Poisson: log link (inverse: exp), Poisson negative log likelihood.
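To make the list concrete, here is a minimal sketch in plain Python (standard library only) that writes each loss as the negative log of its assumed density. The function names are chosen for illustration and are not the scratchnn API; the Gaussian case assumes a fixed standard deviation of 1 unless told otherwise.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def bernoulli_nll(z, y):
    """Binary cross-entropy: -log Bernoulli(y | p = sigmoid(z)), y in {0, 1}."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_nll(zs, k):
    """Cross-entropy: -log Categorical(k | p = softmax(zs)), k a class index."""
    return -math.log(softmax(zs)[k])

def gaussian_nll(z, y, sigma=1.0):
    """-log Normal(y | mean = z, std = sigma), normalizing constant included."""
    return 0.5 * ((y - z) / sigma) ** 2 + math.log(sigma) + 0.5 * math.log(2 * math.pi)

def poisson_nll(z, y):
    """-log Poisson(y | rate = exp(z)), y a non-negative integer count."""
    return math.exp(z) - y * z + math.lgamma(y + 1)

def half_squared_error(z, y):
    return 0.5 * (y - z) ** 2

# With sigma fixed at 1, the Gaussian NLL and half the squared error differ
# only by a constant that does not depend on the logit z.
gaps = [gaussian_nll(z, 1.0) - half_squared_error(z, 1.0) for z in (-2.0, 0.3, 5.0)]
print(gaps)
```

The printed gaps are all the same number, $\tfrac{1}{2}\log 2\pi$, which is why minimizing MSE and maximizing a constant-variance Gaussian likelihood pick the same parameters.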

Mean squared error is not “the regression loss.” It is the assumption that your targets are the predicted mean plus Gaussian noise of constant variance. If that is wrong, MSE encodes the wrong likelihood, and you pay for it.

The gradient is always the residual

Here is the fact that looks like a coincidence and is not. For every pair above, the gradient of the loss with respect to the logits is

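$$\frac{\partial L}{\partial \mathbf{z}} = \hat{\boldsymbol{\mu}} - \mathbf{y},$$

the predicted mean minus the target (for MSE, up to a constant factor that depends on whether you write the loss with the conventional $\tfrac{1}{2}$). Logistic regression, softmax, least squares, Poisson regression: the same shape every time.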

It is not luck. These are all exponential-family distributions paired with their canonical link, and “the gradient is the residual” is exactly what the canonical link buys you: the logit is the distribution's natural parameter, so differentiating the negative log likelihood leaves only the mean minus the target. You pick the distribution, the matching link comes with it, and the clean gradient falls out. When you can write the backward pass once and reuse it across these losses without changing a line, this is the reason.
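The claim is easy to check numerically. The sketch below compares a central finite-difference gradient of each negative log likelihood against the residual. It is an illustration under stated assumptions, not library code: the names `SCALAR_HEADS`, `scalar_gradients`, and `softmax_gradients` are made up for this example, and terms of the NLL that do not depend on the logit are dropped.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Each scalar head: (NLL as a function of logit z and target y, inverse link).
# Terms that do not depend on z are dropped; they do not affect the gradient.
SCALAR_HEADS = {
    "bernoulli": (lambda z, y: math.log1p(math.exp(z)) - y * z, sigmoid),
    "gaussian": (lambda z, y: 0.5 * (y - z) ** 2, lambda z: z),
    "poisson": (lambda z, y: math.exp(z) - y * z, math.exp),
}

def scalar_gradients(name, z, y, eps=1e-6):
    """Return (finite-difference gradient, residual mean(z) - y)."""
    nll, inverse_link = SCALAR_HEADS[name]
    numeric = (nll(z + eps, y) - nll(z - eps, y)) / (2 * eps)
    return numeric, inverse_link(z) - y

def softmax_gradients(zs, k, eps=1e-6):
    """Same comparison for the categorical head with true class index k."""
    def nll(v):
        return -math.log(softmax(v)[k])
    numeric = []
    for i in range(len(zs)):
        up, down = list(zs), list(zs)
        up[i] += eps
        down[i] -= eps
        numeric.append((nll(up) - nll(down)) / (2 * eps))
    # Residual: predicted probabilities minus the one-hot target.
    residual = [p - (1.0 if i == k else 0.0) for i, p in enumerate(softmax(zs))]
    return numeric, residual

for name, z, y in [("bernoulli", 0.4, 1.0), ("gaussian", -1.2, 0.5), ("poisson", 0.7, 3.0)]:
    print(name, scalar_gradients(name, z, y))
print("softmax", softmax_gradients([0.2, -1.0, 1.5], k=2))
```

In every row the two numbers agree to within finite-difference error, even though the four losses look nothing alike on paper.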

Why this is an inductive bias

The output head is a bet about the data on the response side, the same way the architecture is a bet on the input side. Get it right and you need less data. Get it wrong and the model fights the geometry of the problem.

Counts are the clean example. Fit counts with MSE and an identity head, and nothing stops the model from predicting negative rates; it also assumes the noise is symmetric and constant, and neither holds for a Poisson process. Use the log link and Poisson NLL, and the predicted rate is positive by construction and the assumed variance grows with the mean, because that is what counts do.
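A toy comparison makes the difference visible. The four data points below are made up for illustration, the model is a single linear unit, and the Poisson fit uses plain gradient descent with a hand-picked step size; none of this is taken from the library. The update is the residual gradient from earlier, pushed through the chain rule to the weight and bias.

```python
import math

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 4.0, 9.0]  # made-up counts that grow with x

def fit_least_squares(xs, ys):
    """Closed-form line fit: identity link, MSE."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def fit_poisson(xs, ys, lr=0.005, steps=5000):
    """Gradient descent on the Poisson NLL with rate = exp(w * x + b)."""
    w = b = 0.0
    for _ in range(steps):
        residuals = [math.exp(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(r * x for r, x in zip(residuals, xs))  # dL/dw = sum(residual * x)
        b -= lr * sum(residuals)                             # dL/db = sum(residual)
    return w, b

slope, intercept = fit_least_squares(xs, ys)
w, b = fit_poisson(xs, ys)

x_new = -1.0  # just outside the observed range
mse_prediction = slope * x_new + intercept
poisson_rate = math.exp(w * x_new + b)
print("MSE line predicts:", mse_prediction)
print("Poisson head predicts:", poisson_rate)
```

One step outside the data, the least-squares line already predicts a negative count, while the log-link head cannot, whatever the weights are.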

Two heads worth knowing past the basics

The first is the heteroscedastic Gaussian: instead of predicting only the mean, the network predicts the mean and the variance. The output is two numbers, and the loss lets the model say how unsure it is, input by input. Uncertainty stops being a constant you assume and becomes something the model reports. I built this one in scratchnn; it is a small change to the head and a slightly longer gradient.
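A minimal sketch of what such a head's loss and gradient can look like follows. It assumes the second output is the log of the variance, a common choice that keeps the variance positive; the source does not say how scratchnn parameterizes it, and the function names here are hypothetical.

```python
import math

def hetero_gaussian_nll(mu, log_var, y):
    """-log Normal(y | mean = mu, variance = exp(log_var)), constant dropped."""
    return 0.5 * (log_var + (y - mu) ** 2 / math.exp(log_var))

def hetero_gaussian_grad(mu, log_var, y):
    """Gradient of the NLL with respect to the two head outputs."""
    inv_var = math.exp(-log_var)
    d_mu = (mu - y) * inv_var  # the usual residual, scaled by the precision
    d_log_var = 0.5 * (1.0 - (y - mu) ** 2 * inv_var)
    return d_mu, d_log_var

mu, log_var, y = 1.0, 0.0, 3.0
print(hetero_gaussian_grad(mu, log_var, y))  # → (-2.0, -1.5)
```

The gradient for the log-variance is zero exactly when the predicted variance equals the squared residual, so the head is pushed to report the error it actually makes, and the mean's gradient is the familiar residual down-weighted wherever the model admits it is unsure.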

The second is the mixture density head: predict a mixture of Gaussians rather than one, for outputs that are genuinely multimodal, where the right answer is “either roughly here or roughly there” and the average of the two is wrong. I do not build that one in the library. I describe its shape and stop, because the point is that it is the same idea, a richer assumed distribution, and not a new kind of machinery.
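For readers who want to see that shape in code, here is a sketch of the loss only, not a trained head. It assumes the head emits, per component, a mixture logit, a mean, and a log standard deviation; that parameterization and every name below are assumptions made for illustration.

```python
import math

def log_sum_exp(values):
    m = max(values)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_normal_pdf(y, mean, log_std):
    std = math.exp(log_std)
    return -0.5 * ((y - mean) / std) ** 2 - log_std - 0.5 * math.log(2 * math.pi)

def mixture_nll(mix_logits, means, log_stds, y):
    """-log sum_k pi_k * Normal(y | means[k], exp(log_stds[k])), pi = softmax(mix_logits)."""
    log_norm = log_sum_exp(mix_logits)
    terms = [
        (logit - log_norm) + log_normal_pdf(y, mean, log_std)
        for logit, mean, log_std in zip(mix_logits, means, log_stds)
    ]
    return -log_sum_exp(terms)

# Hypothetical head output for one input: two equally weighted modes at -2 and +2.
mix_logits = [0.0, 0.0]
means = [-2.0, 2.0]
log_stds = [math.log(0.5), math.log(0.5)]

print(mixture_nll(mix_logits, means, log_stds, 2.0))  # target on a mode: low loss
print(mixture_nll(mix_logits, means, log_stds, 0.0))  # the average of the modes: high loss
```

A target sitting on either mode is cheap, and the midpoint between them is expensive, which is the opposite of what a single Gaussian centered on the average would say. With one component the function reduces to the ordinary Gaussian negative log likelihood.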

The body of the network never changes through any of this. Only the head, and the assumption it encodes, does. That separation, the network proposes logits and the loss disposes meaning, is the organizing principle the rest of the series keeps leaning on.

The full catalogue, the exponential-family derivation of the canonical-link result, and the actual loss-class code (each one is three small methods) are in Chapter 2 of the book, Inductive Biases in Neural Networks.
