A neural network is a parameterized function. It takes a vector of numbers and returns a vector of numbers, and that is the whole structural commitment. What the output numbers mean (a probability, a class score, a count, a rate) is a separate decision. The network does not make it. The loss does.
I wrote a small library, scratchnn, to make this concrete. Pure Python, standard library only, no NumPy in the core, because the point is to read it rather than run it fast. This post is the first stop in a tour through it. The organizing idea for the whole series is simple: every architecture and every loss is a bet about the data. The bets have a name. They are inductive biases, and the right one is what lets a model learn more from less.
The cheapest model, and where it fails
Start with the smallest network: one linear layer, $\mathbf{z} = W\mathbf{x} + \mathbf{b}$. No hidden units, no nonlinearity. This is a single linear unit, and the function it can represent is exactly a line (a hyperplane, in higher dimensions). It fits anything linearly separable, and nothing else.
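As a minimal sketch of that layer in plain Python: the function name `linear` and the example weights below are my own illustration, not scratchnn's actual API.

```python
def linear(W, b, x):
    """Return z = W x + b, with W given as a list of rows."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]

# One output unit, two inputs (illustrative values only).
W = [[2.0, -1.0]]
b = [0.5]
z = linear(W, b, [1.0, 3.0])  # 2*1 - 1*3 + 0.5
```

The sign of `z[0]` tells you which side of the line $2x_1 - x_2 + 0.5 = 0$ the input falls on, which is all a single unit can ever report.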
The standard way to see “nothing else” is XOR: the four corners of the unit square, with $(0,1)$ and $(1,0)$ labeled 1 and $(0,0)$ and $(1,1)$ labeled 0. Try to separate them with a sigmoid on top of a single linear unit, $\sigma(w_1 x_1 + w_2 x_2 + b)$, and you can write down what success would require, one inequality per corner (a negative logit where the label is 0, a positive one where it is 1):
$$b < 0, \quad w_1 + b > 0, \quad w_2 + b > 0, \quad w_1 + w_2 + b < 0.$$
Add the middle two and you get $w_1 + w_2 + 2b > 0$, so $w_1 + w_2 + b > -b > 0$, using $b < 0$ from the first. The last inequality says the same quantity is below zero. There is no assignment of weights that satisfies all four. The model cannot fit XOR, and if you train it, every input drifts to probability one half and stays there. Geometrically, no single line can put one diagonal pair of corners on one side and the other pair on the other side. This was Minsky and Papert’s argument in 1969, and it stalled the field for over a decade.
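The drift to one half is easy to watch. Here is a minimal sketch, assuming full-batch gradient descent on cross-entropy; the starting weights, step count, and learning rate are arbitrary choices of mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The four XOR corners and their labels.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def train_single_unit(steps=2000, lr=0.5):
    # Arbitrary starting point (an assumption, not a recommended init).
    w1, w2, b = 0.5, -0.3, 0.1
    for _ in range(steps):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in DATA:
            # Gradient of cross-entropy with respect to the logit.
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y
            g1 += err * x1
            g2 += err * x2
            gb += err
        w1 -= lr * g1
        w2 -= lr * g2
        b -= lr * gb
    return w1, w2, b

w1, w2, b = train_single_unit()
probs = [sigmoid(w1 * x1 + w2 * x2 + b) for (x1, x2), _ in DATA]
```

The loss is convex, so there is nothing to get stuck in: the optimizer finds the best a line can do, and the best a line can do is to say one half everywhere.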
What a hidden layer buys
The fix is one hidden layer with a nonlinearity between the linear maps. Without the nonlinearity the layers collapse: two affine maps composed are just one affine map, and stacking buys nothing. With it, the hidden units can build intermediate features that are linearly separable, and the output layer reads them off. For XOR, one hidden unit can detect OR and another AND, and the difference is XOR. A 2-8-1 network with a tanh in the middle learns it cleanly.
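The OR-minus-AND construction can be wired by hand. This is a sketch with hand-picked weights and a hard threshold standing in for the nonlinearity (my assumptions, chosen for readability); a trained tanh network finds its own, messier version of the same idea.

```python
def step(z):
    """Hard threshold, used here in place of tanh to keep the arithmetic exact."""
    return 1.0 if z > 0 else 0.0

def xor_net(x1, x2):
    h_or = step(x1 + x2 - 0.5)   # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)  # fires only if both inputs are on
    return h_or - h_and          # linear output layer: OR minus AND

outputs = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

In the hidden space $(h_{\text{or}}, h_{\text{and}})$ the four corners land on three points, $(0,0)$, $(1,0)$, and $(1,1)$, and a single line separates the middle one from the other two.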
The general statement is the universal approximation theorem: one hidden layer of sufficient width, with almost any nonlinearity (any continuous one that is not a polynomial), approximates any continuous function on a closed, bounded region to whatever precision you want. Worth being honest about what that does and does not say. It says depth is not required for expressivity; one wide hidden layer is enough in principle. It does not say one layer is efficient. Many functions that a single huge layer can only represent at exponential width are cheap to represent with a few layers stacked. Depth is about economy, not possibility. The reason to have a hidden layer at all is XOR. The reason to have several is that it is usually cheaper.
The other half: the output is interpreted, not produced
Here is the move the rest of the series leans on. The network produces raw, unnormalized scores, call them logits. It never produces a probability. A probability is something the loss computes, by composing the logits with a link function (a sigmoid, a softmax, the identity) and a matching negative log likelihood. The same network body becomes a binary classifier, a multi-class classifier, a real-valued regressor, or a count model depending only on the head you bolt onto its outputs. Nothing about the body changes.
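To make the point concrete, here is a sketch that reads one fixed vector of logits through several links. The variable names are mine, and the exponential link for rates is an assumption about how a count head would usually be built (the Poisson convention), not something taken from scratchnn.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Whatever the network body produced. It is the same in every case below.
logits = [2.0, 0.0, -1.0]

as_class_probs = softmax(logits)                     # one of three classes
as_independent_probs = [sigmoid(z) for z in logits]  # three yes/no questions
as_real_values = list(logits)                        # regression: identity link
as_rates = [math.exp(z) for z in logits]             # counts: log link (assumed)
```

Only `as_class_probs` sums to one, and only because the softmax forces it to. The logits themselves carry no such commitment.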
One small fact that turns out to run deep: the gradient of every one of these canonical losses, with respect to the logits, is $\hat{p} - y$, the prediction minus the target (for least squares, with the usual factor of one half on the squared error). The residual. It looks like a coincidence the first time you see it across logistic regression and softmax and least squares. It is not. The next post is about why, and about treating the output head as a modeling choice in its own right.
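You do not have to take the residual claim on faith. Below is a small numerical check, a sketch of my own rather than the library's gradient checker, that compares a central finite difference against $\hat{p} - y$ for three heads. The test logits and targets are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def numeric_grad(loss, z, eps=1e-6):
    """Central finite difference of loss with respect to each logit."""
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grads.append((loss(up) - loss(down)) / (2 * eps))
    return grads

def bernoulli_nll(z, y):      # sigmoid link + binary cross-entropy
    p = sigmoid(z[0])
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_nll(z, y):    # softmax link + cross-entropy, y one-hot
    p = softmax(z)
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, p))

def half_squared_error(z, y):  # identity link + (1/2) squared error
    return 0.5 * (z[0] - y) ** 2

z_bin, y_bin = [0.7], 1.0
z_cat, y_cat = [2.0, 0.0, -1.0], [0.0, 1.0, 0.0]
z_reg, y_reg = [1.3], -0.4

# Each entry pairs the numerical gradient with the claimed residual.
checks = [
    (numeric_grad(lambda v: bernoulli_nll(v, y_bin), z_bin),
     [sigmoid(z_bin[0]) - y_bin]),
    (numeric_grad(lambda v: categorical_nll(v, y_cat), z_cat),
     [p - y for p, y in zip(softmax(z_cat), y_cat)]),
    (numeric_grad(lambda v: half_squared_error(v, y_reg), z_reg),
     [z_reg[0] - y_reg]),
]

max_gap = max(abs(n - a)
              for numeric, analytic in checks
              for n, a in zip(numeric, analytic))
```

`max_gap` should sit at finite-difference noise level. Three different losses, three different links, one gradient.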
The shape of the series
Two axes have already appeared. The architecture decides which functions are reachable and cheap (a single unit cannot do XOR; an MLP can). The output head decides what the outputs mean (the loss, the link function, the assumed distribution). Both are inductive biases. A third axis shows up later, once we start training real models and find that having the right architecture is necessary but not sufficient.
Everything from here is the same MLP with a sharper prior baked in at one of these axes: convolution for images, recurrence for sequences, attention for content-addressable lookup, and so on. The math we need (forward, backward, the chain rule applied locally) does not change. Only the layer types and the heads do.
The full version, with the hand-derived backward pass, the actual Layer and Network code, the numerical-stability details, and the gradient checker that keeps the math honest, is Chapter 1 of the book, Inductive Biases in Neural Networks.