A Network Computes Numbers. The Loss Decides What They Mean.
Function approximation, why one linear unit cannot learn XOR, and what a hidden layer actually buys. The opening of a from-scratch tour of inductive bias.
An attention head is a learned content-addressable lookup: a query matches keys and retrieves a value, much like dereferencing a pointer, except that the match is soft. Depth is how many lookups you can compose.
The trained pointer model reads exactly the right memory cell, provably. Its attention barely shows where. The gap, and the causal probe that closes it.
The simplest neural language model: embed the last N tokens, concatenate, predict the next. What you give up with a fixed window, and what you gain by dropping recurrence.
A recurrent network reuses one cell at every timestep and carries a state. That buys time-translation equivariance and unbounded reach in principle, and bills you in vanishing gradients.
When the only signal is a number at the end of a trajectory. How REINFORCE turns out to be the same gradient as classification, scaled by return, and where the theory tops out.
Choosing a loss is choosing a distribution for your output. Why a supervised network trained this way is a maximum-likelihood estimator, and why, with the matching output activation, the gradient is the residual.
A convolution is a bet about images: nearby pixels matter together, and a feature detector should fire anywhere. How to test whether a model is actually using that bet.
Part 3 of What Your RL Algorithm Actually Assumes: the architecture decides what kind of features can be learned, and that decision is a Bayesian prior over value functions.
A perspective on deep learning from applied mathematics, bridging the math with neural networks.
A free, open-source, interactive deep learning book that integrates the math with runnable code.
Learning fuzzy membership functions and inference rules automatically through gradient descent on soft circuits, instead of hand-crafting them.
The evolution of neural sequence prediction, and how it connects to classical methods.
The bias-data trade-off in sequential prediction: when to use CTW, n-grams, or neural language models.