Skip to main content

Bengio's Language Model: the Markov Assumption Made Architectural

The simplest neural language model is almost embarrassingly direct. Look at the last $N$ tokens. Embed each one through a shared lookup table. Concatenate the embeddings into one vector. Run it through a small MLP that outputs a distribution over the next token. That is the entire model, and it is roughly what Bengio and coauthors introduced in 2003, the architecture that held the field before recurrent networks and then transformers took over.

It is worth building because it makes one prior completely explicit.

A fixed window is the Markov assumption, in hardware

The model can only see $N$ tokens back. Anything older is gone, not because the optimizer failed to learn it but because there is no wire for it to travel on. This is the order-$N$ Markov assumption, turned from a modeling approximation into a structural fact of the architecture. Choose $N$ too small and the model cannot represent long-range agreement; choose it too large and the MLP head grows linearly in $N$ and starts to overfit.

A hybrid prior worth noticing

The interesting part is how Bengio handles position. The embedding table is shared across positions: the token “the” gets the same vector whether it sits at slot 1 or slot 5. That is position equivariance, the same weight-sharing idea as the convolution, applied to token identity. But then the concatenation step makes the head position sensitive. Token 1’s embedding always lands in the first slice of the input vector, token 2’s in the second, and so on, so the MLP can learn that the immediately preceding token matters more than the one $N$ back. You get word-identity invariance for free and still use word order. One layer is equivariant, the next is deliberately not.

What dropping recurrence buys

The reason this architecture is worth keeping in the toolbox, even now, is what it avoids. There is no recurrence, so there is no backpropagation through time and no vanishing-gradient problem (the subject of the next post). Every training example, a window and its next token, is independent of every other, so training parallelizes trivially and there is no per-sequence bookkeeping.

On character-level text (the opening of Alice in Wonderland, which I also use for the recurrent model so the comparison is fair) the fixed-context model reaches the same loss as the recurrent network in roughly a quarter of the epochs. Same endpoint, less effort, because there is no unrolled chain to push gradients back through.

What you give up is the cutoff. The model genuinely cannot use a token from a hundred positions back, where a recurrent network at least has a path for it (a thin and leaky one, as we will see). It is the right model when the context that matters is small and you know how small. When it is not, you reach for an architecture with unbounded reach, and pay for that reach in other ways.

The embedding and concatenation code, the parameter accounting, and the side-by-side Alice experiment against the recurrent model are in Chapter 4 of the book, Inductive Biases in Neural Networks.

Discussion