
Recurrence Is Weight Sharing Across Time, and It Costs You

A recurrent network is an MLP with one prior wired into it: the same small computation runs at every timestep, and a hidden state carries a running summary of everything seen so far. The cell is

$$h_t = \tanh(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b),$$

and the weights do not depend on $t$. One cell, reused down the whole sequence.
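
A minimal sketch of that reuse, in pure Python with plain lists. The function name `rnn_forward`, the toy sizes, and the weight values are made up for illustration; the only thing taken from the text is the cell equation itself.

```python
import math

def rnn_forward(xs, W_xh, W_hh, b, h0):
    """Run one tanh cell over a whole sequence and return every hidden state."""
    hs = []
    h = h0
    for x in xs:  # the loop variable changes; the weights never do
        h = [
            math.tanh(
                sum(W_xh[i][j] * x[j] for j in range(len(x)))
                + sum(W_hh[i][k] * h[k] for k in range(len(h)))
                + b[i]
            )
            for i in range(len(b))
        ]
        hs.append(h)
    return hs

# Hypothetical toy problem: 2 input features, 3 hidden units, 3 timesteps.
W_xh = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_hh = [[0.0, 0.4, -0.2], [0.3, 0.0, 0.5], [-0.1, 0.2, 0.0]]
b = [0.0, 0.1, -0.1]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

hs = rnn_forward(xs, W_xh, W_hh, b, h0=[0.0, 0.0, 0.0])
for t, h in enumerate(hs, start=1):
    print(t, [round(v, 4) for v in h])
```

Note that `W_xh`, `W_hh`, and `b` are created once, outside the loop. The sequence could be three steps or three thousand and the parameter count would not move.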

That is the same move the convolution made, with time in place of space. Weight sharing across positions gave translation equivariance for images; weight sharing across timesteps gives time-translation equivariance for sequences. Shift the input in time and, given the same starting state, the output shifts with it. And again the backward pass turns the reuse into a sum: the gradient for the shared weights accumulates over every timestep, the same += over a shared axis that showed up for convolution.
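
Written out, that sum looks like this. The notation is mine rather than the text's: I assume a scalar loss $L$ over a sequence of length $T$, write $a_t = W_{xh}\, x_t + W_{hh}\, h_{t-1} + b$ for the pre-activation, and let $\delta_t = \partial L / \partial a_t$ be the total gradient reaching it, with vectors as columns.

$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \delta_t\, h_{t-1}^{\top}, \qquad \frac{\partial L}{\partial W_{xh}} = \sum_{t=1}^{T} \delta_t\, x_t^{\top}, \qquad \frac{\partial L}{\partial b} = \sum_{t=1}^{T} \delta_t.$$

Each timestep contributes one term, and the terms add because every step reads the same parameters.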

What is new: the output is the next input

The convolution applied its kernel to fixed input positions. The recurrent cell feeds its own output back in as the next step’s state. The dependency is sequential and arbitrarily long, and that changes how you train it. You unroll the cell across the sequence and backpropagate through time. The only twist is that the gradient arriving at step $t$ has two sources, and you add them: what flowed back from this step’s own output, and what flowed back from the future through the next state. That merge is the whole of backpropagation through time. Everything else is the chain rule you already had.
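
Here is a sketch of that merge for a scalar cell, small enough to check by hand. Several things in it are assumptions made for illustration and are not from the text: the scalar weights `w_x`, `w_h`, `b`, the squared-error loss on every step's state, and the toy data. The backward loop carries `dh_future` from one step to the previous one and adds it to the gradient from the step's own output.

```python
import math

def forward(xs, w_x, w_h, b, h0=0.0):
    """Scalar tanh cell. hs[0] is the initial state, hs[t] the state after step t."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1] + b))
    return hs

def loss(hs, ys):
    """Assumed loss: squared error between each step's state and a target."""
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt(xs, ys, w_x, w_h, b, h0=0.0):
    """Backpropagation through time; returns gradients for w_x, w_h, b."""
    hs = forward(xs, w_x, w_h, b, h0)
    g_wx = g_wh = g_b = 0.0
    dh_future = 0.0                       # nothing flows back into the last step
    for t in range(len(xs), 0, -1):
        dh_output = hs[t] - ys[t - 1]     # source 1: this step's own output
        dh = dh_output + dh_future        # the merge: add the two sources
        da = dh * (1.0 - hs[t] ** 2)      # back through tanh
        g_wx += da * xs[t - 1]            # shared weights: accumulate over time
        g_wh += da * hs[t - 1]
        g_b += da
        dh_future = da * w_h              # source 2 for step t-1: via the state
    return g_wx, g_wh, g_b

# Toy data, chosen arbitrarily.
xs = [0.5, -1.0, 0.3, 0.8]
ys = [0.1, -0.2, 0.0, 0.4]
w_x, w_h, b = 0.7, -0.4, 0.05

g_wx, g_wh, g_b = bptt(xs, ys, w_x, w_h, b)

# Central-difference check on the recurrent weight.
eps = 1e-6
num_g_wh = (
    loss(forward(xs, w_x, w_h + eps, b), ys)
    - loss(forward(xs, w_x, w_h - eps, b), ys)
) / (2 * eps)
print(g_wh, num_g_wh)
```

The two printed numbers should agree to several decimal places. If you delete the `+ dh_future` term they stop agreeing, which is a quick way to convince yourself that the merge is doing real work.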

The price of the prior

Backprop through time is correct and fragile. Each step backward multiplies the gradient by a Jacobian, so the gradient that reaches an early timestep is a long product of them, governed by the spectrum of the recurrent matrix and the slope of the tanh, which is at most one and usually less. If that product shrinks, the gradient to early steps vanishes exponentially and the model gets almost no signal about long-range structure. If it grows, the gradient explodes and one bad sequence wrecks the weights.
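
A scalar cell makes the product easy to watch. In the sketch below each backward step multiplies by `w_h * (1 - h_t**2)`, which is the one-dimensional Jacobian of the state with respect to the previous state. The function name `backward_factors`, the weights, and the inputs are illustrative assumptions. The growing case is deliberately contrived: zero input and zero state keep the tanh slope at exactly one, so the product is just a power of `w_h`.

```python
import math

def backward_factors(xs, w_x, w_h, b, h0=0.0):
    """Return |d h_T / d h_t| for t = T-1, T-2, ..., 0 in a scalar tanh cell."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1] + b))
    factors = []
    prod = 1.0
    for t in range(len(xs), 0, -1):
        prod *= w_h * (1.0 - hs[t] ** 2)  # Jacobian d h_t / d h_{t-1}
        factors.append(abs(prod))
    return factors

# Shrinking: |w_h| < 1 and the tanh slope is at most 1, so every factor is < 1.
xs = [math.sin(0.3 * t) for t in range(50)]
shrinking = backward_factors(xs, w_x=1.0, w_h=0.9, b=0.0)

# Growing (contrived): the state sits at 0, where the slope is 1 and w_h = 2.
growing = backward_factors([0.0] * 20, w_x=1.0, w_h=2.0, b=0.0)

print("50 steps back, w_h = 0.9:", shrinking[-1])
print("20 steps back, w_h = 2.0:", growing[-1])
```

With a matrix-valued `W_hh` the same picture holds with norms in place of absolute values, which is where the spectrum of the recurrent matrix comes in.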

This is the vanishing-gradient problem, and it is worth seeing it as the bill for the inductive bias, not a bug. The architecture has a path from the distant past to the present, through the state. The learning dynamics make that path hard to use. The cell can in principle remember something from a hundred steps ago; in practice a vanilla recurrent network reaches maybe a few dozen.

What it looks like

Train a character-level recurrent network on the opening of Alice in Wonderland and the samples walk a recognizable arc. First random characters. Then a plausible character distribution: vowels and consonants alternate, spaces fall at about the right intervals. Then word-like fragments and short real words. Then text that is locally English and drifts in meaning across longer spans, because the long-range memory is exactly what the vanishing gradient denies it. The arc is the lesson. The absolute quality, at this scale and in pure Python, is modest, and I would rather say that plainly than dress it up.

The standard fixes are gated cells, the LSTM and friends, which add a path the gradient can travel without attenuation, and gradient clipping for the exploding side. The honest throughline is that the same Jacobian-product problem is what motivated gated recurrence and, eventually, attention, which throws out the sequential state entirely.
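
Of the two fixes, clipping is the one that fits in a few lines. The sketch below clips by global L2 norm, which is one common variant and not necessarily the one used in the book; the function name `clip_by_global_norm` and the threshold are assumptions for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # small enough: leave it alone
    scale = max_norm / norm
    return [g * scale for g in grads]  # same direction, shorter step

# An exploded gradient of norm 5 is pulled back to norm 1.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped)

# A well-behaved gradient passes through unchanged.
print(clip_by_global_norm([0.3, 0.4], max_norm=1.0))
```

Clipping keeps the direction of the update and caps its size, so one bad sequence can no longer wreck the weights. It does nothing for the vanishing side, which is why the gated cells are needed as well.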

The cell’s forward and backward-through-time code, the Jacobian argument for vanishing gradients, and the full Alice experiment are in Chapter 5 of the book, Inductive Biases in Neural Networks.
