RNN
Sequence data
- Data that must proceed in order: audio, strings, stock prices, etc.
- Sequence data easily violates the i.i.d. (independent and identically distributed) assumption.
- For example, “the dog bit the person” and “the person bit the dog” have entirely different data distributions, frequencies, and meanings.
- Changing the order or losing past information alters the probability distribution of the data.
- Predicting the future without past information or context is impossible.
Handling sequence data
Conditional probability is used to model the probability distribution of upcoming data based on previous sequence information.

$$P(X_1,\dots,X_t) = \prod_{s=1}^{t} P(X_s \mid X_{s-1},\dots,X_1)$$

If we want to use all past information to compute conditional probabilities, the joint distribution factorizes as above (by the chain rule of probability).
Typically, sequence data is handled as follows:

$$X_t \sim P(X_t \mid X_{t-1},\dots,X_1), \qquad X_{t+1} \sim P(X_{t+1} \mid X_t, X_{t-1},\dots,X_1)$$
Not all past information is needed. This varies heavily by domain.
For example, to predict a stock price for a company founded 30 years ago, you don’t need all data from day one. Usually about 5 years of data is used.
=> Truncating information is itself a skill.
Sequence data must be handled with variable lengths as shown above. A model that can handle variable-length inputs is needed.
Autoregressive model
There are cases where only a fixed window of the most recent $\tau$ values is used. This is called an AR($\tau$) (autoregressive) model.
- Choosing $\tau$ itself requires substantial domain knowledge.
- $\tau$ must be set shorter or longer depending on the task.
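As a concrete sketch (the coefficients, window size, and variable names here are illustrative assumptions, not from the notes), an AR(τ) model can be fit with ordinary least squares once each training example is reduced to a fixed-length window of the τ most recent values:

```python
import numpy as np

# Toy sketch: fit an AR(tau) model by least squares on a synthetic series.
rng = np.random.default_rng(0)
tau = 3
T = 200

# Synthetic sequence with a known autoregressive structure plus noise.
x = np.zeros(T)
for t in range(tau, T):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] - 0.2 * x[t - 3] + 0.1 * rng.standard_normal()

# Fixed-length design matrix: each row is (x_{t-1}, ..., x_{t-tau}).
X = np.stack([x[t - tau:t][::-1] for t in range(tau, T)])
y = x[tau:]

# Least-squares estimate of the AR coefficients.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef  # one-step-ahead predictions
```

Because the window length is fixed, every example has the same shape, so even a plain linear model can be applied directly.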
Latent autoregressive model

$$X_t \sim P(X_t \mid X_{t-1}, H_t), \qquad H_t = \mathrm{Net}_\theta(H_{t-1}, X_{t-1})$$
- To predict Xt, both Xt-1 and Ht are used.
- Ht (latent variable) contains information from Xt-2 all the way back to X1.
- Variable-length data is converted to fixed-length data. This makes it easier for models to process.
- Problem: how to encode Ht?
RNN (Recurrent Neural Network)
A model that learns sequence data patterns by repeatedly using the latent variable Ht from the latent autoregressive model through a neural network.
The network can be expressed mathematically as:

$$O_t = H_t W^{(2)} + b^{(2)}, \qquad H_t = \sigma(X_t W^{(1)} + b^{(1)})$$
- Xt: current time step’s sequence data
- Ht: latent variable up to the current time step
- W(1), W(2): weight matrices shared across all time steps.
This network can only handle the current time step's data. So the network is extended as follows:

$$O_t = H_t W^{(2)} + b^{(2)}, \qquad H_t = \sigma(X_t W_X^{(1)} + H_{t-1} W_H^{(1)} + b^{(1)})$$
- Wx(1): weight matrix combined with the current time step’s data
- WH(1): weight matrix combined with the previous time step’s latent variable
- Ht: newly computed latent variable. Copied and used to encode the next latent variable.
- Fixed weight matrices used across the entire network: Wx(1), WH(1), W(2)
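The forward pass described above can be sketched as a toy NumPy loop (the shapes, scales, and variable names are assumptions for illustration):

```python
import numpy as np

# Minimal RNN forward pass: H_t = tanh(X_t @ W_x + H_{t-1} @ W_H + b1),
# O_t = H_t @ W2 + b2, with the same weight matrices reused at every step.
rng = np.random.default_rng(0)
T, d_in, d_hid, d_out = 5, 4, 8, 2

W_x = rng.standard_normal((d_in, d_hid)) * 0.1   # combines current input X_t
W_H = rng.standard_normal((d_hid, d_hid)) * 0.1  # combines previous latent H_{t-1}
W2 = rng.standard_normal((d_hid, d_out)) * 0.1
b1 = np.zeros(d_hid)
b2 = np.zeros(d_out)

X = rng.standard_normal((T, d_in))  # one sequence of length T
H = np.zeros(d_hid)                 # initial latent variable H_0
outputs = []
for t in range(T):
    # New latent variable: carried forward to encode the next step.
    H = np.tanh(X[t] @ W_x + H @ W_H + b1)
    outputs.append(H @ W2 + b2)
outputs = np.array(outputs)  # shape (T, d_out)
```

Note that the same three weight matrices are used at every iteration; only the latent state `H` changes across time steps.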
BPTT (Backpropagation through time)
The backpropagation method for RNNs. Unrolling the recurrence $h_t = f(x_t, h_{t-1}, w_h)$ and differentiating with respect to the shared weight $w_h$ gives:

$$\partial_{w_h} h_t = \partial_{w_h} f(x_t, h_{t-1}, w_h) + \sum_{i=1}^{t-1} \left( \prod_{j=i+1}^{t} \partial_{h_{j-1}} f(x_j, h_{j-1}, w_h) \right) \partial_{w_h} f(x_i, h_{i-1}, w_h)$$

(figure: forward propagation shown in blue, gradient flow path in red)
As the sequence length grows, the product of Jacobian terms $\prod_{j=i+1}^{t} \partial_{h_{j-1}} f$ becomes unstable: if the factors' magnitudes are less than 1, the product shrinks toward zero (vanishing gradient); if they are greater than 1, it grows unboundedly (exploding gradient).
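The instability can be seen numerically with scalar stand-ins for the Jacobian factors (the step count and factor values are arbitrary choices for illustration):

```python
import numpy as np

# Repeated products of per-step factors: magnitude < 1 shrinks toward zero,
# magnitude > 1 blows up, mirroring vanishing/exploding gradients.
steps = 100
shrink = np.prod(np.full(steps, 0.9))  # |factor| < 1 -> vanishing gradient
grow = np.prod(np.full(steps, 1.1))    # |factor| > 1 -> exploding gradient
print(shrink)  # ~ 2.7e-5
print(grow)    # ~ 1.4e4
```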
Truncated BPTT
Computing gradients through every sequence step makes the product of partial-derivative terms very unstable, leading to vanishing (or exploding) gradients.

So we truncate the backward pass at an appropriate point.
For example, $H_t$ can be set to receive gradient information only from its own output $O_t$, cutting off the gradient flowing back from later time steps at that point.
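A rough sketch of the idea (toy NumPy code; the chunk size `k`, the squared-error loss on the hidden state, and all shapes are assumptions): the sequence is processed in chunks, gradients are backpropagated only within a chunk, and the hidden state entering a chunk is treated as a constant:

```python
import numpy as np

# Truncated BPTT sketch: gradient flow is cut at chunk boundaries.
rng = np.random.default_rng(0)
d = 4
T, k = 12, 4
W_H = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1

X = rng.standard_normal((T, d))
target = rng.standard_normal((T, d))

grad_WH = np.zeros_like(W_H)
H = np.zeros(d)
for start in range(0, T, k):
    chunk = range(start, min(start + k, T))
    # Forward through the chunk, caching states for the backward pass.
    Hs = [H]  # state entering the chunk: treated as a constant
    for t in chunk:
        Hs.append(np.tanh(X[t] @ W_x + Hs[-1] @ W_H))
    # Backward only inside the chunk (squared-error loss on each H_t).
    dH = np.zeros(d)
    for i, t in reversed(list(enumerate(chunk))):
        dH = dH + 2 * (Hs[i + 1] - target[t])  # local loss gradient at step t
        dpre = dH * (1 - Hs[i + 1] ** 2)       # through tanh
        grad_WH += np.outer(Hs[i], dpre)       # accumulate dL/dW_H
        dH = dpre @ W_H.T                      # flow to H_{t-1}; stops at chunk start
    H = Hs[-1]  # carry the state forward, detached from the next chunk
```

The key point is the last line: the state crosses the chunk boundary in the forward direction, but no gradient ever crosses it in the backward direction.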

But even this has limits, which is why LSTM and GRU were developed to address the problem.