https://deepgenerativemodels.github.io/ A Stanford course that the lecture referenced.
Generative Model
It’s not just about generating images and text.

Say we receive a set of dog images.
We can expect a generative model to learn a probability distribution $p(x)$.
- Generation: when sampling $x_{\text{new}} \sim p(x)$, $x_{\text{new}}$ should look like a dog.
- Density estimation: using $p(x)$ to determine whether an arbitrary input $x$ is a dog, not a dog, a cat, etc. (anomaly detection)
- In a strict sense, a generative model subsumes a discriminative model.
- A model that can yield probability values is called an explicit model.
- Unsupervised representation learning (feature learning): learning features in an unsupervised manner
- The professor found this debatable, but the Stanford lecture claims this is also something generative models aim for.
Basic Discrete Distributions
Some basic math prerequisites. This was also covered in Professor Im Sung-bin’s class, but worth reviewing.
Bernoulli Distribution
Bernoulli needs just 1 parameter.
- D = {Heads, Tails}
- P(X=Heads) = p, then P(X=Tails) = 1 - p
- Write: X ~ Ber(p)
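A minimal sketch of sampling from a Bernoulli with a single parameter (the value of $p$ here is illustrative, not from the lecture):

```python
import numpy as np

# A Bernoulli distribution over {Heads, Tails} needs just 1 parameter p.
p = 0.7  # P(X = Heads); illustrative value
rng = np.random.default_rng(0)

# Draw 10,000 coin flips: True = Heads, False = Tails.
samples = rng.random(10_000) < p
print(samples.mean())  # empirical frequency of Heads, close to p
```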
Categorical Distribution
Categorical needs m-1 parameters. If you know m-1 elements, the remaining one is determined automatically.
- D = {1, …, m}
- P(Y=i) = $p_i$, such that $\sum_{i=1}^{m} p_i = 1$
- Write: Y ~ Cat(p1, …, pm)
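The "m-1 parameters" point can be checked directly: fix the first m-1 probabilities and the last one is forced by the sum-to-one constraint. The specific probabilities below are illustrative:

```python
import numpy as np

# A categorical over m = 3 outcomes needs only m - 1 = 2 free parameters:
# the last probability is determined by the constraint sum(p) = 1.
free_params = [0.2, 0.5]                              # p_1, p_2 (illustrative)
p = np.array(free_params + [1.0 - sum(free_params)])  # p_3 is determined

rng = np.random.default_rng(0)
samples = rng.choice([1, 2, 3], size=10_000, p=p)
counts = np.bincount(samples, minlength=4)[1:] / 10_000
print(counts)  # empirical frequencies, close to [0.2, 0.5, 0.3]
```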
RGB

- number of cases = 256 x 256 x 256
- number of parameters = 256 x 256 x 256 - 1
- The number of parameters to represent a single RGB pixel is enormous. Obvious, but still.
Binary Image

- Assume a binary image with n pixels.
- $2^n$ states are needed.
- Sampling from $p(x_1, \dots, x_n)$ generates an image.
- Representing $p(x_1, \dots, x_n)$ requires $2^n - 1$ parameters.
The number of parameters is too large. Can we reduce it?
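The counting above is easy to verify: a joint distribution over $n$ binary pixels has $2^n$ outcomes, and one probability per outcome minus the sum-to-one constraint leaves $2^n - 1$ free parameters. A quick sketch (the function name is mine):

```python
# Number of free parameters for the full joint distribution over n binary
# pixels: one probability per state, minus 1 for the sum-to-one constraint.
def full_joint_params(n: int) -> int:
    return 2 ** n - 1

print(full_joint_params(4))    # a tiny 4-pixel image already needs 15
print(full_joint_params(784))  # a 28x28 binary image: astronomically many
```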
Structure Through Independence
Suppose the pixels $X_1, \dots, X_n$ in a binary image are independent. This doesn't really make sense: if all pixels were independent, the only representable images would be white noise. But let's assume it anyway.
The number of possible states remains $2^n$.
But the number of parameters for $p(x_1, \dots, x_n) = p(x_1)\,p(x_2)\cdots p(x_n)$ is just $n$. Each pixel needs only 1 parameter, and since they're all independent, the total is $n$.
Chain Rule
$p(x_1, \dots, x_n) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)\cdots p(x_n \mid x_1, \dots, x_{n-1})$
No assumptions needed — it’s a theorem. Think of it as starting from the fully dependent model.
Total number of parameters: $1 + 2 + 4 + \cdots + 2^{n-1} = 2^n - 1$. The chain rule by itself saves nothing; the reduction comes from the independence assumptions we layer on top of it.
Bayes’ Rule
$p(x \mid y) = \dfrac{p(y \mid x)\,p(x)}{p(y)}$
Conditional Independence
$x \perp y \mid z \;\iff\; p(x \mid y, z) = p(x \mid z)$: x and y are conditionally independent given z; the conditional of x given y and z equals the conditional of x given z alone.
Once z is given, x and y are independent, so when predicting x, knowing y adds nothing beyond z.
This theorem lets you drop independent variables from the conditional in the chain rule or other formulas. Using this, we’ll build a good model between the fully dependent and fully independent extremes.
Markov Assumption
Apply the Markov assumption to the chain rule. It's similar to the assumption in RNNs: the current state depends only on the immediately previous state. In the chain rule, terms that conditioned on all previous variables now reference only $x_{n-1}$ at step $n$:

$p(x_1, \dots, x_n) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2)\cdots p(x_n \mid x_{n-1})$

Total number of parameters: $1 + 2(n-1) = 2n - 1$.
Compared to $n$ from the fully independent case it's larger, but compared to $2^n - 1$ from the plain chain rule it's an exponential reduction.
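The three parameter counts can be tabulated side by side (the function and labels are mine, following the counting in the lecture):

```python
# Free-parameter counts over n binary pixels under the three assumptions:
# full dependence (plain chain rule), first-order Markov, full independence.
def param_counts(n: int) -> dict:
    return {
        "fully dependent (chain rule)": 2 ** n - 1,   # 1 + 2 + ... + 2^(n-1)
        "Markov": 2 * n - 1,                          # 1 + 2 * (n - 1)
        "fully independent": n,                       # 1 parameter per pixel
    }

for name, count in param_counts(10).items():
    print(f"{name}: {count}")
```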
Finding this kind of sweet spot in between is what an auto-regressive model does.
Auto-regressive Model

- Assume we use 28x28 binary images.
- Learning over $p(x) = p(x_1, \dots, x_{784})$, with $x \in \{0, 1\}^{784}$.
- How to parametrize $p(x)$?
- Use the chain rule to decompose the joint distribution.

$p(x_{1:784}) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_{1:2})\cdots p(x_{784} \mid x_{1:783})$
- This is called an autoregressive model.
- Using only the immediately previous information (like the Markov assumption) is also autoregressive.
- All random variables must be ordered.
- Performance can vary depending on the ordering.
- A model considering only the 1 previous step: AR(1) model
- A model considering the N previous steps: AR(N) model
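A classical time-series AR(1) process makes the idea concrete: each value is a function of only the single previous value plus noise. This is an illustrative sketch (the coefficient and noise scale are made up, not from the lecture):

```python
import numpy as np

# AR(1) sketch: x_t = phi * x_{t-1} + noise, so each step conditions only
# on the immediately previous step, exactly like the Markov assumption.
rng = np.random.default_rng(0)
phi = 0.8          # illustrative autoregression coefficient
x = np.zeros(200)  # x_0 = 0
for t in range(1, 200):
    x[t] = phi * x[t - 1] + rng.normal(scale=1.0)
print(x[:3])
```

An AR(N) model would simply replace `phi * x[t - 1]` with a weighted sum over the N previous values.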
NADE
Neural Autoregressive Density Estimator 

The i-th pixel is dependent on pixels 1 through i-1.
- The first pixel’s distribution depends on nothing.
- The second pixel’s distribution depends only on the first.
- The fifth pixel’s distribution depends on pixels 1 through 4.
- The i-th pixel depends on i-1 pixels.
- Input dimensions change, so the weights keep growing.
- The i-th input needs weights that can accept i-1 inputs.
- Using a mixture of Gaussians at the last layer can produce continuous random variables.
NADE is an explicit model. Because the joint factorizes via the chain rule, the exact probability of any input can be computed by multiplying the conditionals.
Implicit models can only do generation.
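A minimal NADE-style forward pass, assuming binary pixels and random (untrained) weights just to show the shapes and the chain-rule evaluation; the names `nade_log_prob`, `W`, `V`, `b`, `c` are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Untrained, randomly initialized parameters; D pixels, H hidden units.
rng = np.random.default_rng(0)
D, H = 8, 4
W = rng.normal(size=(H, D)) * 0.1  # shared input-to-hidden weights
c = np.zeros(H)                    # hidden bias
V = rng.normal(size=(D, H)) * 0.1  # per-pixel hidden-to-output weights
b = np.zeros(D)                    # per-pixel output bias

def nade_log_prob(x):
    """log p(x) via the chain rule: the i-th conditional sees only x_{<i}."""
    a = c.copy()   # running hidden pre-activation, reused across steps
    logp = 0.0
    for i in range(D):
        h = sigmoid(a)
        p_i = sigmoid(V[i] @ h + b[i])  # p(x_i = 1 | x_{<i})
        logp += np.log(p_i if x[i] == 1 else 1.0 - p_i)
        a += W[:, i] * x[i]             # fold x_i in for the next conditional
    return logp

x = rng.integers(0, 2, size=D)
print(nade_log_prob(x))  # an exact log-density: NADE is explicit
```

Reusing the running activation `a` instead of recomputing the hidden layer from scratch at every step is what keeps the forward pass efficient even though the i-th conditional sees i-1 inputs.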
Pixel RNN
Making an RNN auto-regressive.
The formula for an $n \times n$ RGB image:

$p(x) = \prod_{i=1}^{n^2} p(x_{i,R} \mid x_{<i})\; p(x_{i,G} \mid x_{<i}, x_{i,R})\; p(x_{i,B} \mid x_{<i}, x_{i,R}, x_{i,G})$
Two variants based on ordering: 
- Row LSTM
- Uses information from above
- Diagonal BiLSTM
- Uses all previous information