
Probability Theory

January 1, 2021
3 min read

These are topics I struggled with even when writing notes about likelihood last year. I’ve reorganized them here based on BoostCamp content.

Probability Theory

Deep learning is built on probability-based machine learning theory.

Probability Distributions

In the data space (X × Y), the probability distribution D is the distribution from which the data are sampled.

Since y is assumed to exist, this explanation is based on supervised learning with ground-truth labels.

Random Variables

Random variable = observable data in the data space.

  • Random variables are used when extracting data.
  • A probability distribution refers to the distribution from which random variables are drawn.

Types of Random Variables

Random variables are classified as discrete or continuous depending on the distribution D.

The classification is not determined by the data space itself. A random variable on an integer space is necessarily discrete, but a random variable on a real-number space can also be discrete: for example, one that can only take the values -0.5 and 0.5.

Discrete Random Variables

A discrete random variable is modeled by summing the probabilities of all possible cases; the function that assigns a probability to each case is called a probability mass function.
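A minimal sketch of an empirical probability mass function built from counts; the die draws below are hypothetical example data:

```python
from collections import Counter

# Hypothetical draws from a six-sided die.
draws = [1, 3, 3, 6, 2, 3, 5, 1, 6, 6]
counts = Counter(draws)
n = len(draws)

# P(X = x) = count(x) / N; summing over all possible cases gives 1.
pmf = {x: c / n for x, c in counts.items()}
assert abs(sum(pmf.values()) - 1.0) < 1e-12
print(pmf[3])  # 3 appeared 3 times out of 10, so P(X = 3) = 0.3
```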

Continuous Random Variables

A continuous random variable is modeled by integrating a density defined on the data space: P(X ∈ A) = ∫_A p(x) dx.

The density p(x) is the rate of change of the cumulative distribution function, p(x) = dF(x)/dx. It is not itself a probability!
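A small sketch illustrating this, using a hypothetical uniform distribution on [0, 0.5]: its density is 2, which is greater than 1, so a density cannot be a probability; probabilities come from integrating it, and the density equals the slope of the CDF.

```python
import numpy as np

# Hypothetical uniform distribution on [0, 0.5]: p(x) = 2 on the interval.
a, b = 0.0, 0.5
density = 1.0 / (b - a)          # p(x) = 2 > 1, so a density is not a probability
prob_total = density * (b - a)   # integrating the density over [0, 0.5] gives 1

# The density is the rate of change of the CDF F(x) = (x - a) / (b - a):
xs = np.linspace(a, b, 6)
F = (xs - a) / (b - a)
approx_density = np.diff(F) / np.diff(xs)  # numerical dF/dx, ≈ 2 everywhere
print(density, prob_total, approx_density)
```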

Joint Distribution

Given the full data X and y, we can posit a distribution, called the joint distribution. The joint distribution models probability distribution D.

In the figure above, the actual data points are blue dots. They look like continuous random variables, but if we posit the joint distribution as the red boxes, they can be treated as if they were discrete.

Whether the modeled joint distribution is discrete or continuous is unrelated to the type of the actual data distribution; it depends on how you choose to model it.

Because we’re dealing with data computationally, we just need to set the joint distribution P(X, y) appropriately to approximate the true distribution D.

Marginal Probability Distribution

P(x) is the marginal probability distribution over the input x; it carries no information about y. As the figure shows, you can obtain it by counting occurrences along x (discrete case) or by integrating y out (continuous case).

The marginal distribution for y can also be defined. That is, counting or integrating along y to define P(y).
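A minimal discrete sketch of marginalization, using a hypothetical table of joint counts: summing the joint distribution over one variable's axis yields the marginal of the other.

```python
import numpy as np

# Hypothetical joint counts over x in {0, 1, 2} (rows) and y in {0, 1} (cols).
joint_counts = np.array([[2, 1],
                         [3, 4],
                         [1, 1]])
P_xy = joint_counts / joint_counts.sum()  # joint distribution P(x, y)

# Marginals: "count along" the other variable, i.e. sum out its axis.
P_x = P_xy.sum(axis=1)  # P(x), with y summed out
P_y = P_xy.sum(axis=0)  # P(y), with x summed out
print(P_x, P_y)
```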

Conditional Probability Distribution

P(x|y) models the relationship between the input x and the output y. As shown, the conditional distribution can describe the distribution of x when y = 1.
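A sketch of conditioning on y = 1 with the same kind of hypothetical joint-count table: restrict to that slice of the joint distribution and renormalize.

```python
import numpy as np

# Hypothetical joint counts over x (rows) and y in {0, 1} (cols).
joint_counts = np.array([[2, 1],
                         [3, 4],
                         [1, 1]])
P_xy = joint_counts / joint_counts.sum()

# P(x | y=1) = P(x, y=1) / P(y=1): take the y=1 column and renormalize.
P_y1 = P_xy[:, 1].sum()           # P(y = 1)
P_x_given_y1 = P_xy[:, 1] / P_y1  # conditional distribution over x
print(P_x_given_y1)
```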

Conditional Probability and Machine Learning

P(y|x) = the probability that the answer is y for input variable x.

In logistic regression, the combination of a linear model and softmax is used to interpret patterns extracted from data as probabilities.

How to compute conditional probability P(y|x):

  • In classification, softmax(W*phi + b) is computed using the feature pattern phi(x) extracted from the data x, the weight matrix W, and the bias b.
  • It’s fine to write P(y|phi(x)) instead of P(y|x).
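The classification step above can be sketched as follows; the feature vector phi and the parameters W and b are hypothetical placeholders for whatever the model actually extracts and learns:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical extracted feature pattern phi(x) and classifier parameters.
phi = np.array([0.5, -1.2, 2.0])        # 3 features
W = np.array([[ 0.2, -0.1,  0.4],       # 4 classes x 3 features
              [ 0.0,  0.3, -0.2],
              [-0.5,  0.1,  0.0],
              [ 0.1,  0.0,  0.2]])
b = np.zeros(4)

p = softmax(W @ phi + b)  # P(y | phi(x)) over the 4 classes
print(p, p.argmax())      # a proper distribution: entries in (0, 1), summing to 1
```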

Deep learning:

  • NNs extract feature patterns phi from data.

Expectation

When analyzing data given a probability distribution, various statistical functionals can be computed.

Expectation is the representative statistic of data. It’s the mean. It’s also used to compute other statistical functionals from the probability distribution.

For continuous distributions, computed by integration; for discrete distributions, by summation.

Usage

Used to compute variance, kurtosis, covariance, etc.
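A small discrete sketch of both points: the expectation as a probability-weighted sum, then variance computed from that same expectation operator. The values and probabilities are hypothetical.

```python
import numpy as np

# Hypothetical discrete distribution: values x with probabilities p.
x = np.array([1.0, 2.0, 3.0])
p = np.array([0.2, 0.5, 0.3])

# Expectation: a weighted sum over the distribution (an integral in the
# continuous case).
mean = np.sum(x * p)               # E[X]

# Other statistical functionals reuse the expectation:
var = np.sum((x - mean) ** 2 * p)  # Var[X] = E[(X - E[X])^2]
print(mean, var)                   # 2.1 and 0.49
```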

Estimating Conditional Expectation in Regression

The conditional expectation E[y|x] coincides with the function f(x) that minimizes the expected L2 error E||y - f(x)||_2.

For robust estimation in regression, the median is used instead of the conditional expectation, since the median is far less sensitive to outliers than the mean.
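A numerical sketch of why, over hypothetical data containing one outlier: a grid search shows the squared (L2) error is minimized at the mean, which is dragged toward the outlier, while the absolute (L1) error is minimized at the median, which is not.

```python
import numpy as np

# Hypothetical observations with one large outlier.
y = np.array([1.0, 2.0, 3.0, 100.0])
cands = np.linspace(0, 100, 10001)  # candidate constant predictions

# Total squared and absolute error for each candidate (broadcast over y).
l2 = ((y - cands[:, None]) ** 2).sum(axis=1)
l1 = np.abs(y - cands[:, None]).sum(axis=1)

best_l2 = cands[np.argmin(l2)]
best_l1 = cands[np.argmin(l1)]
print(best_l2, np.mean(y))    # L2 minimizer = mean = 26.5, pulled by the outlier
print(best_l1, np.median(y))  # L1 is minimized on [2, 3], where the median 2.5 lies
```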

Monte Carlo Sampling

Most machine learning problems start without knowing the probability distribution.

That is, we need to compute the expectation using only data, and that’s where Monte Carlo sampling comes in.

The Monte Carlo estimate is E[f(x)] ≈ (1/N) Σ_{i=1}^{N} f(x_i), with the x_i sampled independently from P(x). Step by step:

  1. Substitute sampled data x into f.
  2. Compute the arithmetic mean of the sampled data.
  3. This value approximates the expectation.
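The steps above can be sketched directly; the choice of X ~ Uniform(0, 1) and f(x) = x² is a hypothetical example whose true expectation ∫₀¹ x² dx = 1/3 is known for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(X)] for X ~ Uniform(0, 1) with f(x) = x**2 (true value: 1/3).
x = rng.uniform(0.0, 1.0, size=100_000)  # 1) draw independent samples
estimate = np.mean(x ** 2)               # 2) substitute into f, take the mean
print(estimate)                          # 3) this approximates the expectation
```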

Monte Carlo works for both discrete and continuous cases.

Monte Carlo sampling requires independent draws.

  • Convergence is guaranteed by the law of large numbers.

Monte Carlo Sampling Example

The function f(x) = exp(-x²) has no elementary antiderivative, so its integral over [-1, 1] cannot be computed analytically. That's when Monte Carlo sampling is used.

  1. To put the integral in the form of a Monte Carlo expectation, divide it by the interval length 2: ∫_{-1}^{1} f(x) dx = 2 · E[f(X)] for X ~ Uniform(-1, 1), whose density is 1/2. Since integration has no concept of a "number of elements," the length of the integration range plays that role instead.
  2. Draw N points uniformly from [-1, 1], compute the arithmetic mean of f over them, and multiply by the interval length.
```python
import numpy as np

def mc_int(fun, low, high, sample_size=100, repeat=10):
    # Length of the integration interval; scales the sample mean into an integral.
    int_len = np.abs(high - low)
    stat = []
    for _ in range(repeat):
        # Draw uniform samples and average fun over them.
        x = np.random.uniform(low=low, high=high, size=sample_size)
        fun_x = fun(x)
        int_val = int_len * np.mean(fun_x)
        stat.append(int_val)
    # Mean and spread of the repeated estimates.
    return np.mean(stat), np.std(stat)

def f_x(x):
    return np.exp(-x**2)

print(mc_int(f_x, low=-1, high=1, sample_size=10000, repeat=100))
```