Overview

Optimization

August 10, 2021
6 min read

The BoostCamp instructor stressed the importance of knowing the terminology.

Yesterday (2021.08.09), while looking into the AAE from optional assignment 2, I was overwhelmed. It was faster to count the words I knew in a single sentence than the ones I didn’t. It was written in English, but it felt like an alien language…

I think that’s the downside of self-teaching and accumulating scattered knowledge through internships and graduation projects. Let me at least nail down the terms used in ML.

With that in mind, I’ve documented even the terms I already knew, as long as they were covered in class.

Introduction

Gradient Descent


First-order iterative optimization algorithm for finding a local minimum of a differentiable function.
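A minimal sketch of one such iteration on a toy quadratic (the function and constants here are made up for illustration, not from the lecture):

```python
# Gradient descent sketch: minimize f(w) = (w - 3)^2, whose derivative
# is f'(w) = 2 * (w - 3); the minimum is at w = 3.

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0    # initial parameter
lr = 0.1   # learning rate (a hyper parameter)
for _ in range(100):
    w = w - lr * grad(w)  # step against the gradient

print(round(w, 4))  # converges toward the local minimum at w = 3
```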

Hyper Parameter

Parameter: weights, bias, convolution weights, etc. — things that get updated during training. Hyper parameter: learning rate, network size, loss function type, etc. — things the developer sets manually.

Generalization

Typically, training error keeps decreasing as training goes on, while test error eventually starts to rise. The performance gap between train and test is called the generalization gap.

Good generalization performance = the model performs about as well on unseen data as it does on training data.

But good generalization performance doesn’t guarantee the model performs well, because a model that hasn’t properly learned even the training data could still have a small generalization gap.

Underfitting, Overfitting

Overfitting: the model works well on training data but performs poorly on unseen data. Underfitting: training hasn’t gone far enough; the model hasn’t properly learned even the training data.

Of course, the figure above only illustrates the concept. The overfitting example might actually be the desired result in some cases. Judge based on problem definition, domain knowledge, and other factors.

Cross-validation

K-fold validation. ref: https://blog.quantinsti.com/cross-validation-machine-learning-trading-models/

  1. Split training data into k parts.
  2. Use k-1 parts for training.
  3. Use the remaining 1 part for validation.

There’s usually no clue for hyper parameters! So cross validation is used to find the optimal hyper parameter combination.

Once optimal hyper parameters are found, fix them and train on all available training data.
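The k-fold split above can be sketched as plain index bookkeeping (a toy split with no model; the function name and sizes are illustrative):

```python
# Hand-rolled k-fold split: each fold serves once as validation
# while the remaining k-1 folds form the training set.

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) index pairs for k folds over n samples."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        val_idx = indices[i * fold_size:(i + 1) * fold_size]
        train_idx = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(val_idx))  # 8 2 on every fold
```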

Obviously, test data should never be used in training in any way. It’s essentially cheating. And cheating doesn’t even guarantee a good model.

Bias, Variance

ref: https://work.caltech.edu/telecourse

Think of it like shot groupings on a target.

Variance

  • How consistent the output is for a given input.
  • Lower = more consistent.
  • Higher = less consistent.

Bias

  • How far the output is from the desired value.

Bias and Variance Tradeoff

Assume noise is present in the training data. We can derive that what we are minimizing (cost) can be decomposed into three different parts: bias^2, variance, and noise.

The value I’m minimizing is a single number, but it consists of three components. And these three components have a trade-off relationship — when something decreases, something else increases.

So cost can be decomposed into these three terms. Bias and variance are typically said to have a trade-off.
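Written out, the standard form of this decomposition (assuming targets t = f(x) + ε with zero-mean noise ε, and an estimator f̂ trained on random data) is:

```latex
\mathbb{E}\left[(t - \hat{f})^2\right]
  = \underbrace{\left(f - \mathbb{E}[\hat{f}]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f} - \mathbb{E}[\hat{f}]\right)^2\right]}_{\text{variance}}
  + \underbrace{\mathrm{Var}(\epsilon)}_{\text{noise}}
```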

Bootstrapping

A term from statistics. Bootstrapping is any test or metric that uses random sampling with replacement.

For example, randomly drawing 80 samples from 100 training data points to create multiple model configurations for comparison.
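The "with replacement" part is the whole trick; a sketch of the 80-out-of-100 example (the seed is only for repeatability):

```python
import random

# Bootstrapping: draw samples *with replacement*, so the same
# data point can appear more than once in a resample.

random.seed(0)
data = list(range(100))                      # 100 training data points
bootstrap = [random.choice(data) for _ in range(80)]

print(len(bootstrap))                        # 80 draws
print(len(set(bootstrap)))                   # usually fewer unique points
```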

Bagging, Boosting

Bagging (Bootstrap aggregating)

  • e.g., Use bootstrapping to subsample training data into multiple sets. Different models produce different outputs from each set, and you use these outputs (extract statistics, do ensemble learning, etc.)

This typically outperforms training a single model on the entire training set once for a single result. A classic technique used in competitions like Kaggle.
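A toy sketch of the aggregate step, where each "model" is just the mean of its bootstrap resample (a stand-in for a real learner; all numbers are made up):

```python
import random

# Bagging sketch: fit several "models" on bootstrap resamples,
# then aggregate their outputs by averaging (the ensemble step).

random.seed(0)
data = [float(x) for x in range(1, 101)]     # toy targets, true mean 50.5

predictions = []
for _ in range(10):                          # 10 bootstrapped "models"
    resample = [random.choice(data) for _ in range(len(data))]
    predictions.append(sum(resample) / len(resample))  # each model's output

ensemble = sum(predictions) / len(predictions)         # aggregated prediction
print(round(ensemble, 1))                    # close to the true mean 50.5
```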

Boosting

Say 80 out of 100 data points classify well but the remaining 20 don’t. Create a separate model for those 20. Call these models weak learners.

Chain these weak learners sequentially to build one strong learner. In boosting, the weights of weak learners are trained sequentially.


Practical Gradient Descent Method

  • Stochastic gradient descent
    • Update parameters using one data point at a time.
  • Mini-batch gradient descent
    • Update parameters using a subset of data at a time.
  • Batch gradient descent
    • Use all data at once.
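The three variants differ only in how much data feeds each update; a sketch on a toy 1-D least-squares problem (data and constants are illustrative):

```python
# Update granularity sketch: batch_size = 1 gives SGD, a subset gives
# mini-batch GD, and batch_size = len(data) gives batch GD.

data = [(x, 2.0 * x) for x in range(8)]  # targets follow y = 2x

def grad(w, batch):
    """Gradient of mean squared error for the model f(x) = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w, lr, batch_size = 0.0, 0.01, 4         # mini-batch of 4
for _ in range(200):
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]   # one subset per update
        w -= lr * grad(w, batch)

print(round(w, 3))  # approaches the true slope 2.0
```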

Batch Size Matters

We … present numerical evidence that supports the view that large batch methods tend to converge to sharp minimizers of the training and testing functions. In contrast, small-batch methods consistently converge to flat minimizers… this is due to the inherent noise in the gradient estimation.

The paper states that large batch methods yield sharp minimizers, while small-batch methods yield flat minimizers. The explanation follows the graph below.

ref: https://arxiv.org/pdf/1609.04836.pdf (ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA)

Flat minimum: even if the test function sits somewhat apart from the training function, the loss changes little around the minimum, so performance carries over. High generalization performance! Sharp minimum: a small shift changes the loss a lot, so generalization performance is low.

Gradient Descent Methods

(Stochastic) Gradient Descent

The familiar basic gradient descent parameter update formula.

Problem: setting the learning rate is hard. Too small and training barely progresses; too large and training diverges.

Momentum

Parameters are updated while maintaining momentum (inertia). beta (the momentum coefficient) is a hyper parameter. A new accumulation is computed from beta, the current gradient, and the previous step’s accumulation. Unlike SGD, which discards everything from the previous step, this retains some of that information while updating the weights.
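The accumulation described above can be sketched on a toy quadratic (variable names mirror the description; the objective and constants are made up):

```python
# Momentum sketch:
#   accumulation_t = beta * accumulation_{t-1} + gradient_t
#   w_t            = w_{t-1} - lr * accumulation_t

def grad(w):
    return 2.0 * (w - 3.0)      # gradient of (w - 3)^2

w, lr, beta = 0.0, 0.05, 0.9    # beta is the momentum hyper parameter
accumulation = 0.0
for _ in range(300):
    accumulation = beta * accumulation + grad(w)
    w -= lr * accumulation      # update keeps some of the previous step

print(round(w, 3))  # converges toward w = 3, possibly overshooting on the way
```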

Nesterov Accelerated Gradient (NAG)

The overall formula structure is the same as momentum. The difference is that it pre-computes the gradient at the next step and uses this lookahead gradient to update the accumulation.

ref: https://golden.com/wiki/Nesterov_momentum

The difference between momentum and Nesterov momentum is illustrated above. Intuitively, momentum might not converge to the convergence point directly, swinging back and forth like a pendulum.

Nesterov uses the next step’s gradient, which has the effect of moving in only one direction. So NAG typically converges faster.
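One common formulation of that lookahead, as a sketch on the same toy quadratic as before (constants illustrative):

```python
# NAG sketch: evaluate the gradient at the "looked ahead" position
# w - lr * beta * accumulation, then update the accumulation with it.

def grad(w):
    return 2.0 * (w - 3.0)      # gradient of (w - 3)^2

w, lr, beta = 0.0, 0.05, 0.9
accumulation = 0.0
for _ in range(300):
    lookahead = w - lr * beta * accumulation   # peek at the next position
    accumulation = beta * accumulation + grad(lookahead)
    w -= lr * accumulation

print(round(w, 3))  # converges toward w = 3 with less pendulum swing
```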

Adagrad

Reflects how much each parameter has changed so far in the update. G_t is the sum of squared gradients. If a parameter has changed a lot, G_t is large, so the parameter changes less. If a parameter has changed little, G_t is small, so the parameter changes more.

epsilon is there to prevent division by zero.

Problem: G_t can grow indefinitely. If the denominator approaches infinity, the term converges to 0. The parameter stops updating.
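The G_t mechanism in one dimension, as a sketch (toy objective; constants illustrative):

```python
# Adagrad sketch: G accumulates squared gradients, and the effective
# step size lr / (sqrt(G) + epsilon) shrinks as G grows.

def grad(w):
    return 2.0 * (w - 3.0)      # gradient of (w - 3)^2

w, lr, eps = 0.0, 1.0, 1e-8
G = 0.0                          # running sum of squared gradients
for _ in range(2000):
    g = grad(w)
    G += g * g                   # G only ever grows -> steps only shrink
    w -= lr * g / ((G ** 0.5) + eps)

print(round(w, 3))  # converges toward w = 3
```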

Adadelta

Looks at gradient changes only within a window of a given size.

The problem is that g_t must hold as many entries as the model has parameters. Since each model parameter has its own gradient, a model like GPT-3 with 100 billion parameters would need g_t to store gradient info for 100 billion parameters across the window size.

This is solved using an exponential moving average (EMA). Seems like that’s what the gamma in the formula does…

No learning rate! = No hyper parameter left to adjust. So it’s rarely used in practice.

RMSprop

Not published as a paper. Geoff Hinton revealed it during a lecture. Papers that use RMSprop actually cite Geoff Hinton’s lecture link.
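RMSprop keeps Adagrad’s per-parameter scaling but swaps the ever-growing sum for an EMA of squared gradients, so old gradients are gradually forgotten. A sketch (toy objective; constants illustrative, and with a fixed learning rate it hovers near the minimum rather than settling exactly):

```python
# RMSprop sketch: scale the step by the square root of an EMA of
# squared gradients instead of Adagrad's unbounded running sum.

def grad(w):
    return 2.0 * (w - 3.0)      # gradient of (w - 3)^2

w, lr, gamma, eps = 0.0, 0.1, 0.9, 1e-8
ema = 0.0                        # EMA of squared gradients
for _ in range(500):
    g = grad(w)
    ema = gamma * ema + (1 - gamma) * g * g
    w -= lr * g / ((ema ** 0.5) + eps)

print(round(w, 2))  # hovers close to w = 3
```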

Adam (Adaptive Moment Estimation)

Combines past gradients (momentum) and squared gradients (Adagrad, RMSprop…). That is, it mixes momentum information with adaptive learning rate methods.

Tuning the 4 hyper parameters is also very important.

  • epsilon: a very small value
  • beta_1: for the momentum term
  • beta_2: for the EMA of squared gradients
  • eta: learning rate
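The combination can be sketched in one dimension (toy objective; constants illustrative, using the usual bias-corrected form):

```python
# Adam sketch: m is the momentum term (beta_1), v is the EMA of squared
# gradients (beta_2), both bias-corrected before the update.

def grad(w):
    return 2.0 * (w - 3.0)      # gradient of (w - 3)^2

w, lr, beta1, beta2, eps = 0.0, 0.01, 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 2001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # past gradients (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # squared gradients (adaptive)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / ((v_hat ** 0.5) + eps)

print(round(w, 2))  # ends up close to w = 3
```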

Regularization

The goal is to deliberately impede training so the model works well not just on training data but also on test data.

Early Stopping

Using validation data (not test data), determine the right point to stop.
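A common way to pick that point is a patience counter; a sketch with made-up validation losses:

```python
# Early-stopping sketch: stop when validation loss has not improved
# for `patience` consecutive epochs (the losses are made-up numbers).

val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.61, 0.65]

patience, best, wait, stop_epoch = 2, float("inf"), 0, None
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, wait = loss, 0       # improvement: reset the counter
    else:
        wait += 1                  # no improvement this epoch
        if wait >= patience:
            stop_epoch = epoch
            break

print(stop_epoch, best)  # stops at epoch 5 with best loss 0.55
```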

Parameter Norm Penalty

Prevent parameters from growing too large: add the parameter norm to the cost as a penalty.

Train in the direction that minimizes this total cost (data loss + penalty).

The instructor said something about making functions smoother within the function space (?). I didn’t quite get it…

Parameter norm penalty is also called weight decay.
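The effect of the penalty term in one dimension, as a sketch (toy objective; the penalty weight lambda is illustrative):

```python
# Weight decay sketch: add lam * w^2 to the cost, which contributes
# 2 * lam * w to the gradient and pulls the parameter toward zero.

def grad(w, lam):
    data_grad = 2.0 * (w - 3.0)        # gradient of the data term (w - 3)^2
    penalty_grad = 2.0 * lam * w       # gradient of lam * w^2
    return data_grad + penalty_grad

w, lr, lam = 0.0, 0.1, 0.5
for _ in range(200):
    w -= lr * grad(w, lam)

print(round(w, 3))  # settles at 2.0, below the unpenalized optimum 3.0
```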

Data Augmentation

Unlike traditional ML, DL and NNs benefit from more data. So generate as much data as possible.

Increase data by varying image size, rotation, cropping, etc. The transformations must be label-preserving, though.

Noise Robustness

Adding noise to input data and weights during training improves performance at test time. Not fully proven, but experimentally demonstrated.

Label Smoothing

During training, extract two training data points, mix them, and create new training data. Said to have the effect of smoothing the decision boundary.

A method that can improve model performance significantly!

ref: https://arxiv.org/pdf/1905.04899.pdf (CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features)

Mixup: mix two images and their labels. Cutout: remove a portion of an image. CutMix: unlike Mixup, cut and paste.
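Mixup’s mixing step can be sketched on toy data (flat lists standing in for images; the paper samples the ratio from a Beta distribution, but a plain uniform draw is used here for brevity):

```python
import random

# Mixup sketch: new_x = lam * x1 + (1 - lam) * x2, and the one-hot
# labels are mixed with the same ratio, giving a soft label.

random.seed(0)
x1, y1 = [1.0, 1.0, 1.0, 1.0], [1.0, 0.0]   # class 0 "image"
x2, y2 = [0.0, 0.0, 0.0, 0.0], [0.0, 1.0]   # class 1 "image"

lam = random.random()                        # mixing ratio
mixed_x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
mixed_y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]

print([round(v, 2) for v in mixed_y])        # soft label that sums to 1
```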

Dropout

Randomly set some neurons to 0. Interpreted as neurons acquiring more robust features. This is also unproven.
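A sketch of the usual "inverted" dropout formulation (probability and activations are made up; at test time nothing is dropped):

```python
import random

# Inverted dropout sketch: zero each activation with probability p at
# train time and scale survivors by 1/(1-p), keeping the expectation.

random.seed(0)
p = 0.5
activations = [1.0, 2.0, 3.0, 4.0]

dropped = [0.0 if random.random() < p else a / (1 - p) for a in activations]

print(dropped)  # some entries zeroed, survivors scaled by 2
```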

Batch Normalization

Normalize each layer’s activations using their mean and variance. As shown in the formula, subtract the mean and divide by the standard deviation to get the normalized values.
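A minimal per-batch sketch of the normalization (one unit’s activations across a mini-batch; the learnable scale and shift are omitted for brevity):

```python
# Batch normalization sketch: subtract the batch mean and divide by
# the batch standard deviation (epsilon guards against zero variance).

batch = [2.0, 4.0, 6.0, 8.0]                 # one unit's activations
eps = 1e-5

mean = sum(batch) / len(batch)               # batch mean: 5.0
var = sum((x - mean) ** 2 for x in batch) / len(batch)   # batch variance: 5.0
normed = [(x - mean) / ((var + eps) ** 0.5) for x in batch]

print(round(sum(normed) / len(normed), 6))   # ~0: zero mean after BN
```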

The paper interpreted this as reducing internal covariate shift, which improves performance, but there are several rebuttal papers…

What’s certain is that using BN significantly improves performance as the network gets deeper.


There are methods similar to BN (layer norm, instance norm, group norm). Use them as appropriate.
