
AutoML

November 28, 2021
4 min read

Data Engineering

  • Data Cleansing, Preprocessing
  • Feature Engineering
  • Select ML Algorithm
    • DL: Select Backbone Model
  • Set Hyperparameters
    • DL: Loss, Optimizer, Learning rate, batch size

Model architecture and hyperparameter selection are typically done by humans, who incorporate feedback from repeated train/evaluate cycles. AutoML's goal is to take humans out of this loop and automate it.

Definition

This formalizes AutoML's goal as described above. Given hyperparameters, ML algorithms, and data, HPO (Hyperparameter Optimization = AutoML) aims to find the hyperparameter configuration that minimizes the loss.
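The defining formula is not reproduced in this extract; the standard HPO formulation from the AutoML literature, consistent with the description above, is:

```latex
\lambda^{*} = \operatorname*{arg\,min}_{\lambda \in \Lambda}
  \mathcal{L}\left(A_{\lambda},\, \mathcal{D}_{\text{train}},\, \mathcal{D}_{\text{valid}}\right)
```

where Λ is the search space, A_λ is the ML algorithm instantiated with configuration λ, and 𝓛 is the validation loss.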

A Different Angle on Lightweight Models

  • Lightweight existing models
    • Pruning, Tensor decomposition
  • Find lightweight models through searching
    • NAS (Neural Architecture Search), AutoML

Viewed from this angle, AutoML (through NAS) is also a technique for finding lightweight models.

DL Model Configuration

Type

  • Categorical
    • Optimizer: Adam, SGD, AdamW …
    • Module: Conv, BottleNeck, InvertedResidual
  • Continuous
    • learning rate, regularizer param, …
  • Integer
    • Batch size, epochs

Conditional Configuration

The search space changes depending on the configuration.

  • The types and search spaces of optimizer parameters depend on which optimizer is chosen.
  • The sampled module (Vanilla conv, BottleNeck, InvertedResidual) determines its module-specific parameters and their search spaces.
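The conditional structure described above can be sketched in plain Python. This is an illustrative sketch only: the parameter names, ranges, and module list below are assumptions, not from the original.

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Sample one configuration from a conditional search space:
    which keys exist depends on earlier categorical choices."""
    cfg = {"optimizer": rng.choice(["Adam", "SGD", "AdamW"])}
    if cfg["optimizer"] == "SGD":
        # momentum only exists when SGD was chosen
        cfg["momentum"] = rng.uniform(0.0, 0.99)
    else:
        # Adam-family optimizers have their own parameters
        cfg["beta1"] = rng.uniform(0.8, 0.999)

    cfg["module"] = rng.choice(["VanillaConv", "BottleNeck", "InvertedResidual"])
    if cfg["module"] == "InvertedResidual":
        # expansion ratio only applies to InvertedResidual blocks
        cfg["expand_ratio"] = rng.choice([3, 4, 6])

    cfg["lr"] = 10 ** rng.uniform(-4, -1)          # continuous, log scale
    cfg["batch_size"] = rng.choice([32, 64, 128])  # integer-valued
    return cfg

print(sample_config(random.Random(0)))
```

Each call returns a dictionary whose set of keys varies with the sampled optimizer and module, which is exactly what makes the search space conditional.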

AutoML Pipeline

Similar to the HPO definition described earlier; the addition is the objective function f. The definition of f can vary: you might only want to reduce model size, only improve model performance, or optimize some mix of objectives.

Blackbox optimization is performed to maximize these objectives and find a new configuration λ.

Bayesian Optimization (BO)

This structures the blackbox optimization as shown above.

  • Surrogate function: A regression model that predicts f(λ). The more accurately it predicts, the better it can guide which λ to try next.
  • Acquisition function: Determines the next λ to try.

The process from the diagram, listed in order:

  1. Sample a configuration λ
  2. Train the DL model with that configuration
  3. Compute the objective f(λ). The resulting pair (λ, f(λ)) is an observation, the × marks in the diagram.
  4. Update the surrogate model, represented by the solid line and purple region in the diagram. e.g., a GP (Gaussian Process) model with posterior mean and posterior variance (uncertainty)
  5. Update the acquisition function, represented by the green region in the diagram. It looks at the surrogate model’s trend and proposes the best next λ.
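The five steps above can be sketched as a minimal loop. This is a toy illustration under stated assumptions: the surrogate below is a crude distance-weighted model standing in for a real GP, the acquisition is UCB, and the objective, constants, and grid are all made up for the example.

```python
import math
import random

def objective(lam: float) -> float:
    # Hypothetical cheap stand-in for "train + evaluate a model".
    return math.sin(3 * lam) + 0.5 * lam

def surrogate(lam, xs, ys):
    """Toy surrogate: distance-weighted mean of observed values, with an
    uncertainty that grows with distance to the nearest observation
    (a crude stand-in for a GP posterior mean/variance)."""
    ws = [math.exp(-10 * (lam - x) ** 2) for x in xs]
    mean = sum(w * y for w, y in zip(ws, ys)) / (sum(ws) + 1e-12)
    sigma = min(abs(lam - x) for x in xs)
    return mean, sigma

def acquisition(lam, xs, ys, kappa=2.0):
    # UCB: high predicted value (exploitation) + high uncertainty (exploration)
    mean, sigma = surrogate(lam, xs, ys)
    return mean + kappa * sigma

rng = random.Random(0)
xs = [rng.uniform(0.0, 2.0)]   # step 1: sample an initial configuration
ys = [objective(xs[0])]        # steps 2-3: train/evaluate -> observation
for _ in range(10):
    # steps 4-5: surrogate/acquisition are recomputed from (xs, ys);
    # the next configuration is the acquisition maximum over a grid
    grid = [i / 100 for i in range(201)]
    nxt = max(grid, key=lambda g: acquisition(g, xs, ys))
    xs.append(nxt)
    ys.append(objective(nxt))

print(round(max(ys), 3))  # best objective value found so far
```

The loop structure (observe, refit surrogate, maximize acquisition, evaluate) is the part that carries over to real BO; a production implementation would replace the toy surrogate with a proper GP.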

BO with GPR

Gaussian Process Regression

A method for modeling uncertainty. In the BO graph, only two points of the surrogate model are actually observed; everywhere else is uncertain. Using a GP, we can obtain predictive ranges for the values beyond those two known points.

Standard Regression Task

  • Set of train data: (X, Y)
  • Set of test data: (X*, Y*)
  • Y ≈ f(X) + e

GP’s Idea

  • Might the value Y* at a specific position be related to the already known X, Y, X*?
    • Regardless of whether the relationship is positive or negative
  • Let’s express the estimation of Y* from X, Y, X* through a kernel function K.

Informal Definition of GP

  • f(x): defined as a random variable for each input x, i.e., a distribution over the possible function values at x
    • Distribution of these random variables: multivariate Gaussian distribution

Expanding the definition of f(x) in GP terms:

  • Define a distribution over functions, and assume this distribution follows a multivariate Gaussian distribution.
    • = the function f follows a Gaussian process.

The formula above formalizes what was just explained.

Gaussian Identities apply here: the marginal and conditional of a Gaussian also follow a Gaussian.

This can be visualized as shown above. Viewing the original Gaussian from any conditional side, that conditional also follows a Gaussian.

What GP tells us is that given X*, X, f, we can compute the mean and variance of f*.
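The formulas referenced here are not reproduced in this extract; the standard GP identities consistent with the description are the joint prior and its conditional:

```latex
\begin{bmatrix} f \\ f_{*} \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},\;
\begin{bmatrix}
K(X, X) & K(X, X_{*}) \\
K(X_{*}, X) & K(X_{*}, X_{*})
\end{bmatrix}\right),
\qquad
f_{*} \mid X_{*}, X, f \sim \mathcal{N}\!\left(
K(X_{*}, X)\, K(X, X)^{-1} f,\;
K(X_{*}, X_{*}) - K(X_{*}, X)\, K(X, X)^{-1} K(X, X_{*})
\right)
```

The conditional on the right is exactly the mean and variance of f* described above.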

Surrogate Model

Using the concepts organized above, let’s look at the surrogate model in more detail.

  • def: A model that predicts the objective f(λ)
    • Uses observed values of f(λ) to predict the objective f(λ*) for a new λ*
  • Train the surrogate model and use it as a criterion for selecting the next good λ\lambda
  • Representative surrogate models
    • GPR (Gaussian Process Regression) model
      • mean: predicted f value, var: uncertainty

As observation data increases, uncertainty decreases and predictions fit the true function.

Acquisition Function

  • def: A function that uses the surrogate model’s output to determine which λ to try next
  • The formula is constructed to appropriately balance exploration and exploitation. The balance between them is determined heuristically.
    • Exploration: Explore uncertain regions
    • Exploitation: Explore the known best regions
  • The max point of the updated acquisition function is tried at the next iteration

The top graph is the surrogate model; the bottom graph is the acquisition function. Notice that the acquisition function drops to nearly zero at the already-observed points and rises in the regions around them: from an exploitation perspective, known points need no further exploration, while their promising neighborhoods (and regions of high uncertainty) are the best places to try next. This is how the acquisition function is constructed.

e.g., Upper Confidence Bound (UCB)

  • μ: posterior mean (= Exploitation)
  • σ: posterior standard deviation (= Exploration)
  • κ: balancing parameter
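The UCB formula itself (reconstructed here, since the original image is not part of this extract) is:

```latex
\alpha_{\text{UCB}}(\lambda) = \mu(\lambda) + \kappa \, \sigma(\lambda)
```

A large κ favors exploration (uncertain regions score high); a small κ favors exploitation (high predicted mean dominates).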

BO with TPE

GP’s problems:

  • Complexity: O(N³)
  • Difficult to apply when conditional and continuous/discrete parameters are mixed

The second issue causes the most difficulty in practice, which is why TPE is widely used nowadays.

TPE (Tree-structured Parzen Estimator) vs GPR difference:

  • GPR: Computes p(f|λ) (the posterior distribution)
  • TPE: Computes p(λ|f) (likelihood) and p(λ) (prior)
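A toy sketch of the TPE idea, under assumed details (the quantile split gamma, Gaussian Parzen windows with a fixed bandwidth, and uniform candidate sampling are all illustrative choices, not from the original): split past observations into a "good" group and a "bad" group by loss, fit a Parzen density to each, and propose the candidate maximizing l(λ) / g(λ).

```python
import math
import random

def parzen_density(x, points, bw=0.2):
    """Parzen (kernel density) estimate with Gaussian kernels."""
    return sum(
        math.exp(-((x - p) ** 2) / (2 * bw ** 2)) / (bw * math.sqrt(2 * math.pi))
        for p in points
    ) / len(points)

def tpe_suggest(observations, gamma=0.25, rng=None):
    """observations: list of (lambda, loss) pairs. Split at the gamma
    quantile, model l(lambda) from the good group and g(lambda) from the
    rest, and return the candidate maximizing l / g (which in TPE is
    proportional to Expected Improvement)."""
    rng = rng or random.Random(0)
    obs = sorted(observations, key=lambda t: t[1])  # lower loss = better
    n_good = max(1, int(gamma * len(obs)))
    good = [lam for lam, _ in obs[:n_good]]
    bad = [lam for lam, _ in obs[n_good:]] or good
    candidates = [rng.uniform(0.0, 1.0) for _ in range(100)]
    return max(candidates,
               key=lambda c: parzen_density(c, good) /
                             (parzen_density(c, bad) + 1e-12))

# Toy history: loss is smallest near lambda = 0.3
history = [(lam, (lam - 0.3) ** 2) for lam in
           [0.05, 0.2, 0.3, 0.5, 0.7, 0.9]]
print(tpe_suggest(history))
```

Because the densities are built per-parameter from sampled values rather than from a joint covariance matrix, this scheme handles mixed and conditional parameters far more naturally than GPR, which is the practical advantage noted above.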