Data Engineering
- Data Cleansing, Preprocessing
- Feature Engineering
- Select ML Algorithm
- DL: Select Backbone Model
- Set Hyperparameters
- DL: Loss, Optimizer, Learning rate, batch size
Model architecture and hyperparameter selection are typically done by humans, incorporating feedback from train/evaluate cycles. AutoML's goal is to remove humans from this loop and automate it.
Definition

This formalizes AutoML's goal as described above. Given a hyperparameter space $\Lambda$, an ML algorithm $A$, and data, HPO (Hyperparameter Optimization = AutoML) aims to find the hyperparameter configuration $\lambda^*$ that minimizes the loss:

$\lambda^* \in \arg\min_{\lambda \in \Lambda} L(A_\lambda;\, D_{\text{train}}, D_{\text{valid}})$
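As a concrete (toy) instance of this definition, the sketch below treats one train/evaluate cycle as a blackbox loss function and minimizes it by random search, the simplest HPO strategy. The function names and the loss surface are illustrative assumptions, not part of the notes:

```python
import random

# Hypothetical stand-in for one train/evaluate cycle: in practice this would
# train algorithm A with configuration lambda_ and return the validation loss.
def train_and_evaluate(lambda_):
    # Toy loss surface with its minimum near lr=0.01, batch_size=64.
    return (lambda_["lr"] - 0.01) ** 2 + ((lambda_["batch_size"] - 64) / 64) ** 2

def random_search(n_trials=50, seed=0):
    """Simplest HPO: sample configurations at random, keep the argmin of the loss."""
    rng = random.Random(seed)
    best_lambda, best_loss = None, float("inf")
    for _ in range(n_trials):
        lambda_ = {"lr": rng.uniform(1e-4, 1e-1),
                   "batch_size": rng.choice([16, 32, 64, 128])}
        loss = train_and_evaluate(lambda_)
        if loss < best_loss:
            best_lambda, best_loss = lambda_, loss
    return best_lambda, best_loss
```

Bayesian optimization, described later in these notes, replaces the blind random sampling with a model-guided choice of the next configuration.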
A Different Angle on Lightweight Models
- Lightweight existing models
  - Pruning, Tensor decomposition
- Find lightweight models through searching
  - NAS (Neural Architecture Search), AutoML
From this angle, AutoML (NAS) is a technique for finding lightweight models by search, rather than by compressing an existing model.
DL Model Configuration
Type
- Categorical
  - Optimizer: Adam, SGD, AdamW …
  - Module: Conv, BottleNeck, InvertedResidual
- Continuous
  - learning rate, regularizer param, …
- Integer
  - Batch size, epochs
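The three types above can be sketched as a single hypothetical search-space definition; the specific ranges below are illustrative assumptions, not prescribed by the notes:

```python
import random

# Hypothetical search space mixing the three configuration types.
SEARCH_SPACE = {
    # Categorical
    "optimizer": ["Adam", "SGD", "AdamW"],
    "module": ["Conv", "BottleNeck", "InvertedResidual"],
    # Continuous: (low, high); the learning rate is sampled log-uniformly
    "learning_rate": (1e-5, 1e-1),
    "weight_decay": (0.0, 1e-2),
    # Integer
    "batch_size": [16, 32, 64, 128],
    "epochs": (10, 200),
}

def sample_config(rng=random):
    """Draw one configuration from the space, respecting each parameter's type."""
    return {
        "optimizer": rng.choice(SEARCH_SPACE["optimizer"]),
        "module": rng.choice(SEARCH_SPACE["module"]),
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform over [1e-5, 1e-1]
        "weight_decay": rng.uniform(*SEARCH_SPACE["weight_decay"]),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "epochs": rng.randint(*SEARCH_SPACE["epochs"]),
    }
```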
Conditional Configuration
The search space changes depending on the configuration.
- The types and search space of optimizer parameters change depending on the optimizer.
- The sampled module (vanilla Conv, BottleNeck, InvertedResidual) determines the module-specific parameters and their search space.
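A minimal sketch of such a conditional search space: which child parameters even exist, and over what ranges, depends on the sampled parent value. The parameter names and ranges are illustrative assumptions:

```python
import random

def sample_optimizer(rng=random):
    """Child parameters depend on the sampled optimizer."""
    opt = rng.choice(["Adam", "SGD", "AdamW"])
    config = {"optimizer": opt}
    if opt in ("Adam", "AdamW"):
        # Adam-family parameters exist only when an Adam variant is chosen.
        config["beta1"] = rng.uniform(0.8, 0.999)
        config["beta2"] = rng.uniform(0.9, 0.9999)
        if opt == "AdamW":
            config["weight_decay"] = rng.uniform(0.0, 0.1)
    else:  # SGD
        config["momentum"] = rng.uniform(0.0, 0.99)
    return config

def sample_module(rng=random):
    """Module-specific parameters depend on the sampled module type."""
    mod = rng.choice(["Conv", "BottleNeck", "InvertedResidual"])
    config = {"module": mod}
    if mod == "InvertedResidual":
        config["expand_ratio"] = rng.choice([2, 4, 6])
    elif mod == "BottleNeck":
        config["reduction"] = rng.choice([2, 4])
    return config
```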
AutoML Pipeline

Similar to the HPO definition described earlier. The addition is the objective function $f$. The definition of $f$ can vary: you might only want to reduce model size, or only improve model performance, or optimize some mix of objectives.
Blackbox optimization is performed to maximize these objectives and find a new configuration $\lambda$.
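As a tiny illustration of "some mix of objectives", one hypothetical scalarization trades accuracy against parameter count; the function and the weight `alpha` are assumptions for illustration only:

```python
def objective(accuracy, n_params, alpha=1e-7):
    """Hypothetical mixed objective (higher is better):
    reward validation accuracy, penalize model size."""
    return accuracy - alpha * n_params
```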
Bayesian Optimization (BO)
This structures the blackbox optimization as shown above.
- Surrogate function: A regression model that predicts $f(\lambda)$. The more accurately it predicts, the better we can decide which $\lambda$ to try next.
- Acquisition function: Determines the next $\lambda$ to try.

The process from the diagram, listed in order:
- Sample the next configuration $\lambda$
- Train the DL model with that configuration
- Compute the objective $f(\lambda)$. The pair $(\lambda, f(\lambda))$ corresponds to an observation in the diagram.
- Update the surrogate model, represented by the solid line and purple region in the diagram. e.g., a GP (Gaussian Process) model with posterior mean and posterior variance (uncertainty)
- Update the acquisition function, represented by the green region in the diagram. It looks at the surrogate model's trend and proposes the most promising next $\lambda$.
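The steps above can be sketched end to end. This is a minimal 1-D example assuming a zero-mean GP surrogate with an RBF kernel and a UCB acquisition (both covered later in these notes); the objective is a toy function standing in for the train/evaluate step:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.3):
    """RBF kernel matrix between two 1-D point sets."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-6):
    """Posterior mean and std of a zero-mean GP at X_query."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = rbf_kernel(X_obs, X_query)
    mu = K_s.T @ np.linalg.solve(K, y_obs)                    # posterior mean
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)  # k(x, x) = 1 here
    return mu, np.sqrt(np.clip(var, 0.0, None))

def objective(x):
    """Toy blackbox objective to maximize (stand-in for train + evaluate)."""
    return np.sin(3 * x) + 0.5 * np.cos(5 * x)

def bayesian_optimization(n_iters=15, kappa=2.0):
    grid = np.linspace(0.0, 2.0, 200)                 # candidate configurations
    X_obs = np.array([0.3, 1.7])                      # initial observations
    y_obs = objective(X_obs)
    for _ in range(n_iters):
        mu, sigma = gp_posterior(X_obs, y_obs, grid)  # 4. update surrogate
        acq = mu + kappa * sigma                      # 5. update acquisition (UCB)
        x_next = grid[np.argmax(acq)]                 # 1. sample next lambda
        if np.any(np.isclose(X_obs, x_next)):
            break                                     # re-proposing a known point
        X_obs = np.append(X_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))   # 2-3. train + observe f(lambda)
    return X_obs[np.argmax(y_obs)], float(y_obs.max())
```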
BO with GPR
Gaussian Process Regression
A method for modeling uncertainty. In the BO graph, the surrogate model is pinned down only at the two observed points; everywhere else is uncertain. Using a GP, we can obtain a predictive range (mean and uncertainty) for the values beyond those two known points.
Standard regression task: given a set of train data $(X, Y)$, predict the outputs $Y_*$ for a set of test data $X_*$.
GP’s Idea
- Might the value $y_*$ at a specific position $x_*$ be related to the already known $Y$?
- Regardless of whether the relationship is positive or negative
- Let's express the estimation of $y_*$ from $Y$ through a kernel function $k(x, x_*)$.
Informal Definition of GP
- $f(x)$: defined as a random variable for each input $x$ = a distribution over the possible function values at $x$
- Distribution of these random variables: a multivariate Gaussian distribution
Expanding the definition of $f$ in GP terms:
- Define a distribution over functions, and assume this distribution follows a multivariate Gaussian distribution.
- $f \sim \mathcal{GP}(m(x), k(x, x'))$ = the function $f$ follows a Gaussian process.

The formula above formalizes what was just explained.
Gaussian Identities apply here: the marginal and conditional of a Gaussian also follow a Gaussian.

This can be visualized as shown above: conditioning the joint Gaussian on any one of its variables yields a distribution that is again Gaussian.

What GP tells us is that given the training data $(X, Y)$ and test inputs $X_*$, we can compute the mean and distribution (uncertainty) of $Y_*$.
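Written out, the standard zero-mean GP regression identities behind this statement are:

```latex
% Joint Gaussian over train outputs Y and test outputs Y_* (zero mean assumed)
\begin{bmatrix} Y \\ Y_* \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},
\begin{bmatrix} K(X,X) & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right)

% Conditioning on the observed Y gives a Gaussian over Y_*:
Y_* \mid X_*, X, Y \sim \mathcal{N}(\mu_*, \Sigma_*), \qquad
\mu_* = K(X_*,X)\,K(X,X)^{-1}\,Y, \qquad
\Sigma_* = K(X_*,X_*) - K(X_*,X)\,K(X,X)^{-1}\,K(X,X_*)
```

$\mu_*$ is the surrogate's prediction and the diagonal of $\Sigma_*$ is its uncertainty, which is exactly what the acquisition function consumes.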
Surrogate Model
Using the concepts organized above, let’s look at the surrogate model in more detail.
- def: A model that predicts the objective $f(\lambda)$
- Uses the observed pairs $(\lambda, f(\lambda))$ to predict the objective for a new $\lambda$
- Train the surrogate model and use it as the criterion for selecting the next good $\lambda$
- Representative surrogate model
  - GPR (Gaussian Process Regression) model
    - mean: predicted value, var: uncertainty

As observation data increases, uncertainty decreases and predictions fit the true function.
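A minimal numpy sketch of this shrinking-uncertainty behavior, assuming a zero-mean GP with an RBF kernel (so $k(x,x)=1$) and a toy sine function standing in for the true objective:

```python
import numpy as np

def rbf(A, B, ell=0.2):
    """RBF kernel matrix between two 1-D point sets."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)

def gpr_posterior(X_obs, y_obs, X_query, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP with an RBF kernel."""
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = rbf(X_obs, X_query)
    mu = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)  # k(x, x) = 1
    return mu, np.clip(var, 0.0, None)

def f(x):
    """Toy true function standing in for the objective."""
    return np.sin(2 * np.pi * x)

X_query = np.linspace(0.0, 1.0, 101)

# More observations -> lower average posterior variance (less uncertainty).
for n in (2, 8):
    X_obs = np.linspace(0.0, 1.0, n)
    mu, var = gpr_posterior(X_obs, f(X_obs), X_query)
    print(f"{n} observations: mean posterior variance = {var.mean():.4f}")
```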
Acquisition Function
- def: A function that uses the surrogate model's output to determine which $\lambda$ to try next
- The formula is constructed to appropriately balance exploration and exploitation; the balance between them is set heuristically.
  - Exploration: explore uncertain regions
  - Exploitation: focus on the known best regions
- The max point of the updated acquisition function is tried at the next iteration

The top graph is the surrogate model; the bottom graph is the acquisition function. The acquisition function drops to nearly zero at the already-observed points and peaks near the best observed values: from an exploitation perspective, points we already know need no further exploration, while the neighborhood of the best known values is the most promising place to look next. This is how the acquisition function is constructed.
e.g., Upper Confidence Bound (UCB)

$\alpha_{\text{UCB}}(\lambda) = \mu(\lambda) + \kappa\,\sigma(\lambda)$

- $\mu(\lambda)$: posterior mean (= exploitation)
- $\sigma(\lambda)$: posterior standard deviation, i.e., uncertainty (= exploration)
- $\kappa$: balancing parameter
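A minimal sketch of UCB over a few candidate configurations (the numbers are made up); note how $\kappa$ shifts the argmax between the best-known mean and the most uncertain point:

```python
import numpy as np

def ucb(mu, sigma, kappa=2.0):
    """UCB acquisition: posterior mean (exploitation) + kappa * uncertainty (exploration)."""
    return mu + kappa * sigma

# Surrogate predictions over three candidate configurations.
mu = np.array([0.2, 0.5, 0.4])       # posterior means
sigma = np.array([0.3, 0.05, 0.4])   # posterior uncertainties

best_exploit = int(np.argmax(ucb(mu, sigma, kappa=0.0)))   # pure exploitation -> index 1
best_balanced = int(np.argmax(ucb(mu, sigma, kappa=2.0)))  # uncertainty counts -> index 2
```

With $\kappa = 0$ the acquisition picks the candidate with the highest predicted mean; with $\kappa = 2$ the highly uncertain candidate wins, which is the exploration side of the trade-off.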
BO with TPE
GP's problems:
- Complexity: $O(N^3)$ in the number of observations
- Difficult to apply when conditional and continuous/discrete parameters are mixed

The second issue causes the most difficulty in practice, which is why TPE is widely used nowadays.
Difference between TPE (Tree-structured Parzen Estimator) and GPR:
- GPR: models $p(y \mid \lambda)$ (the posterior distribution)
- TPE: models $p(\lambda \mid y)$ (likelihood) and $p(y)$ (prior)
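A toy sketch of one TPE suggestion step on a single continuous parameter, assuming Gaussian Parzen windows with a fixed bandwidth (an assumption for simplicity). Real implementations (e.g., in Hyperopt or Optuna) additionally handle the conditional, tree-structured, mixed-type spaces discussed above, which is where the "tree-structured" in the name comes from:

```python
import math
import random

def parzen_density(x, centers, bandwidth=0.1):
    """Parzen (kernel density) estimate with Gaussian kernels."""
    if not centers:
        return 1.0  # degenerate fallback: behave like a flat prior
    z = sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers)
    return z / (len(centers) * bandwidth * math.sqrt(2 * math.pi))

def tpe_suggest(observations, gamma=0.25, n_candidates=24, rng=random):
    """One TPE step: split observations (x, loss) at the gamma-quantile of the
    loss, fit l(x) to the good group and g(x) to the bad group, and return the
    candidate maximizing the ratio l(x) / g(x)."""
    by_loss = sorted(observations, key=lambda t: t[1])
    n_good = max(1, int(gamma * len(by_loss)))
    good = [x for x, _ in by_loss[:n_good]]
    bad = [x for x, _ in by_loss[n_good:]]
    # Sample candidates from the "good" density, then score by the ratio.
    candidates = [rng.choice(good) + rng.gauss(0.0, 0.1) for _ in range(n_candidates)]
    return max(candidates,
               key=lambda x: parzen_density(x, good) / max(parzen_density(x, bad), 1e-12))
```

Because only kernel density estimates are fitted, each step is cheap relative to the $O(N^3)$ GP update, and categorical or conditional parameters just need a different density family per subtree.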