Goal
On-device AI
- Overcome the constraints of models deployed on personal devices
- Power usage
- RAM
- Storage
- Computing power
AI on the cloud
- Latency and throughput matter because many users' requests must be served
- e.g., time per request, number of requests processable per unit time
- Need to achieve lower latency and higher throughput with the same resources
Computation

- The number of operations performed by the model itself needs to be reduced.
- Since 2012, the computation required for model training has doubled every 3-4 months.
Efficient Architecture Design
A chart showing parameter counts and performance of released CNN models. The models in the chart aim to reduce parameter counts while improving performance. These are representative examples of Efficient Architecture Design: designing the model itself to be efficient.
AutoML; Neural Architecture Search (NAS)
A useful technique with many applications: instead of humans, algorithms are used to design or search for efficient models.
The controller is a model that proposes model architectures. The accuracy is computed with the proposed model, and this metric is used to retrain the controller. Repeating this process should help find efficient models.

Models obtained through AutoML/NAS are likely to be architectures that don’t match human intuition. Still, such models can outperform existing ones.
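The controller loop above can be caricatured with plain random search: propose an architecture, measure its quality, keep the best. This is a toy sketch, not a real NAS system; the search space, the `evaluate_accuracy` scoring stub, and the use of random sampling instead of an RL-trained controller are all assumptions made for illustration.

```python
import random

# Toy search space: depth and width choices for a hypothetical CNN.
SEARCH_SPACE = {"depth": [2, 4, 8], "width": [16, 32, 64]}

def evaluate_accuracy(arch):
    # Stand-in for "train the proposed model and measure accuracy".
    # Here we simply pretend deeper/wider scores better, with diminishing returns.
    return 1.0 - 1.0 / (arch["depth"] * arch["width"])

def random_search(n_trials=20, seed=0):
    rng = random.Random(seed)
    best_arch, best_acc = None, -1.0
    for _ in range(n_trials):
        arch = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        acc = evaluate_accuracy(arch)  # the feedback a real controller would learn from
        if acc > best_acc:
            best_arch, best_acc = arch, acc
    return best_arch, best_acc

best_arch, best_acc = random_search()
```

A real controller (RL- or evolution-based) replaces the random proposal step with a learned one, so good regions of the search space get sampled more often.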
Network Pruning
- Remove model parameters with low importance
- Topic: defining and finding good importance measures
- e.g., compute the L2 norm or loss gradient of a given parameter to measure importance
- Divided into structured/unstructured pruning
Structured pruning
- A collective term for techniques that prune parameters in groups
- Groups: channel, filter, layer, etc.
- Suitable for SW/HW optimized for dense computation since parameters are pruned in bulk
In the figure, the original network's parameters are grouped by channel, an importance factor is computed per group, and removing the groups with low importance produces a compressed model.
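A minimal numpy sketch of channel-level structured pruning: score each output channel of a conv weight by its L2 norm (one possible importance measure from the options above) and drop the lowest-scoring ones. The `keep_ratio` parameter and the L2-norm criterion are illustrative choices.

```python
import numpy as np

def prune_channels(weight, keep_ratio=0.5):
    """Structured pruning sketch: drop the output channels of a conv weight
    (out_ch, in_ch, kH, kW) with the smallest L2 norms."""
    norms = np.linalg.norm(weight.reshape(weight.shape[0], -1), axis=1)
    n_keep = max(1, int(weight.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(norms)[-n_keep:])  # indices of the most important channels
    return weight[keep], keep

w = np.random.randn(8, 3, 3, 3)
w_pruned, kept = prune_channels(w, keep_ratio=0.5)
print(w_pruned.shape)  # (4, 3, 3, 3): the tensor stays dense, just smaller
```

Because whole channels are removed, the result is a genuinely smaller dense tensor, which is why this variant maps well onto dense-computation HW.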
Unstructured pruning
- Prunes each parameter independently
- Since it’s applied individually, the internal matrices of the network become increasingly sparse as more pruning is performed
- Suitable for SW/HW optimized for sparse computation
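For contrast, a sketch of unstructured magnitude pruning: every weight is considered independently, and the smallest-magnitude entries are zeroed, leaving a sparse matrix of the original shape. The 70% sparsity target is an arbitrary example value.

```python
import numpy as np

def magnitude_prune(weight, sparsity=0.7):
    """Unstructured pruning sketch: zero out the individual weights with
    the smallest absolute values, leaving a sparse matrix."""
    k = int(weight.size * sparsity)
    if k == 0:
        return weight.copy()
    threshold = np.sort(np.abs(weight), axis=None)[k - 1]
    mask = np.abs(weight) > threshold
    return weight * mask

w = np.random.randn(64, 64)
w_sparse = magnitude_prune(w, sparsity=0.7)
print(np.mean(w_sparse == 0))  # ~0.7: most entries are now exactly zero
```

Note the shape is unchanged; only sparse-computation SW/HW can turn those zeros into actual speedups.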
Knowledge Distillation
Use a pre-trained large model (the teacher) to assist the training of a smaller network (the student). The student loss is the same as in regular network training: it is computed from the ground truth and the student's predictions. The distillation loss is where knowledge distillation happens: instead of the ground truth, soft labels from the teacher model are used to compute a loss against the student's predictions.
Soft targets (soft outputs) contain more information than the ground truth. In the figure above, each row shows the predicted label probabilities as colors. Unlike the ground truth, which commits to a single label per example, all label probabilities can be used, allowing the smaller network to be trained with more information.
Formula

L = (1 − α) · CE(σ(z_s), y) + α · T² · KLD(σ(z_t / T) ∥ σ(z_s / T))

- Left term: cross-entropy between the student network's output and the ground truth
- Right term: KLD loss between the teacher and student networks' softened outputs
- T: temperature hyperparameter. Dividing the logits by T > 1 softens the softmax output, making small probabilities larger and large ones smaller. ref
- α: weight balancing the two losses
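A numpy sketch of the combined loss above, assuming student/teacher logits `z_student`, `z_teacher` and one-hot labels; the default `alpha` and `T` values are illustrative, not prescribed.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(z_student, z_teacher, y_onehot, alpha=0.5, T=4.0):
    """(1 - alpha) * CE(student, ground truth) + alpha * T^2 * KLD(teacher || student)."""
    p_s = softmax(z_student)
    ce = -np.sum(y_onehot * np.log(p_s + 1e-12), axis=-1).mean()
    p_t_soft = softmax(z_teacher, T)   # soft targets from the teacher
    p_s_soft = softmax(z_student, T)
    kld = np.sum(p_t_soft * np.log((p_t_soft + 1e-12) / (p_s_soft + 1e-12)), axis=-1).mean()
    return (1 - alpha) * ce + alpha * T**2 * kld
```

The T² factor keeps the gradient magnitude of the distillation term roughly comparable to the cross-entropy term as T changes.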
Matrix/Tensor Decomposition
Mathematically involved, but even simple applications can yield effective results.
- Express a single tensor as a sum/product of smaller tensors
CP decomposition
Approximate a tensor as the sum of outer products of rank-1 vectors.
Network Quantization
Map fp32 to fp16 or int8. 
After applying quantization, dequantizing the computation results will produce errors compared to fp32 results. However, empirically, models are known to be robust to such errors.
- model size: decreases
- accuracy: slight decrease
- time: depends on HW; usually improves, but not guaranteed
- e.g., on certain HW, int8 quantization can actually be slower
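A sketch of symmetric int8 quantization in numpy, showing the fp32 → int8 → fp32 round trip and the small but nonzero error it leaves relative to the original values. The symmetric (zero-point-free) scheme is one of several; real toolchains also use asymmetric and per-channel variants.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric quantization sketch: map fp32 values to int8 with one scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
err = np.abs(x - x_hat).max()  # bounded by scale / 2: the rounding error
```

Each value is off by at most half a quantization step, which is the error the models are empirically robust to.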
Network compiling
- When the target system is fixed, compile the network itself for efficient computation.
- The technique with the biggest impact on speed.
- TensorRT (NVIDIA), TFLite (TensorFlow), TVM (Apache)
- Performance varies by combination of compile library, HW system, and model. 
With rule-based compiling, the computation graph is simplified according to defined rules, as shown above. Merging several operations into one in this way is called fusion.
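A classic rule-based fusion is folding an inference-mode BatchNorm into the preceding linear (or conv) layer, so two graph ops become one. A numpy sketch for the linear case, where the BN statistics and affine parameters are folded into new weights:

```python
import numpy as np

def fuse_linear_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fusion rule sketch: fold BatchNorm (inference mode) into the
    preceding linear layer, so y = BN(Wx + b) becomes y = W'x + b'."""
    s = gamma / np.sqrt(var + eps)
    W_fused = W * s[:, None]
    b_fused = (b - mean) * s + beta
    return W_fused, b_fused

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)
x = rng.normal(size=3)

y_ref = gamma * (W @ x + b - mean) / np.sqrt(var + 1e-5) + beta
Wf, bf = fuse_linear_bn(W, b, gamma, beta, mean, var)
print(np.allclose(Wf @ x + bf, y_ref))  # True: one op replaces two
```

The fused model computes exactly the same output with one matrix multiply instead of two graph nodes, which is where the compile-time speedup comes from.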

There are also attempts to find good fusions via AutoML. Considering all combinations of framework, HW, and system produces a large search space. So AutoML is used to find optimized fusions for the target system. Apache’s AutoTVM is said to serve that purpose.