CNN
ILSVRC
- ImageNet Large-Scale Visual Recognition Challenge
- Classification, Detection, Localization, Segmentation
- 1000 different categories

From 2015 onward, the error rate dropped below the human baseline. Apparently that "human" was Andrej Karpathy (who later worked at Tesla), measuring his own error rate on the task.
The CNN models described below were validated in this competition.
AlexNet

AlexNet split the network into two parallel streams because GPU memory was limited at the time, training across two separate GPUs.
An 11x11 filter is applied to the input. This was not a great choice: the receptive field gets wider, but the parameter count grows with the square of the filter size.
Key points
- ReLU
  - Various interpretations exist, but it is an effective activation function that keeps training stable even when the network gets deep.
- Training split across 2 GPUs
- LRN (Local Response Normalization)
  - Suppresses strongly activated regions relative to their neighbors.
  - Rarely used nowadays (whereas data augmentation has remained universal).
- Overlapping pooling
- Data augmentation
- Dropout

In 2021 these are standard techniques, but in 2012 they were novel, and they set the benchmark for deep learning.
ReLU

- Keeps the good properties of linear models.
- The gradient stays at 1 for positive inputs, no matter how large the activation gets, so gradients are preserved through deep stacks of layers.
- Gradient descent works well.
- Good generalization.
- Solves gradient vanishing.
  - Earlier activation functions (sigmoid, tanh) have gradients that approach 0 as the input grows large, which caused vanishing gradients in deep networks.
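As a minimal sketch in plain Python (function names are mine), comparing the gradients makes the vanishing point concrete: ReLU's gradient is exactly 1 for any positive input, however large, while sigmoid's gradient shrinks toward 0.

```python
import math

def relu(x):
    # ReLU: pass positive inputs through unchanged, zero out the rest.
    return max(0.0, x)

def relu_grad(x):
    # The gradient is 1 for any positive input, regardless of magnitude:
    # no saturation, so gradients survive deep stacks of layers.
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x):
    # Sigmoid's gradient s(x) * (1 - s(x)) approaches 0 as |x| grows:
    # this is the vanishing-gradient problem of the older activations.
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)
```

For example, at x = 10 the ReLU gradient is still 1.0, while the sigmoid gradient is already below 1e-4.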
VGGNet
- Published at ICLR 2015 (ILSVRC 2014 classification runner-up and localization winner)
- Uses only 3x3 convolution filters (with stride 1)
- 1x1 convolutions for the fully connected layers
  - Not used to reduce parameters, unlike the modern use of 1x1 filters.
- Dropout (p=0.5)

Receptive field: the size of the region received through the filter.
Passing through a 3x3 filter twice gives the same 5x5 receptive field as using a 5x5 filter once.
But the parameter counts differ: with C input and C output channels, two 3x3 layers need 2 x 3 x 3 x C x C = 18C^2 parameters, while one 5x5 layer needs 25C^2, about 1.4x more for the same receptive field.
Therefore, most CNN papers after this use 3x3 or 5x5 filters, and 7x7 at most. It shows how inefficient AlexNet's 11x11 was.
In summary, to widen the receptive field, stacking multiple smaller filters is much more parameter-efficient than one large filter.
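The comparison can be checked with a few lines of plain Python (bias terms ignored; the channel count C = 128 is an arbitrary example of mine, the ratio is the same for any C):

```python
def conv_params(k, c_in, c_out):
    # Parameter count of one k x k convolution layer, bias ignored.
    return k * k * c_in * c_out

C = 128                              # example channel count
two_3x3 = 2 * conv_params(3, C, C)   # 18 * C^2 = 294,912
one_5x5 = conv_params(5, C, C)       # 25 * C^2 = 409,600
```

The ratio 25/18 ≈ 1.39 is where the "almost 1.5x" figure comes from.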
GoogLeNet
I searched for the paper with a lowercase "l"; it turns out the L is capitalized (GoogLeNet, an homage to LeNet).
- ILSVRC 2014 winner
- NIN (Network in Network): a smaller network of similar shape is nested inside the overall network.
- Uses Inception blocks
Inception block

- Concatenates convolution results from multiple paths
- 1x1 convolution reduces the number of parameters
- 1x1 has the effect of reducing dimension in the channel direction
1x1 convolution
Standard convolution: 3x3 filters mapping 128 input channels to 128 output channels, so 128 filters of size 3x3x128 are needed.
That is 3 x 3 x 128 x 128 = 147,456 parameters.
1x1 convolution: first squeeze to a 32-channel intermediate output using 32 filters of size 1x1x128, then apply the 3x3 convolution back up to 128 channels using 128 filters of size 3x3x32. Combined: 1 x 1 x 128 x 32 + 3 x 3 x 32 x 128 = 4,096 + 36,864 = 40,960 parameters.
Effect of mixing in a 1x1 convolution: the parameter count is reduced (here by more than 3x) while the input shape, output shape, and receptive field size all stay the same.
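The same arithmetic in a short Python sketch (bias terms ignored; the helper name is mine):

```python
def conv_params(k, c_in, c_out):
    # Parameter count of one k x k convolution layer, bias ignored.
    return k * k * c_in * c_out

# Direct 3x3 convolution: 128 channels in, 128 channels out.
direct = conv_params(3, 128, 128)                              # 147,456

# Inception-style bottleneck: 1x1 squeeze to 32 channels, then 3x3 back to 128.
bottleneck = conv_params(1, 128, 32) + conv_params(3, 32, 128)  # 40,960
```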
CNN model intermediate comparison
Number of parameters:
- AlexNet (8 layers): 60M
- VGGNet (19 layers): ~144M
- GoogLeNet (22 layers): ~5M
ResNet
Written by the famous Kaiming He, apparently. I didn’t know who that was..
Background

- A 56-layer network cannot learn better than a 20-layer network, no matter how long you train it.
- This is not simple overfitting from an excessive number of parameters: the deeper network's training error is also higher. The ResNet paper calls this the degradation problem.
Skip connection
Learn only the residual: instead of the target mapping H(x), the layers learn F(x) = H(x) - x, and the block outputs F(x) + x.

Adding skip connections means more layers lead to better learning.

- Simple shortcut
  - Simply adds the input to the convolution output.
  - Commonly used.
- Projected shortcut
  - Passes the input through a 1x1 convolution (to match dimensions) before adding it to the convolution output.
  - Rarely used; needed only where the dimensions change.
- Batch normalization
  - In the ResNet paper, placed after the convolution and before the activation function (conv -> bn -> relu).
  - The placement is controversial: some report conv -> relu -> bn performs better, and others argue it is better not to use BN at all.
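A toy sketch of "learn only the residual" in plain Python (f stands in for the convolution layers; all names are mine): the block outputs x + f(x), so if the extra layers learn f ≈ 0 the whole block reduces to the identity, which is why extra depth stops hurting.

```python
def residual_block(x, f):
    # Simple shortcut: elementwise add of the input and the branch output f(x).
    return [xi + fi for xi, fi in zip(x, f(x))]

def zero_branch(x):
    # If the residual branch learns to output zeros, the block is an identity map.
    return [0.0 for _ in x]
```

For example, residual_block([1.0, 2.0], zero_branch) returns [1.0, 2.0]: the input passes through untouched.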
Bottleneck architecture

In the figure, the left is the original (plain) block and the right is the bottleneck block.
Adding 1x1 convolutions before and after the 3x3 convolution shrinks and then restores the channel dimension, matching the input and output dimensions while reducing the parameter count.
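Counting parameters makes the bottleneck's benefit concrete. A plain-Python sketch (bias terms ignored; the 256 -> 64 -> 256 channel numbers follow the ResNet paper's bottleneck example, while comparing against a single full-width 3x3 is my simplification):

```python
def conv_params(k, c_in, c_out):
    # Parameter count of one k x k convolution layer, bias ignored.
    return k * k * c_in * c_out

# A single 3x3 convolution at the full 256-channel width.
plain = conv_params(3, 256, 256)            # 589,824

# Bottleneck: 1x1 squeeze to 64, 3x3 at 64, 1x1 expand back to 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))    # 69,632
```

The bottleneck keeps the 256-channel input and output while using roughly 8x fewer parameters.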
CNN model comparison

Performance increased, and parameters decreased!
DenseNet
In ResNet's skip connection, adding the results mixes the values together. So DenseNet concatenates instead: the spatial dimensions are the same, so the feature maps can simply be stacked along the channel axis.
Very useful for simple classification!
Problem: the channel dimension grows with every concatenation (doubling each time if a layer's output is as wide as its input).
Dense Block, Transition Block
To solve this problem, the channel count is reduced periodically while constructing the network.
That is, when the channels grow through a Dense block, they are reduced through a Transition block.
Transition block = BN -> 1x1 conv -> 2x2 average pooling (stride 2)
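The channel bookkeeping above can be sketched in a few lines of Python (the growth rate k = 32 and the 0.5 compression factor are the typical DenseNet-BC values; the function names are mine):

```python
def dense_block(channels, num_layers, growth_rate):
    # Each layer outputs growth_rate channels and concatenates them onto its
    # input, so the channel count grows linearly within the block.
    return channels + num_layers * growth_rate

def transition_block(channels, compression=0.5):
    # BN -> 1x1 conv -> 2x2 average pooling.  Only the 1x1 conv changes the
    # channel count; the pooling halves the spatial resolution instead.
    return int(channels * compression)

c = 64                     # channels entering the first block
c = dense_block(c, 6, 32)  # 64 + 6 * 32 = 256
c = transition_block(c)    # reduced back down to 128
```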