
Convolution

August 11, 2021
2 min read


In formulas
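The formula referred to here appears to have been an image in the original post; the discrete 2D convolution used in CNNs (implemented as cross-correlation in most frameworks) can be written as:

```latex
O(i, j) = \sum_{m}\sum_{n} I(i+m,\; j+n)\, K(m, n)
```

where $I$ is the input image, $K$ is the kernel, and $O(i,j)$ is the output at position $(i, j)$: a sum of input values weighted by the kernel, computed at every spatial position.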

Role

Convolution can extract desired features from the input, depending on the kernel values.

For example, suppose you use a (3,3) kernel where all values are 1/9. The convolution then averages each 3×3 neighborhood, i.e., it blurs the image.
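A minimal NumPy sketch of this averaging convolution (a hand-rolled valid cross-correlation, not a library call):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (the 'convolution' used in CNNs)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Weighted sum of the kernel-sized patch at (i, j)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A (3,3) kernel of all 1/9 averages each 3x3 neighborhood: a blur filter.
avg_kernel = np.full((3, 3), 1.0 / 9.0)
image = np.arange(25, dtype=float).reshape(5, 5)
blurred = conv2d(image, avg_kernel)
print(blurred.shape)  # (3, 3): a 3x3 kernel shrinks a 5x5 input by 2 per side
```

Note that without padding the output is smaller than the input; padding (discussed below) fixes that.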

Tensor

Consider an RGB image with 3 channels. To apply a (5,5) filter to this image, the filter must also have 3 channels, i.e., it is really a (5,5,3) filter.


For example, applying 4 (5,5,3) filters to an RGB image produces 4 (28,28) feature maps, each with 1 channel, which stack into a (28,28,4) output.
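A minimal sketch of the shapes involved, assuming a (32,32,3) input (consistent with the (28,28) output above, since 32 − 5 + 1 = 28):

```python
import numpy as np

def conv2d_multi(image, filters):
    """Apply a bank of filters; each filter spans all input channels."""
    n_f, kh, kw, _ = filters.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow, n_f))
    for f in range(n_f):
        for i in range(oh):
            for j in range(ow):
                # Sum over height, width, AND channels -> one number
                out[i, j, f] = np.sum(image[i:i + kh, j:j + kw, :] * filters[f])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))     # RGB image, assumed 32x32
filters = rng.standard_normal((4, 5, 5, 3))  # 4 filters, each (5,5,3)
features = conv2d_multi(image, filters)
print(features.shape)  # (28, 28, 4): one output channel per filter
```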

Stack of convolution

As in an MLP, convolution layers are stacked by passing each layer's output through a non-linear activation function (e.g., ReLU).
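A minimal sketch of such a stack on a single-channel input; the non-linearity matters because without it, two stacked convolutions would collapse into a single (larger) linear convolution:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D cross-correlation on a single-channel input."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k) for j in range(ow)]
                     for i in range(oh)])

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 10))
k1 = rng.standard_normal((3, 3))
k2 = rng.standard_normal((3, 3))

h = relu(conv2d(x, k1))  # first layer + non-linearity: (8, 8)
y = conv2d(h, k2)        # second layer: (6, 6)
print(y.shape)
```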

Convolution and Neural networks

The most classic CNN structure consists of two stages:

  • Convolution and pooling layers: feature extraction
  • Fully connected layers: decision making (e.g., classification, regression)

The trend nowadays is to reduce the fully connected layers, because a smaller number of parameters makes training easier and improves generalization performance.

Stride

The kernel moves by the stride amount while performing convolution. For a 1D input the stride is a single number; for a 2D input, a stride can be given per spatial dimension.

Padding

At the edges of the image the kernel sticks out, so convolution cannot be performed there. Padding fills the border with arbitrary values so that the edges can be convolved too; e.g., zero padding fills the padded area with 0.

With padding, the spatial dimensions of input and output can be matched.
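The standard output-size formula makes this concrete; a small helper (names are my own) with the common cases:

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output size along one spatial dimension: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# "Same" padding for a 3x3 kernel at stride 1 is padding=1: size is preserved.
print(conv_output_size(32, 3, padding=1, stride=1))  # 32
# Without padding the output shrinks by (kernel_size - 1).
print(conv_output_size(32, 3, padding=0, stride=1))  # 30
```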


Counting parameters

Parameters of a convolution operation = parameters of the kernel

Example: padding 1, stride 1, a 3×3 kernel, an input with 128 channels, and an output with 64 channels.

  1. We say 3x3 kernel, but as mentioned earlier, the kernel’s channel matches the input’s channel.
  2. So we use a (3,3,128) kernel.
  3. Convolving the channel-matched kernel with the input always produces 1 channel.
  4. The output has 64 channels.
  5. Therefore, 64 (3,3,128) kernels must exist.

Getting an approximate sense of the parameter count through this process is important!
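The counting in the steps above can be sketched as a one-liner (note that padding and stride do not affect the parameter count at all):

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=False):
    """Parameters of a conv layer = kernel volume x number of kernels."""
    p = kernel_h * kernel_w * in_channels * out_channels
    # Bias adds one parameter per output channel; counts like the one in
    # the text usually omit it.
    return p + (out_channels if bias else 0)

# The example above: 64 kernels of shape (3, 3, 128).
print(conv_params(3, 3, 128, 64))  # 73728
```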

AlexNet

The parameter counts of convolution layers and dense layers are vastly different! The reasons:

  • Convolution shares the same weights through the kernel.
    • The same kernel is used regardless of where the element is in the input image.
  • Dense layers have different weights for every node, as we know.
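The gap can be seen from AlexNet's own numbers: its first convolution layer has 96 kernels of shape (11, 11, 3), while its fully connected layers map 4096 units to 4096 units.

```python
# First conv layer of AlexNet: 96 kernels of shape (11, 11, 3).
# The same 96 kernels are reused at every spatial position.
conv1 = 11 * 11 * 3 * 96

# One of AlexNet's 4096 -> 4096 dense layers: a separate weight
# for every input-output connection.
dense = 4096 * 4096

print(conv1)           # 34848
print(dense)           # 16777216
print(dense // conv1)  # the dense layer is roughly 480x larger
```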

1x1 convolution

A 1x1 convolution cannot see a spatial region. Obviously so: it is a kernel that convolves only a 1x1 area at a time, so its receptive field is a single pixel (across all input channels).

But it can serve the following purposes:

  • Channel (dimension) reduction
  • Reducing the parameter count while increasing depth
    • e.g., bottleneck architectures
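A parameter-count comparison makes the bottleneck idea concrete. The channel numbers below (256 → 64 → 256, as in ResNet-style bottlenecks) are illustrative:

```python
def conv_params(kh, kw, c_in, c_out):
    """Parameters of a conv layer = kernel volume x number of kernels."""
    return kh * kw * c_in * c_out

# Direct 3x3 convolution from 256 to 256 channels:
direct = conv_params(3, 3, 256, 256)

# Bottleneck: 1x1 reduce to 64 channels, 3x3 at 64 channels,
# then 1x1 expand back to 256 channels.
bottleneck = (conv_params(1, 1, 256, 64)
              + conv_params(3, 3, 64, 64)
              + conv_params(1, 1, 64, 256))

print(direct)      # 589824
print(bottleneck)  # 69632: roughly an 8x reduction for the same in/out shape
```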