CNN
The fully connected layer in an MLP has a very large weight matrix: every input unit is connected to every output unit.

A CNN, on the other hand, uses a small fixed weight vector called a kernel, which is shared across all positions of the input.

- The same kernel V is applied at every position i.
- It slides across x one position at a time and is applied at each location.
- Like the fully connected layer, the convolution operation (excluding the activation function) is a linear transformation.
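As a sketch of the shared sliding kernel (the arrays and names here are illustrative, not from the original; this uses the no-flip form that CNN libraries actually compute):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input signal
V = np.array([1.0, 0.0, -1.0])           # the same kernel V is reused at every position i

# Slide V across x one position at a time, stopping at the boundary.
out = np.array([np.dot(V, x[i:i + len(V)]) for i in range(len(x) - len(V) + 1)])
print(out)  # each output value is produced by the same weights V
```

Because every output is a dot product with the same V, the whole operation is linear in x.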
Formulas
The formulas for the continuous and discrete cases are as follows.

[f * g](x) = ∫ f(z) g(x − z) dz   (continuous)
[f * g](i) = Σ_a f(a) g(i − a)   (discrete)

Convolution locally amplifies or attenuates a signal to extract or filter information.
Cross-correlation
Cross-correlation is the convolution operation with the minus sign replaced by a plus: [f ⋆ g](x) = ∫ f(z) g(x + z) dz. In practice, CNN implementations compute cross-correlation. By tradition it is still called convolution in deep learning, but strictly speaking they are different operations.
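The difference between the two is just a flip of one argument, which numpy makes easy to check (example values are illustrative):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])      # kernel
g = np.array([0.0, 1.0, 0.5])      # signal

conv = np.convolve(f, g)                 # true convolution: one input is flipped
xcorr = np.correlate(f, g, mode='full')  # cross-correlation: no flip

# Cross-correlating with the flipped kernel recovers the convolution.
print(np.allclose(conv, np.correlate(f, g[::-1], mode='full')))  # True
```

For learned kernels the flip does not matter — the network simply learns the flipped weights — which is why frameworks use the cheaper cross-correlation form.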

Convolution operation
Translation invariant: the kernel stays the same as it moves within the domain, so the same weights are reused at every position. Also, the kernel is applied only locally to the signal.
Convolution example in images
Interactive demo: https://setosa.io/ev/image-kernels/
Multi-dimensional convolution formulas

In two dimensions (in the cross-correlation form actually implemented):

[f ⋆ g](i, j) = Σ_p Σ_q f(p, q) g(i + p, j + q)

Applying convolution

- f is the kernel, g is the input.
- The coordinates for the input are (i, j).
- In the example, p and q each range over 0–1. That is, the ranges of p and q pair each element of the kernel with an element of the input patch.
- Each pair is multiplied element-wise and summed.
- This is repeated without exceeding the input boundaries.
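The steps above can be sketched directly as loops over (i, j) and the kernel indices (p, q); the function name and example values are illustrative:

```python
import numpy as np

def conv2d_valid(g, f):
    """Cross-correlation as used in CNNs: for each output position (i, j),
    pair each kernel element f[p, q] with input element g[i+p, j+q],
    multiply element-wise, and sum. Stops at the input boundary."""
    H, W = g.shape
    KH, KW = f.shape
    OH, OW = H - KH + 1, W - KW + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            out[i, j] = np.sum(f * g[i:i + KH, j:j + KW])
    return out

g = np.arange(9.0).reshape(3, 3)        # input, coordinates (i, j)
f = np.array([[1.0, 0.0], [0.0, 1.0]])  # 2x2 kernel: p and q each range over 0-1
print(conv2d_valid(g, f))
```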
Estimating convolution output size

- Input size = (H, W)
- Kernel size = (KH, KW)
- Output size = (OH, OW) = (H − KH + 1, W − KW + 1), for stride 1 with no padding
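The count follows from how many positions the kernel can occupy before hitting the boundary; a minimal check (assuming stride 1, no padding; the function name is illustrative):

```python
def conv_output_size(H, W, KH, KW):
    # Valid convolution, stride 1: there are H - KH + 1 vertical
    # and W - KW + 1 horizontal positions for the kernel.
    return H - KH + 1, W - KW + 1

print(conv_output_size(5, 5, 3, 3))    # (3, 3)
print(conv_output_size(28, 28, 5, 5))  # (24, 24)
```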
2D convolution
From 3 dimensions onward, it is called a tensor, not a matrix.

When a 2D input comes in with 3 channels, convolution is performed as shown above. A kernel is created for each channel, convolution is performed between each channel's kernel and that channel's 2D input, and then all the per-channel results are summed.
This is illustrated as follows.

A 3D kernel and 3D input are prepared. Of course, it became 3D because we assumed channels for a 2D input.
When performing convolution between the 3D kernel and the 3D input, the result is a 2D output with 1 channel: because a kernel was prepared for every input channel, the channel dimension is summed away.
If you want the 2D output to have multiple channels instead of 1, just create multiple 3D kernel tensors and apply them!
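The shape bookkeeping can be sketched with plain loops (all sizes here are made up for illustration): one 3D kernel per desired output channel, each summed over the input channels.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 3, 5, 5   # 2D input with 3 channels -> a 3D tensor
KH, KW = 3, 3
OC = 4              # desired number of output channels

x = rng.standard_normal((C, H, W))
# One 3D kernel per output channel: shape (OC, C, KH, KW).
kernels = rng.standard_normal((OC, C, KH, KW))

OH, OW = H - KH + 1, W - KW + 1
out = np.zeros((OC, OH, OW))
for o in range(OC):
    for i in range(OH):
        for j in range(OW):
            # Per-channel convolution, summed over channels -> one 2D map per 3D kernel.
            out[o, i, j] = np.sum(kernels[o] * x[:, i:i + KH, j:j + KW])

print(out.shape)  # (4, 3, 3)
```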

CNN backpropagation
When computing backpropagation, convolution operations appear as well. It sounds complicated, but the formula is as follows.

d/dx [f * g](x) = d/dx ∫ f(z) g(x − z) dz
                = ∫ f(z) g′(x − z) dz = [f * g′](x)

- f: kernel
- g: signal (input)
- Goal: differentiate the convolution of f and g
To differentiate with respect to x, only g contains the x term, so the derivative applies only to g. In other words, as shown in the second line of the formula, it becomes a convolution of f and the derivative of g!
This applies equally in the discrete case.
Example
Suppose we perform convolution with input and kernel as vectors. The results are stored in the output vector.

Assume the error is computed from the loss function and its derivative has reached the output vector through backpropagation.
This might be confusing, but looking at the figure above: X3 and W3 are multiplied to produce O1. Similarly, X3 and W2 are multiplied for O2, and X3 and W1 for O3.
In the same way, the derivatives are multiplied with W3, W2, W1 of the kernel and delivered to X3.
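That bookkeeping — X3 receiving the upstream derivatives through W3, W2, W1 — is exactly a full convolution of the output gradient with the kernel. A sketch verified against a numerical gradient (names and values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input X1..X5
w = np.array([0.5, -1.0, 2.0])           # kernel W1..W3

def forward(x, w):
    # Valid cross-correlation, as CNNs implement it: o[i] = sum_k w[k] * x[i+k]
    return np.array([np.dot(w, x[i:i + len(w)]) for i in range(len(x) - len(w) + 1)])

o = forward(x, w)
dL_do = np.array([1.0, -2.0, 0.5])  # upstream gradient reaching O1..O3

# dL/dx[j] = sum_i dL/do[i] * w[j - i]  -> a full convolution (flipped-kernel order).
dL_dx = np.convolve(dL_do, w, mode='full')

# Numerical check via central differences.
eps = 1e-6
num = np.zeros_like(x)
for j in range(len(x)):
    xp = x.copy(); xp[j] += eps
    xm = x.copy(); xm[j] -= eps
    num[j] = (np.sum(forward(xp, w) * dL_do) - np.sum(forward(xm, w) * dL_do)) / (2 * eps)
print(np.allclose(dL_dx, num))  # True
```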

The kernel is updated in the same way, apparently. I don’t fully understand this part, honestly…
Putting it all together, even backpropagation proceeds identically to a convolution operation!
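For the kernel update itself (the part the note leaves open): since O_i = Σ_k W_k X_{i+k}, the gradient for W_k gathers each dL/dO_i against X_{i+k} — which is again a valid cross-correlation, this time of the input with the upstream gradient. A hedged sketch with illustrative values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, -1.0, 2.0])
dL_do = np.array([1.0, -2.0, 0.5])  # upstream gradient for O1..O3

# dL/dw[k] = sum_i dL/do[i] * x[i + k]  -> cross-correlate x with dL/do.
dL_dw = np.array([np.dot(dL_do, x[k:k + len(dL_do)]) for k in range(len(w))])
print(dL_dw)

# Same result via numpy's correlate in valid mode.
print(np.allclose(dL_dw, np.correlate(x, dL_do, mode='valid')))  # True
```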