Neural network
NN in Linear Regression

A classic example that makes good use of matrix operations is a neural network (NN).
The matrix X holds the data. W maps each data point into a different dimension.
The bias b adds the intercept to every row vector at once.
The original X matrix with dimensions (n, d) gets transformed to (n, p).
Interpretation

X, which was d-dimensional, gets connected to p dimensions.
Each arrow represents one entry of the W matrix. Since each of the d inputs points to p outputs, there are d × p arrows, which matches the dimensions of W.
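The shapes described above can be sketched in NumPy (the dimensions n, d, p below are my own example values, not from the source):

```python
import numpy as np

n, d, p = 4, 3, 2
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))   # data matrix, shape (n, d)
W = rng.normal(size=(d, p))   # weight matrix, shape (d, p) -- d * p "arrows"
b = rng.normal(size=(p,))     # intercept, broadcast onto every row of XW

O = X @ W + b                 # transformed data, shape (n, p)
print(O.shape)                # (4, 2)
```

Broadcasting is what lets the single vector b be "added to each row at once."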
NN in Classification
Softmax
 
In classification, softmax converts a score vector into a probability vector. In other words, composing softmax with a linear model lets us interpret the model’s output in the desired format: class probabilities.
Softmax Implementation
```python
import numpy as np

def softmax(vec):
    # subtract the row-wise max before exponentiating to prevent overflow;
    # softmax is invariant to shifting all entries by a constant
    numerator = np.exp(vec - np.max(vec, axis=-1, keepdims=True))
    denominator = np.sum(numerator, axis=-1, keepdims=True)
    return numerator / denominator
```

np.max is subtracted to prevent overflow in np.exp. Because softmax is unchanged when the same constant is subtracted from every entry, the original softmax result is preserved.
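A quick self-contained check (softmax is redefined here so the snippet runs on its own) that each output row sums to 1 and that the max-subtraction trick really leaves the result unchanged:

```python
import numpy as np

def softmax(vec):
    # stabilized softmax: shift by the row max before exponentiating
    numerator = np.exp(vec - np.max(vec, axis=-1, keepdims=True))
    return numerator / np.sum(numerator, axis=-1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])  # naive exp would overflow here
probs = softmax(logits)
print(np.allclose(probs.sum(axis=-1), 1.0))  # True: each row is a probability vector
print(np.allclose(probs[0], probs[1]))       # True: equal shifts of the logits change nothing
```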
Prediction
For prediction, we don’t use softmax; we only use methods like one-hot encoding. I think this works because softmax is monotonic: the class with the largest linear output is also the class with the largest probability, so the argmax is the same with or without softmax.
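A small check of that monotonicity claim (softmax computed inline so the snippet is self-contained):

```python
import numpy as np

logits = np.array([[0.2, 2.5, -1.0],
                   [3.0, 0.1, 0.4]])
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)

# softmax preserves ordering, so the winning class is the same either way
print(np.array_equal(np.argmax(logits, axis=-1),
                     np.argmax(probs, axis=-1)))  # True
```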
```python
def one_hot(val, dim):
    return [np.eye(dim)[_] for _ in val]

def one_hot_encoding(vec):
    vec_dim = vec.shape[1]
    vec_argmax = np.argmax(vec, axis=-1)
    return one_hot(vec_argmax, vec_dim)
```

Activation Function
- Activation functions convert linear outputs into nonlinear ones.
- The vector transformed by an activation function = hidden vector, latent vector, neuron
- Neural network (NN) = a model composed of neurons
- Perceptron = a traditional model composed only of neurons 
The difference from softmax is that softmax takes all entries of the vector into account at once, whereas these activation functions are applied to each real number individually. I thought softmax was also an activation function, but it differs in exactly this way: it is not elementwise.
Definition
A nonlinear function defined on real numbers. An NN without activation functions is no different from a linear model!
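The "no different from a linear model" claim can be verified directly: without an activation in between, stacking two linear layers collapses into a single linear map (by associativity of matrix multiplication). Shapes below are my own example values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

# without an activation, two linear layers equal one linear layer W1 @ W2
two_layers = (X @ W1) @ W2
one_layer = X @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))  # True
```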
Types

Traditionally, sigmoid and tanh were used. Recently, ReLU and its variants are used.
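These activations are a few lines each in NumPy; a minimal sketch (note they act elementwise, unlike softmax):

```python
import numpy as np

def sigmoid(x):
    # squashes each real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes each real number into (-1, 1)
    return np.tanh(x)

def relu(x):
    # zeroes out negatives, passes positives through
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # values in (0, 1)
print(relu(x))     # [0. 0. 2.]
```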
NN (Neural Network)
Definition
A function that composes linear models with activation functions. 
Layers are stacked by repeatedly applying a linear transform (producing z) and an activation (producing the latent vector h) within the network. The figure above is a two-layer NN. Generalizing this gives the following:

As mentioned above, when activation functions are applied, they operate individually on each real number within a vector.
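The two-layer forward pass described above can be sketched as follows (layer sizes are my own example values):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))                   # input, shape (n, d)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

# linear map -> elementwise activation -> linear map
z = X @ W1 + b1   # pre-activation
h = relu(z)       # latent (hidden) vector
o = h @ W2 + b2   # output of the two-layer NN
print(o.shape)    # (4, 2)
```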
Why Use More Than 2 Layers
Universal approximation theorem
- Even a 2-layer network can approximate any continuous function
- But it’s hard to achieve in practice
The deeper the layers, the fewer neurons are needed to approximate the target function.
So deeper NNs are typically used. But optimization gets harder.
Forward Propagation
Computing the NN’s output by following the layer-stacking process as-is, from input to output. (Forward propagation itself doesn’t adjust the weights; it produces the values the weight update will need.)
Back Propagation
lol ..
Parameter Update in Linear Models
A linear model can be thought of as having a single layer. That is, all parameters get updated at once.
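For a single-layer (linear) model, one gradient step really does touch every parameter at once. A minimal gradient-descent sketch for linear regression (data, learning rate, and iteration count are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= lr * grad                         # all parameters updated in one step
print(np.allclose(w, true_w, atol=1e-3))   # True
```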
Parameter Update in NN
An NN, on the other hand, consists of multiple layers. So all parameters can’t be updated at once. It has to be done sequentially.
Principle
 Final goal: update all parameters used across L layers.

Using the chain rule of differentiation, parameters are updated by going backward from the output to the input.
Chain-Rule-Based Auto-Differentiation
The chain rule is the same concept from high school math. 
For this to work via the chain rule, the computer needs to store the tensor values at each node of the computation graph during the forward pass.
Forward propagation, on the other hand, just computes sequentially without keeping those intermediates around, so it’s more memory-efficient than back propagation.

In the figure above, blue arrows represent forward propagation and red arrows represent back propagation. It shows the process of using the chain rule to compute the gradient vector for W1.
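The chain-rule computation of the gradient for W1 can be written out by hand for a small two-layer net and checked against finite differences (shapes and the tanh activation are my own example choices):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(4, 3))
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))
y = rng.normal(size=(4, 2))

def loss(W1):
    h = np.tanh(X @ W1)              # forward pass through the two layers
    return 0.5 * np.sum((h @ W2 - y) ** 2)

# back propagation: chain rule applied from the output back to W1
h = np.tanh(X @ W1)                  # forward values are kept for the backward pass
o = h @ W2
dL_do = o - y                        # dL/do for the squared-error loss
dL_dh = dL_do @ W2.T                 # chain through the second linear layer
dL_dz = dL_dh * (1 - h ** 2)         # chain through tanh (elementwise derivative)
dL_dW1 = X.T @ dL_dz                 # chain through the first linear layer

# numerical check with central finite differences
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        E = np.zeros_like(W1)
        E[i, j] = eps
        num[i, j] = (loss(W1 + E) - loss(W1 - E)) / (2 * eps)
print(np.allclose(dL_dW1, num, atol=1e-5))  # True
```

Note how the backward pass reuses h from the forward pass; this is exactly the stored-tensor memory cost mentioned above.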
