Neural Networks
Some say neural networks work well because they mimic the neural networks in the human brain. There’s some truth to that: the structure of NN nodes does resemble that of actual neurons.
But it’s a stretch to say they mimic the brain, since processes essential to NNs, such as backpropagation, have no clear biological counterpart.

Early airplanes were modeled after bats and birds, and the Wright brothers’ airplane still had that shape to some extent. But you can’t say modern aircraft mimic bats or birds.
The same goes for NNs. They started out mimicking the human brain in order to replicate human intelligence, but recent DL research has diverged significantly from how humans operate.
The point is: don’t just assume NNs work because they mimic humans. Analyze mathematically why they work.
Define
Neural networks are function approximators that stack affine transformations followed by nonlinear transformations.
- They approximate functions.
- Nonlinearity is introduced through activation functions.
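As a sketch of this definition, here are two affine transformations with a nonlinearity in between. All sizes and values are made up for illustration.

```python
import numpy as np

# A neural network as the definition describes it: stack an affine map
# (Wx + b), apply a nonlinearity, then another affine map.
rng = np.random.default_rng(0)

x = rng.normal(size=(3,))                      # input vector (size is made up)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # first affine layer
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # second affine layer

h = np.maximum(0.0, W1 @ x + b1)  # affine transform, then nonlinear (ReLU)
y = W2 @ h + b2                   # final affine transform

print(y.shape)  # (2,)
```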
Linear Neural Network
Simple Data

Let’s define the data, the model ŷ = wx + b, and a squared-error loss. Now let’s find the optimal w and b, the model’s parameters.
 
It’s a linear regression problem with a convex loss and a small training set, so a closed-form solution (the normal equations) can find the optimal w and b in one shot. In DL, though, we use gradient descent with backpropagation.
The goal of backpropagation is to update the parameters in the direction that minimizes the loss. I’ve explained backpropagation in detail in another post, so I won’t elaborate further here.
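A minimal sketch of that update loop for the linear model. The dataset and learning rate here are made-up illustrations, not values from the original post.

```python
import numpy as np

# Gradient descent on a squared-error loss: nudge w and b in the
# direction that decreases the loss.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                 # ground truth: w = 2, b = 1 (made up)

w, b = 0.0, 0.0
lr = 0.05                         # learning rate (assumed value)
for _ in range(2000):
    err = (w * x + b) - y
    # gradients of the mean squared error with respect to w and b
    w -= lr * 2 * np.mean(err * x)
    b -= lr * 2 * np.mean(err)

print(w, b)  # approaches w = 2, b = 1
```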

Larger Data
With higher-dimensional data, weights are represented as matrices. The goal is to map x to y through W and b: y = Wx + b.
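In matrix form this looks like the following sketch, where each row of X is one data point. All sizes are made up for illustration.

```python
import numpy as np

# One affine map sends every sample from dimension 3 to dimension 2 at once.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))  # 5 samples, 3 features (made-up sizes)
W = rng.normal(size=(3, 2))  # weight matrix
b = np.zeros(2)              # bias

Y = X @ W + b                # x sent to y through W and b
print(Y.shape)  # (5, 2)
```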
Stacking More Layers

To stack more layers, we can write the network as a product of weight matrices: y = W2(W1 x), which is just the basic formula above nested inside itself (bias omitted).
The intention was to create a multi-layer network with a hidden layer, but the result is still effectively a single layer, because W2 and W1 collapse into the single weight W2 W1 through matrix multiplication.
So a nonlinear transform must be applied between the linear transforms for layer stacking to have any effect.
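The collapse is easy to check numerically. A minimal sketch with made-up shapes: two stacked linear layers equal one matrix product, while a ReLU in between breaks that equivalence.

```python
import numpy as np

# Two stacked linear layers collapse: W2 @ (W1 @ x) == (W2 @ W1) @ x.
rng = np.random.default_rng(2)
W1 = rng.normal(size=(4, 3))  # first layer weight (shapes are made up)
W2 = rng.normal(size=(2, 4))  # second layer weight
x = rng.normal(size=(3,))

two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x     # the single collapsed weight
print(np.allclose(two_layers, one_layer))  # True

# With a nonlinearity (ReLU) in between, the stack is no longer
# expressible as one matrix product.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
```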
Activation Functions
Common choices include sigmoid, tanh, and ReLU. Which one is best depends on the problem.
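Here are sketches of those three common activations in NumPy; the sample inputs are made up.

```python
import numpy as np

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes inputs into (-1, 1)
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))        # [0. 0. 2.]
print(sigmoid(0.0))   # 0.5
print(tanh(0.0))      # 0.0
```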
Beyond Linear Neural Networks
On any compact set K, any continuous function can be approximated as closely as desired by a network with just one hidden layer (the universal approximation theorem). => But this only implies existence. It doesn’t guarantee that the NN I actually train will approximate the function I want.
It only demonstrates the expressive power of NNs.
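That expressive power can be illustrated (not proved) with a sketch: one hidden layer of random ReLU features, with only the output weights fitted by least squares, already tracks sin(x) on a compact interval. The width, ranges, and target function are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, np.pi, 200)       # a compact interval (made up)
target = np.sin(x)                     # the function to approximate

# one hidden layer: 50 random ReLU units (width is made up)
W = rng.normal(size=(50,))
b = rng.uniform(-np.pi, np.pi, size=(50,))
H = np.maximum(0.0, np.outer(x, W) + b)  # hidden activations, shape (200, 50)

# fit only the output weights, in closed form via least squares
a, *_ = np.linalg.lstsq(H, target, rcond=None)
approx = H @ a

print(float(np.max(np.abs(approx - target))))  # small worst-case error
```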
Loss Function

Cross entropy is used in classification problems.
Labels in classification are typically represented as one-hot vectors. Only the dimension to be classified has a value; the rest are all 0. The value itself doesn’t matter — it could be 1 or 1000000. What matters is that it’s distinct from other values.
Cross entropy is used to express this property mathematically.
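As a sketch of that property, here is a minimal NumPy cross entropy on made-up logits: with a one-hot label, only the softmax probability of the true class enters the loss.

```python
import numpy as np

def cross_entropy(logits, label):
    # softmax with the usual max-shift for numerical stability
    z = logits - np.max(logits)
    p = np.exp(z) / np.sum(np.exp(z))
    # one-hot label: only the true class's probability matters
    return -np.log(p[label])

logits = np.array([2.0, 0.5, -1.0])          # made-up model outputs
loss_correct = cross_entropy(logits, 0)      # true class has largest logit
loss_wrong = cross_entropy(logits, 2)
print(loss_correct < loss_wrong)  # True
```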
Alternatively, suppose we’re building a model to estimate age groups from face photos. In that case, the output is usually expressed probabilistically, and the model is trained by maximizing the log-likelihood (MLE); in fact, MSE itself corresponds to MLE under a Gaussian output assumption.
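To make that connection concrete, a small sketch with made-up numbers: a fixed-variance Gaussian negative log-likelihood is just half the squared error plus a constant, so minimizing one minimizes the other.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])        # targets (made up)
y_hat = np.array([1.1, 1.9, 3.2])    # model outputs (made up)

mse = np.mean((y - y_hat) ** 2)
# per-sample Gaussian NLL with sigma = 1:
#   0.5 * (y - y_hat)^2 + 0.5 * log(2*pi)
nll = np.mean(0.5 * (y - y_hat) ** 2 + 0.5 * np.log(2 * np.pi))

print(np.isclose(nll, 0.5 * mse + 0.5 * np.log(2 * np.pi)))  # True
```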