I’ve organized this in Notion multiple times before, but here’s another pass centered on what I learned at Boostcamp.
Differentiation
```python
import sympy as sym
from sympy.abc import x

sym.diff(sym.poly(x**2 + 2*x + 3), x)
# Poly(2*x + 2, x, domain='ZZ')
```
Gradient Ascent
Add the derivative of f(x) to x and repeat, to find the location of the function's local maximum. Used when the objective function needs to be maximized.
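A minimal sketch of this update rule, on a hypothetical function f(x) = -(x - 2)² chosen here for illustration (its maximum is at x = 2):

```python
# Gradient ascent on f(x) = -(x - 2)**2, whose maximum is at x = 2.
# grad_f is the analytic derivative f'(x) = -2*(x - 2).
def grad_f(x):
    return -2 * (x - 2)

x = 0.0      # starting point
lr = 0.1     # learning rate
for _ in range(100):
    x = x + lr * grad_f(x)   # ascent: ADD the derivative

print(round(x, 4))  # → 2.0
```

Note the sign: adding the derivative climbs the function, which is the only difference from gradient descent below.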
Gradient Descent
Subtract the derivative of f(x) from x and repeat, to find the location of the function's local minimum. Used when the objective function needs to be minimized.
Algorithm
```python
# gradient: function that computes the derivative
# init: starting point
# lr: learning rate
# eps: epsilon
var = init
grad = gradient(var)
while abs(grad) > eps:
    var = var - lr * grad
    grad = gradient(var)
```
The goal is for the derivative to reach 0, but computers can't represent that exactly with floating point. So a very small positive value, epsilon, is used as the termination condition.
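To make the loop above concrete, here it is run on f(x) = x² + 2x + 3 from the sympy example, whose derivative 2x + 2 vanishes at the minimum x = -1 (the starting point and hyperparameters are arbitrary choices):

```python
# Gradient descent on f(x) = x**2 + 2*x + 3; f'(x) = 2*x + 2,
# so the minimum is at x = -1.
def gradient(x):
    return 2 * x + 2

var = 10.0   # init
lr = 0.1     # learning rate
eps = 1e-6   # epsilon
grad = gradient(var)
while abs(grad) > eps:
    var = var - lr * grad
    grad = gradient(var)

print(var)  # ≈ -1.0
```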
Partial Differentiation
Variables in ML are usually vectors, so partial derivatives are needed instead of regular derivatives to establish directionality. Same as what you’d do in a calculus class.
$$\partial_{x_i} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\, e_i) - f(\mathbf{x})}{h}$$
Here, ei is a unit vector with 1 at the i-th position and 0 elsewhere. It filters out only the desired component for differentiation.
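The definition can be sketched numerically by replacing the limit with a small finite h (the function f and evaluation point here are arbitrary examples):

```python
import numpy as np

# Numerical partial derivative from the definition:
# df/dx_i (x) ≈ (f(x + h*e_i) - f(x)) / h  for small h
def partial(f, x, i, h=1e-6):
    e_i = np.zeros_like(x)   # unit vector: 1 at position i, 0 elsewhere
    e_i[i] = 1.0
    return (f(x + h * e_i) - f(x)) / h

f = lambda v: v[0]**2 + 3*v[1]   # example: f(x, y) = x^2 + 3y
p = np.array([2.0, 5.0])
print(partial(f, p, 0))  # ≈ 4.0 (= 2x at x = 2)
print(partial(f, p, 1))  # ≈ 3.0
```

The e_i vector is exactly the filter described above: perturbing x along e_i isolates the i-th variable while all others stay fixed.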
Gradient Vector
Nabla
When the function takes a vector as input, partial derivatives must be used, and the number of variables can get very large.
So we collect the partial derivative results for all variables back into a vector and use that for gradient descent. This is called the gradient vector, and its advantage is enabling simultaneous updates across all variables.
$$\nabla f = \left(\partial_{x_1} f,\ \partial_{x_2} f,\ \ldots,\ \partial_{x_d} f\right)$$
This symbol is called nabla.
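Collecting the partial derivatives into a vector is straightforward with sympy; the function f(x, y) below is an arbitrary example:

```python
import sympy as sym
from sympy.abc import x, y

# Gradient vector of f(x, y) = x**2 + 2*x*y + 3:
# one partial derivative per variable, collected into a list.
f = x**2 + 2*x*y + 3
grad = [sym.diff(f, v) for v in (x, y)]
print(grad)  # [2*x + 2*y, 2*x]
```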
Gradient Vector Visualization
Contour plots make this easy to understand. At each point, the gradient vector is perpendicular to the contour line; the negative gradient points in the direction of fastest decrease, so for a function like f(x, y) = x² + y² the arrows of -∇f all point toward the origin (the minimum).
Algorithm Using Gradient Vector
```python
# gradient: function that computes the gradient vector
# init: starting point
# lr: learning rate
# eps: epsilon
var = init
grad = gradient(var)
while norm(grad) > eps:
    var = var - lr * grad
    grad = gradient(var)
```
The differences are the definition of gradient (it now returns a vector) and using norm instead of abs in the termination condition.
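A runnable version of the vector loop, using numpy's norm and the example f(x, y) = x² + y², whose gradient vector is (2x, 2y) and whose minimum is the origin (the starting point and hyperparameters are arbitrary choices):

```python
import numpy as np

# Gradient descent with a gradient vector: every variable is
# updated simultaneously by one vector subtraction.
def gradient(v):
    return 2 * v   # gradient of f(x, y) = x**2 + y**2

var = np.array([3.0, -4.0])   # init
lr = 0.1                      # learning rate
eps = 1e-6                    # epsilon
grad = gradient(var)
while np.linalg.norm(grad) > eps:
    var = var - lr * grad
    grad = gradient(var)

print(var)  # both components ≈ 0, the minimum
```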