Gradient Descent Algorithm

Gradient descent is an iterative algorithm for finding a local minimum of a differentiable function $f$. Starting from an initial point $\mathbf{x}_0$, it repeatedly takes a step in the direction of steepest descent, which is the negative gradient:

$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)$

The step size (learning rate) $\alpha_k$ can be fixed or adapted at each step. Common strategies include: (1) fixed step size: $\alpha_k = \alpha$ for all $k$; (2) exact line search: choose $\alpha_k = \arg\min_{\alpha > 0} f(\mathbf{x}_k - \alpha\nabla f(\mathbf{x}_k))$; (3) diminishing step sizes: $\alpha_k \to 0$ slowly enough that $\sum_k \alpha_k = \infty$.
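The three step-size strategies can be compared on a small example. The sketch below is illustrative only: the quadratic $f(x, y) = \tfrac{1}{2}(x^2 + 10y^2)$, the starting point, and the helper names (`fixed_step`, `exact_line_search`, `diminishing_step`, `run`) are assumptions made for the demonstration, and the closed-form line search is valid only for this quadratic.

```python
# Illustrative objective (an assumption, not from the text):
# f(x, y) = 0.5 * (x^2 + 10 * y^2), whose gradient is (x, 10 * y).
def f(p):
    x, y = p
    return 0.5 * (x * x + 10.0 * y * y)

def grad(p):
    x, y = p
    return (x, 10.0 * y)

def fixed_step(k, g):
    # (1) Fixed step size: alpha_k = alpha for all k (here 1/L with L = 10).
    return 0.1

def exact_line_search(k, g):
    # (2) Exact line search. For f = 0.5 * p^T A p the minimizing alpha is
    # (g . g) / (g . A g); this closed form is specific to quadratics.
    gx, gy = g
    den = gx * gx + 10.0 * gy * gy
    return (gx * gx + gy * gy) / den if den > 0.0 else 0.0

def diminishing_step(k, g):
    # (3) Diminishing step sizes: alpha_k -> 0 while the sum of steps diverges.
    return 0.1 / (k + 1) ** 0.5

def run(step_rule, p0, iters=50):
    p = p0
    for k in range(iters):
        g = grad(p)
        a = step_rule(k, g)
        p = (p[0] - a * g[0], p[1] - a * g[1])
    return p

start = (10.0, 1.0)
results = {rule.__name__: f(run(rule, start))
           for rule in (fixed_step, exact_line_search, diminishing_step)}
for name, value in results.items():
    print(f"{name:>18}: f = {value:.3e}")
```

All three rules reduce $f$ from its starting value on this problem. How they rank against each other depends on the objective and the starting point, so the printed numbers should not be read as a general comparison.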

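Under mild conditions, gradient descent converges to a critical point (where $\nabla f = \mathbf{0}$). For example, if $\nabla f$ is $L$-Lipschitz, $f$ is bounded below, and the step size is a fixed $\alpha \leq 1/L$, then $\|\nabla f(\mathbf{x}_k)\| \to 0$, so every limit point of the iterates is a critical point. For convex functions, any critical point is a global minimum.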
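Without convexity, the critical point that is reached depends on where the iteration starts. The following sketch illustrates this on a hypothetical one-dimensional objective, $f(x) = x^4 - 3x^2 + x$, which has two local minima. The function, step size, and iteration count are assumptions chosen for the example.

```python
# Hypothetical non-convex objective with two local minima:
# f(x) = x^4 - 3 x^2 + x, so f'(x) = 4 x^3 - 6 x + 1.
def f(x):
    return x ** 4 - 3.0 * x ** 2 + x

def grad_f(x):
    return 4.0 * x ** 3 - 6.0 * x + 1.0

def descend(x0, alpha=0.01, iters=2000):
    # Plain fixed-step gradient descent in one dimension.
    x = x0
    for _ in range(iters):
        x -= alpha * grad_f(x)
    return x

left = descend(-2.0)   # settles at the minimum near x = -1.30
right = descend(2.0)   # settles at the minimum near x = 1.13

print(f"start -2.0 -> x = {left:.4f}, f = {f(left):.4f}")
print(f"start  2.0 -> x = {right:.4f}, f = {f(right):.4f}")
```

Both runs end at points where the derivative is essentially zero, but only the left one is the global minimum; the right one is a local minimum with a higher function value.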

Formal View

Algorithm 12.1 — Gradient Descent

Input: differentiable $f$, initial point $\mathbf{x}_0$, step size $\alpha > 0$, tolerance $\epsilon > 0$

Repeat: $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)$

Until: $\|\nabla f(\mathbf{x}_k)\| < \epsilon$

Return: $\mathbf{x}_k$ as approximate minimizer.
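A minimal Python sketch of Algorithm 12.1 follows. The function name `gradient_descent`, the `max_iter` safety cap, and the quadratic test problem are assumptions added for illustration; they are not part of the algorithm as stated.

```python
import math

def gradient_descent(grad_f, x0, alpha, eps, max_iter=10_000):
    """Fixed-step gradient descent following Algorithm 12.1.

    max_iter is a safety cap added for this sketch; it is not part of
    the algorithm as stated.
    """
    x = list(x0)
    for k in range(max_iter):
        g = grad_f(x)
        # Until: ||grad f(x_k)|| < eps
        if math.sqrt(sum(gi * gi for gi in g)) < eps:
            return x, k
        # Repeat: x_{k+1} = x_k - alpha * grad f(x_k)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, max_iter

# Hypothetical test problem: f(x, y) = (x - 1)^2 + 2 * (y + 2)^2,
# which is minimized at (1, -2).
def grad_f(p):
    x, y = p
    return [2.0 * (x - 1.0), 4.0 * (y + 2.0)]

x_min, iterations = gradient_descent(grad_f, [0.0, 0.0], alpha=0.2, eps=1e-8)
print(x_min, iterations)
```

The gradient of this test problem is 4-Lipschitz, so the step size 0.2 stays below $1/L = 0.25$. The cap on iterations guards against an endless loop when the step size is too large for the tolerance to be reached.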

Theorem 12.6 — Convergence for Convex Functions

If $f$ is convex and $C^1$ with $L$-Lipschitz gradient, then gradient descent with step size $\alpha \leq 1/L$ satisfies

$f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \frac{\|\mathbf{x}_0 - \mathbf{x}^*\|^2}{2\alpha k}$

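This is an $O(1/k)$ convergence rate, where $\mathbf{x}^*$ denotes a minimizer of $f$. If $f$ is also strongly convex, the error decays geometrically, as $O(\rho^k)$ for some $0 < \rho < 1$; the optimization literature calls this linear convergence.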
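The bound in Theorem 12.6 can be checked numerically on a simple example. The sketch below uses a hypothetical test function, $f(x) = \log\cosh x$, which is convex with minimizer $x^* = 0$ and has a 1-Lipschitz derivative $\tanh x$, so $L = 1$. The starting point and iteration count are arbitrary choices.

```python
import math

# Hypothetical test function: f(x) = log(cosh(x)), convex with minimizer
# x* = 0 and f(x*) = 0. Its derivative tanh(x) is 1-Lipschitz, so L = 1.
def f(x):
    return math.log(math.cosh(x))

def grad_f(x):
    return math.tanh(x)

L = 1.0
alpha = 1.0 / L
x0 = 3.0
x_star, f_star = 0.0, 0.0

x = x0
bound_holds = True
for k in range(1, 201):
    x = x - alpha * grad_f(x)                        # x is now x_k
    gap = f(x) - f_star                              # left side of the bound
    bound = (x0 - x_star) ** 2 / (2.0 * alpha * k)   # right side of the bound
    bound_holds = bound_holds and gap <= bound
    if k in (1, 10, 100):
        print(f"k={k:3d}  gap={gap:.3e}  bound={bound:.3e}")

print("bound held at every checked iterate:", bound_holds)
```

On this example the actual gap shrinks far faster than the bound. That is expected: the theorem gives a worst-case guarantee over all convex functions with $L$-Lipschitz gradient, not a prediction for a particular one.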

Why This Matters

Gradient descent and its variants are the optimization engine behind most of modern machine learning. Typical applications include:

Training neural networks via stochastic gradient descent (SGD)
Solving large-scale linear regression, logistic regression, and SVMs
Scientific computing: solving variational problems and PDE-constrained optimization

Learning Resources

Gradient Descent Convergence

MIT OpenCourseWare

MIT lecture on gradient descent convergence theory.

50 min

Gradient Descent Deep Dive

StatQuest

Intuitive explanation of gradient descent with worked examples.

15 min

Quiz

Question 1

Gradient descent terminates when:

Question 2

True or false: for a non-convex function, gradient descent is guaranteed to find the global minimum.

Common Mistakes

Setting the learning rate too large, causing oscillation or divergence.
Stopping too early (gradient not small enough) and returning a point far from the minimum.
Assuming gradient descent finds the global minimum for non-convex objectives.
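
The first mistake is easy to reproduce. The sketch below runs fixed-step gradient descent on the illustrative objective $f(x) = x^2$ (an assumption chosen for simplicity), where each update multiplies $x$ by $1 - 2\alpha$, so the behavior depends only on the size of that factor.

```python
# Illustrative objective: f(x) = x^2 with gradient 2 x, so L = 2.
def run(alpha, x0=1.0, iters=50):
    x = x0
    for _ in range(iters):
        x = x - alpha * 2.0 * x   # each step multiplies x by (1 - 2 * alpha)
    return x

converged = run(0.1)    # |1 - 2 * alpha| = 0.8 < 1: iterates shrink toward 0
oscillating = run(1.0)  # factor is -1: iterates flip between +1 and -1
diverged = run(1.1)     # |factor| = 1.2 > 1: iterates grow in magnitude

print(converged, oscillating, diverged)
```

For this objective any $\alpha < 1$ converges, $\alpha = 1$ oscillates forever, and $\alpha > 1$ diverges. The thresholds for other functions depend on the Lipschitz constant of the gradient.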