Linear Algebra
12.810 min read

Gradient Descent Algorithm

Gradient descent is an iterative algorithm for finding a local minimum of a differentiable function ff. Starting from an initial point x0\mathbf{x}_0, it repeatedly takes steps in the direction of steepest descent:

xk+1=xkαkf(xk)\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)

The step size (learning rate) αk\alpha_k can be fixed or adapted at each step. Common strategies include: (1) fixed step size: αk=α\alpha_k = \alpha for all kk; (2) line search: choose αk=argminα>0f(xkαf(xk))\alpha_k = \arg\min_{\alpha > 0} f(\mathbf{x}_k - \alpha\nabla f(\mathbf{x}_k)); (3) diminishing step sizes: αk0\alpha_k \to 0 at an appropriate rate.

Gradient descent converges to a critical point (where f=0\nabla f = \mathbf{0}) under mild conditions. For convex functions, any critical point is a global minimum.

Formal View

Algorithm 12.1 — Gradient Descent
Input: differentiable ff, initial point x0\mathbf{x}_0, step size α>0\alpha > 0, tolerance ϵ>0\epsilon > 0 Repeat: xk+1=xkαf(xk)\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k) Until: f(xk)<ϵ\|\nabla f(\mathbf{x}_k)\| < \epsilon Return: xk\mathbf{x}_k as approximate minimizer.
Theorem 12.6 — Convergence for Convex Functions
If ff is convex and C1C^1 with LL-Lipschitz gradient, then gradient descent with step size α1/L\alpha \leq 1/L satisfies f(xk)f(x)x0x22αkf(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \frac{\|\mathbf{x}_0 - \mathbf{x}^*\|^2}{2\alpha k}.

This is an O(1/k)O(1/k) convergence rate. Strongly convex functions give exponential O(ρk)O(\rho^k) convergence.

Interactive Visualization

Gradient Descent Visualizer

Why This Matters

Gradient descent (and its variants) is the engine behind virtually all modern machine learning.

  • Training neural networks via stochastic gradient descent (SGD)
  • Solving large-scale linear regression, logistic regression, and SVMs
  • Scientific computing: solving variational problems and PDE-constrained optimization

Quiz

Question 1

Gradient descent terminates when:

Question 2

For a non-convex function, gradient descent is guaranteed to find the global minimum.

Common Mistakes

  • Setting the learning rate too large, causing oscillation or divergence.
  • Stopping too early (gradient not small enough) and returning a point far from the minimum.
  • Assuming gradient descent finds the global minimum for non-convex objectives.