Linear Algebra

Steepest Descent Direction

For minimization, we want to decrease $f$ as rapidly as possible. The direction of steepest descent is $-\nabla f(\mathbf{a})/\|\nabla f(\mathbf{a})\|$, exactly opposite to the gradient.

Taking a small step of size $\alpha > 0$ along the negative gradient gives, to first order in $\alpha$: $f(\mathbf{a} - \alpha \nabla f(\mathbf{a})) \approx f(\mathbf{a}) - \alpha\|\nabla f(\mathbf{a})\|^2 < f(\mathbf{a})$. This decrease is guaranteed for sufficiently small $\alpha$, as long as $f$ is differentiable at $\mathbf{a}$ and the gradient there is nonzero.
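The first-order estimate can be checked numerically. The sketch below is only an illustration: the test function $f(x, y) = x^2 + 3y^2$, the point $\mathbf{a} = (1, -2)$, and the helper name `step_report` are assumptions chosen for the example, not part of the lesson.

```python
def f(p):
    """Illustrative test function f(x, y) = x^2 + 3*y^2 (assumed for this example)."""
    x, y = p
    return x**2 + 3 * y**2

def grad_f(p):
    """Analytic gradient of the test function above."""
    x, y = p
    return (2 * x, 6 * y)

a = (1.0, -2.0)                      # assumed base point
g = grad_f(a)
grad_norm_sq = g[0]**2 + g[1]**2     # ||grad f(a)||^2

def step_report(alpha):
    """Return (actual f after the step, first-order prediction) for step size alpha."""
    new_point = (a[0] - alpha * g[0], a[1] - alpha * g[1])
    actual = f(new_point)
    predicted = f(a) - alpha * grad_norm_sq
    return actual, predicted

for alpha in (1e-1, 1e-2, 1e-3):
    actual, predicted = step_report(alpha)
    print(f"alpha={alpha:g}  f(a)={f(a):.6f}  actual={actual:.6f}  predicted={predicted:.6f}")
```

For this particular function, the gap between the actual and predicted values shrinks roughly like $\alpha^2$, which is what a first-order approximation should do.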

This is the mathematical justification for gradient descent: at each step, update $\mathbf{x} \leftarrow \mathbf{x} - \alpha \nabla f(\mathbf{x})$. The parameter $\alpha > 0$ is called the step size or learning rate. Too small: slow convergence. Too large: may overshoot and diverge.
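A minimal sketch of this update rule in plain Python follows. The function name `gradient_descent`, the quadratic test objective, the starting point, and the three step sizes are all assumptions made for illustration; a fixed step size and a fixed iteration count are used for simplicity.

```python
def gradient_descent(grad, x0, alpha, num_steps):
    """Repeat the update x <- x - alpha * grad(x) for num_steps iterations."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

def f(x):
    """Assumed test objective with its minimum at (1, -3)."""
    return (x[0] - 1.0)**2 + 2.0 * (x[1] + 3.0)**2

def grad_f(x):
    """Analytic gradient of the test objective."""
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]

# Compare a very small, a moderate, and an overly large step size.
for alpha in (0.001, 0.1, 0.6):
    x_final = gradient_descent(grad_f, [0.0, 0.0], alpha, num_steps=50)
    print(f"alpha={alpha}: x={x_final}, f(x)={f(x_final):.3e}")
```

On this objective, the smallest step size makes little progress in 50 iterations, the moderate one lands close to the minimizer, and the largest one makes the second coordinate grow without bound.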

Formal View

Definition 12.3 — Steepest Descent Direction
The steepest descent direction at $\mathbf{a}$ (where $\nabla f(\mathbf{a}) \neq \mathbf{0}$) is
$$\mathbf{d}^* = -\frac{\nabla f(\mathbf{a})}{\|\nabla f(\mathbf{a})\|}$$
This is the unit vector minimizing $D_\mathbf{u} f(\mathbf{a}) = \nabla f(\mathbf{a}) \cdot \mathbf{u}$ over all unit vectors $\mathbf{u}$.
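
One way to see why this unit vector is the minimizer is the Cauchy-Schwarz inequality. The short derivation below is a sketch that assumes $f$ is differentiable at $\mathbf{a}$ and uses the Euclidean norm. For any unit vector $\mathbf{u}$:
$$\nabla f(\mathbf{a}) \cdot \mathbf{u} \;\geq\; -\|\nabla f(\mathbf{a})\|\,\|\mathbf{u}\| \;=\; -\|\nabla f(\mathbf{a})\|$$
Equality holds exactly when $\mathbf{u}$ points opposite to the gradient, that is, when $\mathbf{u} = -\nabla f(\mathbf{a})/\|\nabla f(\mathbf{a})\|$. The smallest possible directional derivative is therefore
$$\min_{\|\mathbf{u}\| = 1} D_\mathbf{u} f(\mathbf{a}) = -\|\nabla f(\mathbf{a})\|$$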

Why This Matters

Gradient descent, which repeatedly steps along the steepest descent direction, is one of the simplest effective optimization algorithms, and its variants form the core of deep learning training.

  • Gradient descent for training neural networks (the negative gradient gives the training signal)
  • Physics-inspired methods: gradient flow equations $\dot{\mathbf{x}} = -\nabla f(\mathbf{x})$
  • Image denoising: iteratively moving along the negative gradient of a regularization functional
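
The gradient flow equation and the gradient descent update are closely linked. As a sketch, assume the time derivative is approximated by a forward (explicit) Euler difference with time step $\alpha$, writing $\mathbf{x}_k$ for the approximation at time $k\alpha$:
$$\frac{\mathbf{x}_{k+1} - \mathbf{x}_k}{\alpha} = -\nabla f(\mathbf{x}_k) \quad\Longrightarrow\quad \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)$$
Under this discretization, the step size plays the role of the time step, so gradient descent can be read as a discrete approximation of gradient flow.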

Quiz

Question 1

Moving one step in the direction $-\nabla f(\mathbf{a})$ (not normalized) always decreases $f$.

Question 2

In gradient descent, the update rule is $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)$. What is $\alpha$?

Common Mistakes

  • Moving in the gradient direction (ascent) instead of the negative gradient (descent) when minimizing.
  • Choosing a learning rate that is too large, causing divergence.
  • Confusing steepest descent (a direction) with gradient descent (an algorithm/iteration).
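
The first two mistakes are easy to reproduce in one dimension. The sketch below assumes the toy objective $f(x) = x^2$ with $f'(x) = 2x$; the helper `run` and its `sign` parameter are hypothetical names introduced only for this illustration. For this objective each update multiplies $x$ by $1 - 2\alpha$ (descent) or $1 + 2\alpha$ (ascent), so the iterates shrink only when $|1 - 2\alpha| < 1$.

```python
def run(alpha, sign, x0=1.0, num_steps=20):
    """Iterate x <- x + sign * alpha * f'(x) for f(x) = x^2, where f'(x) = 2x."""
    x = x0
    for _ in range(num_steps):
        x = x + sign * alpha * 2.0 * x
    return x

descent = run(alpha=0.1, sign=-1)     # negative gradient, moderate step: factor 0.8 per step
ascent = run(alpha=0.1, sign=+1)      # wrong sign: factor 1.2 per step, moves away from 0
too_large = run(alpha=1.5, sign=-1)   # correct sign, step too large: factor -2 per step

print("descent:  ", descent)
print("ascent:   ", ascent)
print("too large:", too_large)
```

Only the first run approaches the minimizer at $x = 0$; the other two move away from it, one because of the sign and one because of the step size.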