Steepest Descent Direction

For minimization, we want to decrease $f$ as rapidly as possible. The direction of steepest descent is $-\nabla f(\mathbf{a})/\|\nabla f(\mathbf{a})\|$ — exactly opposite to the gradient.

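For differentiable $f$, taking a small step $\alpha > 0$ along the negative gradient gives, to first order: $f(\mathbf{a} - \alpha \nabla f(\mathbf{a})) \approx f(\mathbf{a}) - \alpha\|\nabla f(\mathbf{a})\|^2 < f(\mathbf{a})$. The step $-\alpha \nabla f(\mathbf{a})$ is a positive multiple of the steepest descent direction, so its length is $\alpha\|\nabla f(\mathbf{a})\|$ rather than $\alpha$. The decrease is guaranteed for sufficiently small $\alpha$, as long as the gradient is nonzero.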
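The first-order estimate can be checked numerically. The sketch below is only an illustration: the test function `f`, its gradient `grad_f`, and the point `a` are hypothetical choices, not part of the lesson. It compares the actual decrease in $f$ with the predicted decrease $\alpha\|\nabla f(\mathbf{a})\|^2$ for shrinking step sizes.

```python
import math

def f(x, y):
    # Hypothetical smooth test function, chosen only for illustration.
    return x**2 + 3 * y**2 + math.sin(x)

def grad_f(x, y):
    # Analytic gradient of f.
    return (2 * x + math.cos(x), 6 * y)

a = (1.0, -0.5)                      # hypothetical starting point
g = grad_f(*a)
grad_norm_sq = g[0]**2 + g[1]**2     # ||grad f(a)||^2

rows = []
for alpha in (1e-1, 1e-2, 1e-3):
    stepped = (a[0] - alpha * g[0], a[1] - alpha * g[1])
    actual = f(*a) - f(*stepped)     # true decrease in f
    predicted = alpha * grad_norm_sq # first-order prediction
    rows.append((alpha, actual, predicted))
    print(f"alpha={alpha:g}  actual={actual:.6f}  predicted={predicted:.6f}")
```

For this particular function the two columns should agree more closely as `alpha` shrinks, which is what "to first order" means in practice.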

This is the mathematical justification for gradient descent: at each step, update $\mathbf{x} \leftarrow \mathbf{x} - \alpha \nabla f(\mathbf{x})$. The parameter $\alpha > 0$ is called the step size or learning rate. If $\alpha$ is too small, convergence is slow; if it is too large, the iterates may overshoot the minimum and diverge.
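A minimal sketch of the update rule and the effect of the step size follows. It assumes a hypothetical one-dimensional objective $f(x) = x^2$, whose gradient is $2x$; the helper name `gradient_descent` and the three values of `alpha` are illustrative choices.

```python
def gradient_descent(grad, x0, alpha, steps):
    """Iterate x <- x - alpha * grad(x) and return the final iterate."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

def grad(x):
    # Gradient of the hypothetical objective f(x) = x**2, minimized at x = 0.
    return 2 * x

# Same start and step count, three different learning rates.
results = {alpha: gradient_descent(grad, x0=1.0, alpha=alpha, steps=20)
           for alpha in (0.01, 0.4, 1.1)}
for alpha, x_final in results.items():
    print(f"alpha={alpha}: x after 20 steps = {x_final:.6g}")
```

On this objective each update multiplies $x$ by $1 - 2\alpha$, so the factors are $0.98$ (slow progress), $0.2$ (fast convergence), and $-1.2$ (the iterates grow in magnitude and diverge).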

Formal View

Definition 12.3 — Steepest Descent Direction

The steepest descent direction at $\mathbf{a}$ (where $\nabla f(\mathbf{a}) \neq \mathbf{0}$) is $\mathbf{d}^* = -\frac{\nabla f(\mathbf{a})}{\|\nabla f(\mathbf{a})\|}$. This is the unit vector minimizing $D_\mathbf{u} f(\mathbf{a}) = \nabla f(\mathbf{a}) \cdot \mathbf{u}$ over all unit vectors $\mathbf{u}$.
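The minimizing property in the definition can be illustrated with a brute-force sweep in two dimensions. The gradient value `g` below is a hypothetical example, and the sweep over angles is only a sanity check, not a proof (the proof follows from the Cauchy-Schwarz inequality).

```python
import math

# Hypothetical value of grad f(a), chosen for illustration.
g = (3.0, -4.0)
g_norm = math.hypot(*g)
d_star = (-g[0] / g_norm, -g[1] / g_norm)   # steepest descent direction

def directional_derivative(u):
    # D_u f(a) = grad f(a) . u
    return g[0] * u[0] + g[1] * u[1]

# Sweep unit vectors u = (cos t, sin t) around the circle in 0.1 degree steps.
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best_u = min(((math.cos(t), math.sin(t)) for t in angles),
             key=directional_derivative)

print("d* =", d_star, " D_{d*} f =", directional_derivative(d_star))
print("best sampled u =", best_u, " D_u f =", directional_derivative(best_u))
```

The sampled minimizer should land next to $\mathbf{d}^*$, and the directional derivative along $\mathbf{d}^*$ equals $-\|\nabla f(\mathbf{a})\|$, the smallest value any unit vector can achieve.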

Why This Matters

Steepest descent is one of the simplest effective optimization methods, and its variants form the core of deep learning training.

Gradient descent for training neural networks (the negative gradient gives the training signal)
Physics-inspired methods: gradient flow equations $\dot{\mathbf{x}} = -\nabla f(\mathbf{x})$
Image denoising: iteratively moving in the negative gradient direction of a regularization functional
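The link between gradient flow and gradient descent can be made concrete: discretizing $\dot{\mathbf{x}} = -\nabla f(\mathbf{x})$ with forward Euler steps of size $h$ yields the gradient descent update with $\alpha = h$. The sketch below assumes a hypothetical objective $f(x) = x^2/2$, for which the flow has the closed-form solution $x(t) = x_0 e^{-t}$; the function name `euler_gradient_flow` is illustrative.

```python
import math

def euler_gradient_flow(grad, x0, h, t_end):
    """Integrate dx/dt = -grad(x) with forward Euler steps of size h."""
    x = x0
    for _ in range(round(t_end / h)):
        x = x - h * grad(x)   # same form as a gradient descent update with alpha = h
    return x

def grad(x):
    # Gradient of the hypothetical objective f(x) = x**2 / 2.
    return x

x0, t_end = 1.0, 2.0
exact = x0 * math.exp(-t_end)   # closed-form solution of the flow at t_end
approx = {h: euler_gradient_flow(grad, x0, h, t_end) for h in (0.5, 0.1, 0.01)}
for h, x in approx.items():
    print(f"h={h}: Euler x({t_end}) = {x:.6f}, exact = {exact:.6f}")
```

As `h` shrinks, the discrete iterates should track the continuous flow more closely, which is one way to read gradient descent as a sampled version of the gradient flow.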

Learning Resources

Gradient Descent

StatQuest

Step-by-step explanation of gradient descent with examples.

15 min

Gradient Descent, How Neural Networks Learn

3Blue1Brown

How gradient descent drives learning in neural networks.

21 min

Quiz

Question 1

True or false: moving one step in the direction $-\nabla f(\mathbf{a})$ (not normalized) always decreases $f$.

Question 2

In gradient descent, the update rule is $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)$. What is $\alpha$?

Common Mistakes

Moving in the gradient direction (ascent) instead of the negative gradient (descent) when minimizing.
Choosing a learning rate that is too large, causing divergence.
Confusing steepest descent (a direction) with gradient descent (an algorithm/iteration).