Linear Algebra

Motivation for the Chain Rule

Many practical functions are naturally expressed as compositions: $h = g \circ \mathbf{f}$ means $h(\mathbf{x}) = g(\mathbf{f}(\mathbf{x}))$. To optimize or analyze $h$, we need to differentiate it. The chain rule gives us the derivative of a composition in terms of the derivatives of its parts.

Example: the loss function in a neural network is a composition of many layers. Training the network requires differentiating the loss, which means applying the chain rule repeatedly; this is exactly what backpropagation does.

The chain rule is also essential for change of variables: when solving differential equations in a different coordinate system, or converting integrals, the Jacobian of the coordinate transformation arises via the chain rule.

Formal View

Remark 14.1 — Why We Need the Chain Rule
Given $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^k$ and $\mathbf{g}: \mathbb{R}^k \to \mathbb{R}^m$, the composition $\mathbf{h} = \mathbf{g} \circ \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ is defined by $\mathbf{h}(\mathbf{x}) = \mathbf{g}(\mathbf{f}(\mathbf{x}))$. If $\mathbf{f}$ is differentiable at $\mathbf{a}$ and $\mathbf{g}$ is differentiable at $\mathbf{f}(\mathbf{a})$, the chain rule states $J\mathbf{h}(\mathbf{a}) = J\mathbf{g}(\mathbf{f}(\mathbf{a})) \cdot J\mathbf{f}(\mathbf{a})$: the matrix product of the two Jacobians, an $m \times k$ matrix times a $k \times n$ matrix, giving the $m \times n$ Jacobian of $\mathbf{h}$.
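The identity can be checked numerically. The sketch below is only an illustration: the maps `f` and `g`, the point `a`, and the helpers `jacobian` and `matmul` are arbitrary choices made for this example, and the Jacobians are approximated by central finite differences rather than computed exactly.

```python
import math

def f(x):
    # Illustrative map f: R^2 -> R^3 (chosen for this example only).
    x1, x2 = x
    return [x1 * x2, math.sin(x1), x2 ** 2]

def g(y):
    # Illustrative map g: R^3 -> R^2 (chosen for this example only).
    y1, y2, y3 = y
    return [y1 + y2 * y3, math.exp(y1) - y3]

def h(x):
    # The composition h = g o f: R^2 -> R^2.
    return g(f(x))

def jacobian(func, a, eps=1e-6):
    # Approximate the Jacobian of func at a by central differences.
    # Entry J[i][j] approximates d func_i / d x_j.
    m, n = len(func(a)), len(a)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus, minus = list(a), list(a)
        plus[j] += eps
        minus[j] -= eps
        fp, fm = func(plus), func(minus)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

def matmul(A, B):
    # Plain matrix product of two nested-list matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

a = [0.5, -1.2]
Jf = jacobian(f, a)        # 3 x 2, evaluated at a
Jg = jacobian(g, f(a))     # 2 x 3, evaluated at f(a), not at a
Jh = jacobian(h, a)        # 2 x 2, differentiating the composition directly
product = matmul(Jg, Jf)   # 2 x 2, the chain-rule prediction

max_err = max(abs(Jh[i][j] - product[i][j])
              for i in range(2) for j in range(2))
print("largest entrywise difference:", max_err)
```

The two matrices should agree up to finite-difference error. Note that the shapes only line up in the order `Jg` times `Jf`.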

Why This Matters

The chain rule is one of the central theorems of differential calculus. It underlies backpropagation, implicit differentiation, and change of variables.

  • Backpropagation: chain rule applied layer by layer through a neural network
  • Implicit differentiation: differentiating equations like $F(x, y) = 0$ to find $dy/dx$
  • Change of coordinates in integrals: the substitution formula uses the chain rule via the Jacobian determinant
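To make the backpropagation item concrete, here is a minimal sketch on a hypothetical two-layer scalar network. The architecture, the names `forward` and `backward`, and the numeric values are all assumptions made for illustration. Each line of `backward` multiplies the gradient received from the layer above by one local derivative, which is the chain rule applied one layer at a time.

```python
import math

def forward(x, w1, w2, t):
    # Hypothetical network: x -> z1 = w1*x -> a1 = tanh(z1) -> z2 = w2*a1,
    # followed by a squared-error loss against target t.
    z1 = w1 * x
    a1 = math.tanh(z1)
    z2 = w2 * a1
    loss = (z2 - t) ** 2
    return z1, a1, z2, loss

def backward(x, w1, w2, t):
    # Walk the composition from the loss back to the weights.
    z1, a1, z2, loss = forward(x, w1, w2, t)
    dloss_dz2 = 2 * (z2 - t)               # derivative of the loss layer
    dloss_dw2 = dloss_dz2 * a1             # z2 = w2 * a1
    dloss_da1 = dloss_dz2 * w2             # pass the gradient to the layer below
    dloss_dz1 = dloss_da1 * (1 - a1 ** 2)  # derivative of tanh is 1 - tanh^2
    dloss_dw1 = dloss_dz1 * x              # z1 = w1 * x
    return dloss_dw1, dloss_dw2

x, w1, w2, t = 0.7, -0.3, 1.5, 0.2
grad_w1, grad_w2 = backward(x, w1, w2, t)

# Compare against central finite differences of the loss.
eps = 1e-6
num_w1 = (forward(x, w1 + eps, w2, t)[3] - forward(x, w1 - eps, w2, t)[3]) / (2 * eps)
num_w2 = (forward(x, w1, w2 + eps, t)[3] - forward(x, w1, w2 - eps, t)[3]) / (2 * eps)
print(grad_w1, num_w1)
print(grad_w2, num_w2)
```

The analytic and numerical gradients should match closely. Real networks replace the scalar products with Jacobian products, but the layer-by-layer structure is the same.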

Quiz

Question 1

The chain rule for Jacobians states $J(\mathbf{g}\circ\mathbf{f})(\mathbf{a})$ equals:

Question 2

Backpropagation in neural networks is an application of the chain rule.

Common Mistakes

  • Evaluating $J\mathbf{g}$ at $\mathbf{a}$ instead of at $\mathbf{f}(\mathbf{a})$.
  • Reversing the order of matrix multiplication in the chain rule.
  • Applying the chain rule to sums instead of compositions.
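The three mistakes above can be seen on a small example. This is a sketch with maps chosen purely for illustration: `f`, `g`, the hand-derived Jacobians `Jf`, `Jg`, `Jh`, and the point `a` are all assumptions of this example. Both maps go from $\mathbb{R}^2$ to $\mathbb{R}^2$ so that the incorrect expressions are still well-defined and can be compared with the correct one.

```python
def f(x):
    # Illustrative map f: R^2 -> R^2.
    x1, x2 = x
    return [x1 * x2, x1 + x2]

def Jf(x):
    # Jacobian of f, derived by hand.
    x1, x2 = x
    return [[x2, x1], [1, 1]]

def Jg(y):
    # Jacobian of the illustrative map g(y) = (y1^2, y1*y2), derived by hand.
    y1, y2 = y
    return [[2 * y1, 0], [y2, y1]]

def Jh(x):
    # Jacobian of h = g o f, where h(x) = (x1^2 x2^2, x1^2 x2 + x1 x2^2),
    # derived directly without the chain rule.
    x1, x2 = x
    return [[2 * x1 * x2 ** 2, 2 * x1 ** 2 * x2],
            [2 * x1 * x2 + x2 ** 2, x1 ** 2 + 2 * x1 * x2]]

def matmul(A, B):
    # Plain matrix product of two nested-list matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matadd(A, B):
    # Entrywise sum of two matrices of the same shape.
    return [[A[i][j] + B[i][j] for j in range(len(A[0]))]
            for i in range(len(A))]

a = [1, 2]

correct = matmul(Jg(f(a)), Jf(a))       # Jg at f(a), times Jf at a
wrong_point = matmul(Jg(a), Jf(a))      # mistake 1: Jg evaluated at a
wrong_order = matmul(Jf(a), Jg(f(a)))   # mistake 2: factors reversed

# Mistake 3: a sum is not a composition. The Jacobian of f + g is the
# sum Jf + Jg, with both terms evaluated at a and no product involved.
sum_jacobian = matadd(Jf(a), Jg(a))

print(correct == Jh(a))       # chain rule agrees with the direct Jacobian
print(wrong_point == Jh(a))   # wrong evaluation point disagrees
print(wrong_order == Jh(a))   # reversed order disagrees
```

When the inner and outer dimensions differ, the first two mistakes usually fail outright because the shapes do not match. The square case is the dangerous one, since the wrong expression still produces a matrix of the expected size.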