The Gradient Problem
Suppose you have a function built from hundreds of simple operations chained together — like a neural network computing its loss from its parameters. To train with gradient descent, you need every partial .
The obvious approach: symbolically differentiate through the chain. But applied recursively to a complex network, the chain rule recomputes the same intermediate expressions exponentially many times. For a circuit with 1000 nodes, this can be catastrophic.
Backpropagation is smarter. It computes all partials at a fixed input using exactly two passes over the circuit — one forward, one backward. Intermediate values computed forward get reused backward, eliminating redundancy. The total cost is roughly the same as evaluating twice.
Formal View
Why This Matters
Every modern neural network — GPT, image classifiers, AlphaFold — is trained using backpropagation.
- Training deep neural networks with millions of parameters efficiently
- PyTorch autograd and JAX are implementations of this exact algorithm
- Physics simulators and differentiable renderers that need gradients
- Gradient-based hyperparameter optimization and meta-learning
Learning Resources
What is backpropagation really doing?
3Blue1Brown
Visual intuition for how gradients flow backward through a neural network.
The spelled-out intro to neural networks and backpropagation: building micrograd
Andrej Karpathy
Build a complete automatic differentiation engine from scratch in Python.
Backpropagation, intuitively
StatQuest with Josh Starmer
Step-by-step walkthrough of a full backprop calculation with numbers.
Quiz
Why is naive symbolic differentiation inefficient for large circuits?
Backpropagation computes all partials using:
Backpropagation computes exact partial derivatives (not approximations) at a specific point .
The circuit used for backpropagation must be acyclic (no loops).
Backpropagation computes the gradient of the circuit output with respect to:
Common Mistakes
- Confusing backpropagation with gradient descent — backprop only *computes* the gradients; gradient descent uses them to update parameters.
- Thinking backprop produces symbolic derivatives — it computes numerical values at a specific , not symbolic formulas.
- Assuming each input needs its own backward pass — that is exactly the naive approach backprop avoids.