Putting It All Together
The full backpropagation algorithm is built from just two rules applied in sequence. Here is the complete procedure:
1. Forward pass: Set the inputs to their chosen values. Propagate left to right, storing the value computed at each function node. Record the output.
2. Seed: Place the number 1 on the output wire. This represents the derivative of the output with respect to itself.
3. Backward pass: Process nodes right to left. Function node: multiply the incoming gradient by each local derivative and place each product on the corresponding input wire. Splitter node: add all incoming gradients and place the sum on the input wire.
4. Read results: The numbers that arrive at each input wire are exactly the partial derivatives of the output with respect to that input. Done.
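The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Node` class, the helper functions `add`, `mul`, and `sin`, and the example function f(x, y) = x*y + sin(x) are all assumptions made for this sketch. Because x feeds two function nodes, it also exercises the splitter rule.

```python
import math

class Node:
    """A value on a wire, plus what is needed to send gradient back to its inputs."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # stored during the forward pass
        self.parents = parents          # nodes whose wires feed this node
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0                 # filled in by the backward pass

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def sin(a):
    return Node(math.sin(a.value), (a,), (math.cos(a.value),))

def backprop(output):
    # Order the nodes so that each one is handled after every node that consumes it.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)
    visit(output)

    output.grad = 1.0                 # step 2: seed the output wire
    for node in reversed(order):      # step 3: right to left
        for parent, local in zip(node.parents, node.local_grads):
            # Function-node rule: multiply by the local derivative.
            # Splitter rule: "+=" adds up gradients from every consumer.
            parent.grad += node.grad * local

# Step 1: forward pass for f(x, y) = x*y + sin(x) at x = 2, y = 3.
x, y = Node(2.0), Node(3.0)
f = add(mul(x, y), sin(x))

backprop(f)

# Step 4: read the results off the input nodes.
print(f.value, x.grad, y.grad)
```

Here `x.grad` should agree with the hand-derived y + cos(x) and `y.grad` with x, which is a quick way to check the sketch against calculus.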
The total cost is proportional to circuit size: roughly one extra forward pass's worth of work. Computing all the partial derivatives costs roughly as much as evaluating the function itself twice (one forward pass plus one backward pass). That is what makes training networks with millions of parameters tractable.
Formal View
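One way to state the two backward-pass rules and the seed formally is given below. The notation is an assumption made for this sketch: f is the circuit's output, a bar over a wire's name denotes the gradient of f with respect to that wire, g is the function computed at a function node, and v_1, ..., v_m are the copies produced by a splitter.

```latex
% Seed on the output wire:
\bar{f} = \frac{\partial f}{\partial f} = 1

% Function node y = g(u_1, \dots, u_k), with incoming gradient \bar{y}:
\bar{u}_i = \bar{y} \cdot \frac{\partial g}{\partial u_i}, \qquad i = 1, \dots, k

% Splitter node copying u onto wires v_1, \dots, v_m:
\bar{u} = \sum_{j=1}^{m} \bar{v}_j
```

The function-node rule is the chain rule applied to one wire; the splitter rule is the multivariable chain rule summing over every path through which u reaches the output.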
Why This Matters
Backpropagation is arguably the most impactful algorithm in modern machine learning. Without it, computing the gradient for training deep networks would require on the order of one forward pass per parameter, which is impractical for millions of parameters. Its applications include:
- Training all modern neural networks: LLMs, image classifiers, speech models, protein folding
- Differentiable programming: compiling arbitrary code to be differentiable
- Second-order optimization by running backprop through backprop
- Scientific computing where gradients of physical simulations are needed
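To make the per-parameter cost mentioned above concrete, here is a small sketch that estimates a gradient by finite differences and counts forward passes. The function `f` is a hypothetical stand-in for a network's loss, and `finite_difference_gradient` is a name chosen for this illustration.

```python
def f(params):
    # Hypothetical stand-in for a loss: the sum of squared parameters.
    return sum(p * p for p in params)

def finite_difference_gradient(f, params, eps=1e-6):
    """Estimate the gradient using one extra forward pass per parameter."""
    forward_passes = 0
    base = f(params)
    forward_passes += 1
    grad = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps                       # nudge one parameter
        grad.append((f(bumped) - base) / eps)  # slope along that parameter
        forward_passes += 1
    return grad, forward_passes

params = [0.5, -1.0, 2.0, 3.0]
grad, passes = finite_difference_gradient(f, params)
print(passes)  # len(params) + 1 forward passes
```

With four parameters this takes five forward passes; the count grows linearly with the number of parameters, whereas backpropagation needs one forward and one backward pass regardless.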
Learning Resources
- What is backpropagation really doing? (3Blue1Brown). The complete algorithm assembled, with neural network context.
- Backpropagation, step by step (StatQuest with Josh Starmer). Complete numerical walkthrough of both forward and backward passes.
- The spelled-out intro to neural networks and backpropagation: building micrograd (Andrej Karpathy). Implements the complete algorithm in Python, then uses it to train a small neural network.
Quiz
What is the computational cost of backpropagation relative to one forward pass?
In which order does the backward pass process nodes?
After backpropagation, the numbers on input wires represent:
True or false: changing the input values requires rerunning both the forward and backward passes.
Which correctly summarizes the two backward-pass rules?
Common Mistakes
- Swapping the rules: adding at function nodes and multiplying at splitters.
- Forgetting to seed with 1 on the output wire.
- Running the backward pass before completing the forward pass — you need stored values.
- Interpreting gradient numbers as an approximation rather than as exact derivatives at the current input values.
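Several of the points above can be checked on a tiny hand-written example. The sketch below is an illustration under assumed names (`forward_backward`, `finite_difference`) for the circuit f(x) = x * x, where a splitter copies x onto both inputs of a single multiply node. It shows the seed, both rules in their correct roles, and that the result is exact at the chosen input while a finite-difference estimate is only close.

```python
def forward_backward(x):
    # Forward pass: the splitter copies x onto two wires feeding one multiply node.
    a, b = x, x
    out = a * b
    # Seed the output wire with 1.
    g_out = 1.0
    # Function-node rule: multiply the incoming gradient by each local derivative.
    g_a = g_out * b   # d(a*b)/da = b
    g_b = g_out * a   # d(a*b)/db = a
    # Splitter rule: add the gradients arriving on both copies of x.
    g_x = g_a + g_b
    return out, g_x

def finite_difference(x, eps=1e-3):
    # Approximate slope; the error shrinks with eps but does not vanish.
    return ((x + eps) ** 2 - x ** 2) / eps

out, g = forward_backward(3.0)
print(g)                       # 6.0, the exact derivative 2*x at x = 3
print(finite_difference(3.0))  # close to 6, but not exact

# The gradient belongs to one input point: a new x means rerunning both passes.
out2, g2 = forward_backward(5.0)
print(g2)                      # 10.0
```

Swapping the two rules in this circuit (adding at the multiply node, multiplying at the splitter) would generally give a different number from 2*x, which is an easy way to spot that mistake.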