Linear Algebra
15.78 min read

Putting It All Together

The full backpropagation algorithm is built from just two rules applied in sequence. Here is the complete procedure:

1 — Forward pass: Set inputs to t0\mathbf{t}_0. Propagate left to right, storing s0\mathbf{s}_0 at each function node. Record f0=f(t0)f_0 = f(\mathbf{t}_0).

2 — Seed: Place the number 1 on the output wire. This represents ff=1\frac{\partial f}{{\partial f}} = 1.

3 — Backward pass: Process nodes right to left. Function node: multiply the incoming gradient by each local derivative and place the result on each input wire. Splitter node: add all incoming gradients and place the sum on the input wire.

4 — Read results: The numbers that arrive at each tit_i are exactly fti(t0)\frac{\partial f}{{\partial t_i}}(\mathbf{t}_0). Done.

The total cost is proportional to circuit size — roughly one extra forward pass worth of work. Computing all nn partial derivatives costs no more than computing ff itself twice. That is what makes training networks with millions of parameters tractable.

Formal View

Theorem 15.1 — Backpropagation Correctness
The backpropagation algorithm correctly computes fti(t0)\frac{\partial f}{\partial t_i}(\mathbf{t}_0) for all ii simultaneously. The time complexity is O(circuit)O(|\text{circuit}|) — proportional to circuit size, independent of the number of inputs nn. This is optimal up to constants.

Why This Matters

Backpropagation is arguably the most impactful algorithm in modern machine learning. Without it, training deep networks would require O(n)O(n) forward passes — impractical for millions of parameters.

  • Training all modern neural networks: LLMs, image classifiers, speech models, protein folding
  • Differentiable programming: compiling arbitrary code to be differentiable
  • Second-order optimization by running backprop through backprop
  • Scientific computing where gradients of physical simulations are needed

Quiz

Question 1

Computational cost of backpropagation relative to one forward pass?

Question 2

Backward pass processes nodes in which order?

Question 3

After backpropagation, the numbers on input wires tit_i represent:

Question 4

Changing t0\mathbf{t}_0 requires rerunning both the forward and backward passes.

Question 5

Which correctly summarizes the two backward-pass rules?

Common Mistakes

  • Swapping the rules: adding at function nodes and multiplying at splitters.
  • Forgetting to seed with 1 on the output wire.
  • Running the backward pass before completing the forward pass — you need stored values.
  • Interpreting gradient numbers as an approximation rather than an exact value at t0\mathbf{t}_0.