The Function Node Rule
Here is the core of backpropagation: how does a function node pass gradient information backward?
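A node $f$ computes $z = f(x_1, \dots, x_n)$ from its inputs $x_1, \dots, x_n$. During the backward pass, it receives a number along its output wire: the "upstream gradient" $\partial L / \partial z$, where $L$ is the final output of the circuit. Its job is to produce a gradient $\partial L / \partial x_i$ for each input wire $x_i$.
The answer is a product: upstream gradient × local gradient. The local gradient for input $x_i$ is $\partial z / \partial x_i$, computable from the stored inputs $x_1, \dots, x_n$. The node multiplies the two and sends the result backward along wire $x_i$.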
This is just the chain rule, applied one node at a time. Each node does a tiny local computation — no node needs to know the structure of the rest of the circuit.
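As a minimal sketch of this rule, the hypothetical `MulNode` class below (a name invented here for illustration, not taken from any library) stores its inputs during the forward pass. During the backward pass it multiplies the upstream gradient by each local partial derivative.

```python
class MulNode:
    """Hypothetical function node computing z = x * y."""

    def forward(self, x, y):
        # Store the inputs: the local gradients depend on them.
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # Local gradients for a product: dz/dx = y and dz/dy = x.
        grad_x = upstream * self.y
        grad_y = upstream * self.x
        return grad_x, grad_y


node = MulNode()
z = node.forward(3.0, -2.0)            # z = -6.0
grad_x, grad_y = node.backward(0.5)    # upstream gradient of 0.5
print(z, grad_x, grad_y)               # -6.0 -1.0 1.5
```

The node never looks at the rest of the circuit: everything it needs is the single upstream number plus its own stored inputs.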
Formal View
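For a node computing $z = f(x_1, \dots, x_n)$ inside a circuit with final output $L$:
$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x_i}$$
This is one application of the multivariate chain rule. The product has exactly two factors: upstream (from the right) and local (computed from the stored inputs).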
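One way to sanity-check the two-factor product is to compare it with a finite-difference estimate. The sketch below assumes a toy circuit chosen only for illustration: a tanh node followed by a squaring step that plays the role of the final output.

```python
import math

def loss(x):
    z = math.tanh(x)   # the function node
    return z ** 2      # downstream computation: L = z^2

x = 0.7                # stored input
z = math.tanh(x)       # forward value of the node

upstream = 2.0 * z            # dL/dz, arriving from the right
local = 1.0 - z ** 2          # dz/dx for tanh, evaluated at the stored input
analytic = upstream * local   # chain rule: dL/dx

# Central finite difference as an independent estimate of dL/dx.
eps = 1e-6
numeric = (loss(x + eps) - loss(x - eps)) / (2 * eps)

print(abs(analytic - numeric) < 1e-6)   # True
```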
Why This Matters
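Every differentiable PyTorch operation (matmul, ReLU, softmax, etc.) applies exactly this rule in its backward function; custom operations define it in the backward() method of a torch.autograd.Function subclass. For example: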
- ReLU backward: local gradient is 1 if input > 0, else 0
- Multiply node: gradient to each input = other input × upstream gradient
- Sigmoid backward: local gradient is $\sigma(x)(1 - \sigma(x))$, evaluated at the stored input $x$
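- Matrix multiply backward: for $Z = XW$ with upstream gradient $G$, the gradient to $X$ is $G W^\top$ and the gradient to $W$ is $X^\top G$ (upstream × transpose of the other matrix factor, multiplied on the side where that factor appeared)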
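The four rules above can be sketched in plain Python. These helper functions are illustrative assumptions written for this example, not PyTorch's actual implementation; matrices are represented as nested lists to keep the block self-contained.

```python
import math

def relu_backward(x, upstream):
    # Local gradient is 1 where the stored input is positive, else 0.
    return upstream * (1.0 if x > 0 else 0.0)

def mul_backward(x, y, upstream):
    # Each input receives the other input times the upstream gradient.
    return upstream * y, upstream * x

def sigmoid_backward(x, upstream):
    s = 1.0 / (1.0 + math.exp(-x))
    # Local gradient is sigma(x) * (1 - sigma(x)) at the stored input.
    return upstream * s * (1.0 - s)

def matmul(a, b):
    # Plain-Python matrix product of nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul_backward(x, w, upstream):
    # For Z = X W with G = dL/dZ: dL/dX = G W^T and dL/dW = X^T G.
    return matmul(upstream, transpose(w)), matmul(transpose(x), upstream)

print(relu_backward(-1.0, 2.0))        # 0.0
print(mul_backward(3.0, 4.0, 0.5))     # (2.0, 1.5)
print(sigmoid_backward(0.0, 4.0))      # 1.0

X = [[1.0, 2.0]]        # shape 1x2
W = [[3.0], [4.0]]      # shape 2x1
G = [[1.0]]             # upstream gradient, same shape as Z = X W (1x1)
print(matmul_backward(X, W, G))        # ([[3.0, 4.0]], [[1.0], [2.0]])
```

Note that each returned gradient has the same shape as the input it belongs to, which is a quick way to check that the transpose and the multiplication order are right.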
Learning Resources
- Backpropagation, step by step (StatQuest with Josh Starmer): walks through the function node rule with explicit numerical examples.
- The spelled-out intro to neural networks and backpropagation: building micrograd (Andrej Karpathy): derives the backward rule for add, multiply, tanh, and other operations from scratch.
Quiz
Node receives upstream gradient . Stored: , . What gradient goes on wire ?
For node , stored , upstream gradient . What gradient goes on wire ?
What two values does a function node multiply together in the backward pass?
True or false: a function node with two inputs sends the same gradient to both input wires.
Why does the function node need the stored input values?
Common Mistakes
- Multiplying the upstream gradient by the node's forward output $z$ instead of by the local derivative $\partial z / \partial x_i$.
- Sending the same gradient to all input wires — each gets a different local partial.
- Confusing the upstream gradient (arriving from the right) with the gradient being sent leftward (what the node produces).
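The first mistake is easy to see numerically. The sketch below assumes a single tanh node with an arbitrarily chosen input and upstream gradient; the variable names are illustrative only.

```python
import math

x, upstream = 0.5, 2.0
z = math.tanh(x)                    # forward value of the node

wrong = upstream * z                # mistake: multiplies by the forward value
right = upstream * (1.0 - z ** 2)   # upstream times the local derivative dz/dx

print(wrong, right)
```

The two numbers differ, and only the second matches what the chain rule prescribes.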