The Splitter Node Rule
Splitter nodes handle the case where one value feeds multiple downstream computations. A splitter takes input $x$ and produces copies $x_1 = x$ and $x_2 = x$. In the backward pass, it receives two gradient numbers, one from each output wire. Its job: compute the gradient to place on input wire $x$.
There are two paths from $x$ to the output (through $x_1$ and through $x_2$). When a variable affects the output through multiple routes, the total derivative is the sum of each route's contribution: the multivariate chain rule.
Since the splitter just copies ($x_1 = x_2 = x$, local partial = 1 for each), the gradient on $x$ is simply: add the two incoming gradients. More generally, if $x$ fans out to $k$ copies, add all $k$ incoming gradients.
This sum rule explains why a parameter that appears many times in a model receives a gradient accumulated over every one of its uses.
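The sum rule can be checked numerically. The sketch below is illustrative only: the function `f`, the helper `splitter_backward`, and the chosen input value are assumptions made for this example, not part of the lesson's notation. One input feeds two branches, the two branch gradients are added, and the result is compared with a finite-difference estimate.

```python
def splitter_backward(incoming_grads):
    """Gradient on the splitter's input: the sum of the gradients from all copies."""
    return sum(incoming_grads)


def f(x):
    # x is copied to two branches: a squaring node and a times-3 node.
    a = x * x
    b = 3.0 * x
    return a + b


x = 2.0
# Gradient arriving on each copy of x (the add node passes gradient 1 to both branches).
grad_from_square = 2.0 * x   # d(x*x)/dx
grad_from_scale = 3.0        # d(3x)/dx
grad_x = splitter_backward([grad_from_square, grad_from_scale])

# Central finite difference as an independent check.
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(grad_x, numeric)
```

The two printed numbers should agree up to floating-point error, which would not be the case if the incoming gradients were multiplied or averaged.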
Formal View
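Two paths from $x$ to output → sum the two gradient contributions. Writing the copies as $x_1 = x_2 = x$ and the final output as $L$, the splitter's local Jacobian is $[1, 1]^\top$, so the chain rule gives $\partial L/\partial x = \partial L/\partial x_1 + \partial L/\partial x_2$.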
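Read as a vector-Jacobian product, the same statement extends to any number of copies. The following is a minimal sketch under that reading; the name `splitter_vjp` and the sample gradient values are hypothetical.

```python
def splitter_vjp(upstream):
    """Vector-Jacobian product for a k-way copy.

    The forward map x -> (x, ..., x) has a Jacobian that is a column of k ones,
    so the backward pass computes sum_i upstream[i] * 1.
    """
    jacobian = [1.0] * len(upstream)   # one local partial per copy
    return sum(g * j for g, j in zip(upstream, jacobian))


two_way = splitter_vjp([2.0, 5.0])           # two copies
three_way = splitter_vjp([0.5, -1.5, 4.0])   # three copies
print(two_way, three_way)
```

Because every local partial is 1, the product collapses to a plain sum; the explicit `jacobian` list is kept only to show where the ones come from.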
Why This Matters
Any parameter or activation that is used in more than one place (tied weights, a recurrent matrix, the input to a skip connection) receives the sum of the gradients from all its uses:
- Shared weights in RNNs: same matrix used at each timestep, gradients sum over all steps
- Residual connections (skip connections) in ResNets: gradient flows through two paths and adds
- Attention mechanisms apply the same key/query/value projection matrices at every position, so their gradients sum over positions
- Gradient accumulation for large effective batch sizes exploits this additive structure
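The shared-weight case can be sketched with a scalar recurrence. This is a toy stand-in for an RNN, not a real one: the recurrence `h_t = w * h_{t-1}`, the function name `shared_weight_grad`, and the numbers used are all assumptions chosen to keep the arithmetic easy to follow. Each use of `w` is treated as its own copy, and the splitter rule sums the per-copy gradients.

```python
def shared_weight_grad(w, h0, steps):
    """Gradient of the final state h_T with respect to a weight reused at every step."""
    # Forward pass, keeping every hidden state for the backward pass.
    h = [h0]
    for _ in range(steps):
        h.append(w * h[-1])

    # Backward pass from the last step to the first.
    grad_h = 1.0      # gradient of h_T with respect to itself
    per_use = []
    for t in range(steps, 0, -1):
        per_use.append(grad_h * h[t - 1])  # gradient for the copy of w used at step t
        grad_h = grad_h * w                # gradient passed back to h_{t-1}

    # Splitter rule: w fans out to `steps` copies, so add all contributions.
    return sum(per_use), per_use


total, per_use = shared_weight_grad(w=2.0, h0=1.0, steps=3)
closed_form = 3 * 2.0 ** 2 * 1.0   # steps * w**(steps - 1) * h0
print(per_use, total, closed_form)
```

The summed gradient matches the closed-form derivative of `w**steps * h0`, which is what the splitter rule predicts when one weight is reused at every timestep.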
Quiz
A splitter sends $x$ to nodes $A$ and $B$. Node $A$ sends back gradient 3; node $B$ sends back 7. What is the gradient on $x$?
Why does the splitter add (not multiply) incoming gradients?
A splitter with $k$ outputs must collect all $k$ upstream gradients before it can send a gradient backward.
A residual block computes $y = F(x) + x$. Using the splitter rule, the gradient on $x$ is:
If $x$ feeds a single downstream node (no splitter), the function node rule applies directly to $x$.
Common Mistakes
- Swapping the rules: adding at function nodes (should multiply) and multiplying at splitters (should add).
- Forgetting the splitter must wait for ALL downstream gradients before computing its sum.
- Taking the average of incoming gradients rather than the sum.
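One way to see how these mistakes are avoided in practice is a tiny reverse-mode sketch. The `Value` class and `backward` function below are hypothetical and written only for this illustration; they are not taken from any library. Function nodes multiply by their local partial, a value used more than once accumulates with `+=`, and nodes are processed in reverse topological order so every consumer has contributed before a node passes its gradient on.

```python
class Value:
    """Scalar node in a tiny reverse-mode graph (illustrative only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # input nodes
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))


def backward(output):
    # Post-order DFS gives a topological order (inputs before consumers).
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Reversed order: a node is handled only after all of its consumers.
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            # Function node rule: multiply by the local partial.
            # Splitter rule: accumulate with +=, never average or multiply.
            parent.grad += local * node.grad


x = Value(2.0)
c = Value(5.0)
y = (x + c) * x   # x fans out to two uses
backward(y)
print(x.grad)     # analytic derivative is 2*x + c
```

Replacing `+=` with plain assignment would keep only the last contribution, and dividing by the number of uses would give the average; neither matches the analytic derivative.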