
The Splitter Node Rule

Splitter nodes handle the case where one value feeds multiple downstream computations. A splitter takes input $s$ and produces copies $x_1, x_2$. In the backward pass it receives two gradient values, one from each output wire, and its job is to compute the gradient to place on the input wire $s$.

There are two paths from $s$ to the output $f$: one through $x_1$ and one through $x_2$. When a variable affects the output through multiple routes, the total derivative is the sum of each route's contribution. This is the multivariate chain rule.

Since the splitter just copies ($x_1 = x_2 = s$, so the local partial derivative is 1 for each output), the gradient on $s$ is simply the sum of the two incoming gradients. More generally, if $s$ fans out to $k$ copies, add all $k$ incoming gradients.
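The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper `splitter_backward` and the downstream function `f` are hypothetical names chosen for this example, and the finite-difference check is only there to confirm that summing the two incoming gradients matches the derivative of $f(s, s)$ with respect to $s$.

```python
def splitter_backward(upstream_grads):
    """Backward rule for a splitter: sum the gradients arriving on its output wires."""
    return sum(upstream_grads)


def f(x1, x2):
    # Hypothetical downstream function, chosen only for illustration.
    return x1 * x1 + 3.0 * x2


s = 2.0
x1 = x2 = s                    # forward pass: the splitter copies s

grad_x1 = 2.0 * x1             # df/dx1, arriving on the first output wire
grad_x2 = 3.0                  # df/dx2, arriving on the second output wire
grad_s = splitter_backward([grad_x1, grad_x2])

# Independent check: central finite difference of g(s) = f(s, s).
eps = 1e-6
numeric = (f(s + eps, s + eps) - f(s - eps, s - eps)) / (2 * eps)

print(grad_s, numeric)
```

With these assumed values the two incoming gradients are 4 and 3, so the splitter places 7 on $s$, and the numerical estimate should agree up to rounding.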

This sum rule explains why a parameter that appears many times in a model receives a gradient accumulated over all of its uses: each use contributes one term to the sum.

Formal View

Lemma 15.2 — Splitter Node Lemma

Let a splitter take input $s$ and produce copies $x_1, x_2$. Then:

$$
\frac{\partial f(s, \mathbf{t})}{\partial s}(s_0, \mathbf{t}_0) = \frac{\partial f(x_1, \mathbf{t})}{\partial x_1}([x_1]_0, \mathbf{t}_0) + \frac{\partial f(x_2, \mathbf{t})}{\partial x_2}([x_2]_0, \mathbf{t}_0)
$$

There are two paths from $s$ to the output, so the two gradient contributions are summed. The splitter's local Jacobian is $[1, 1]^\top$, so the chain rule gives this sum.
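As a quick sanity check of the lemma, consider a hypothetical downstream function $f(x_1, x_2) = x_1 x_2$ with no other inputs $\mathbf{t}$ (this choice is an assumption made only for illustration). Evaluating at $x_1 = x_2 = s_0$:

$$
\frac{\partial f}{\partial x_1} = x_2 = s_0, \qquad \frac{\partial f}{\partial x_2} = x_1 = s_0, \qquad \frac{\partial f}{\partial s} = s_0 + s_0 = 2 s_0 .
$$

This agrees with differentiating $f(s, s) = s^2$ directly, which also gives $2 s_0$.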

Why This Matters

The sum rule explains why shared parameters (for example, tied weights, or a weight matrix that is applied at several places in the network) accumulate gradients from all their uses.

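Shared weights in RNNs: the same matrix is used at each timestep, so its gradient is the sum over all steps
Residual (skip) connections in ResNets: the input is split between the residual branch and the identity path, and the gradients from both paths add
Attention mechanisms: the same query/key/value projection matrices are applied at every position, so their gradients sum over positions
Gradient accumulation for large effective batch sizes relies on the same additive structure: gradients from successive micro-batches are summed into the shared parameters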
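The first item in the list above can be illustrated with a deliberately tiny scalar "RNN". This is a sketch under stated assumptions: the recurrence $h_t = w \, h_{t-1}$ with $h_0 = x$, the output $h_T$, and the function names `forward` and `grad_wrt_shared_weight` are all hypothetical choices for this example. Each timestep uses its own copy of the shared weight $w$, and the backward pass adds one contribution per copy, exactly as the splitter rule prescribes.

```python
def forward(w, x, steps):
    """Run h_t = w * h_{t-1} from h_0 = x and return every hidden state."""
    hs = [x]
    for _ in range(steps):
        hs.append(w * hs[-1])
    return hs


def grad_wrt_shared_weight(w, x, steps):
    """Backpropagate d h_T / d w, summing one contribution per use of w."""
    hs = forward(w, x, steps)
    grad_h = 1.0      # d h_T / d h_T
    grad_w = 0.0      # the splitter on w accumulates here
    for t in range(steps, 0, -1):
        grad_w += grad_h * hs[t - 1]   # contribution from the copy of w used at step t
        grad_h *= w                    # pass the gradient back to h_{t-1}
    return grad_w


w, x, steps = 2.0, 1.0, 3
backprop = grad_wrt_shared_weight(w, x, steps)
closed_form = steps * w ** (steps - 1) * x   # derivative of w**steps * x
print(backprop, closed_form)
```

For this toy recurrence the output is $w^T x$, so the accumulated gradient can be compared against the closed-form derivative $T w^{T-1} x$; with the assumed values both come out to 12.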

Learning Resources

What is backpropagation really doing?

3Blue1Brown

Illustrates gradient accumulation when multiple paths connect a parameter to the output.

14 min

Backpropagation calculus | Deep Learning chapter 4

3Blue1Brown

Derives the sum-of-paths rule from the chain rule explicitly.

11 min

Quiz

Question 1

A splitter sends $t_1$ to nodes $A$ and $B$. Node $A$ sends back gradient 3; node $B$ sends back 7. What is the gradient on $t_1$?

Question 2

Why does the splitter add (not multiply) incoming gradients?

Question 3

A splitter with $k$ outputs must collect all $k$ upstream gradients before it can send a gradient backward.

Question 4

A residual block computes $y = F(x) + x$. Using the splitter rule, $\frac{\partial y}{\partial x}$ is:

Question 5

If $t_1$ feeds a single downstream node (no splitter), the function node rule applies directly to $t_1$.

Common Mistakes

Swapping the rules: adding at function nodes (should multiply) and multiplying at splitters (should add).
Forgetting the splitter must wait for ALL downstream gradients before computing its sum.
Taking the average of incoming gradients rather than the sum.
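The second and third mistakes can be made concrete with a small sketch. The `Splitter` class below is hypothetical (its name, `fan_out` parameter, and `receive` method are assumptions for this example, not part of any library): it refuses to emit a gradient until every output wire has reported, and then returns the sum rather than the average.

```python
class Splitter:
    """Hypothetical splitter node that fans one value out to `fan_out` consumers."""

    def __init__(self, fan_out):
        self.fan_out = fan_out
        self.received = []

    def receive(self, grad):
        """Record one downstream gradient; return the input gradient once all have arrived."""
        if len(self.received) >= self.fan_out:
            raise RuntimeError("more gradients than output wires")
        self.received.append(grad)
        if len(self.received) < self.fan_out:
            return None            # still waiting on at least one output wire
        return sum(self.received)  # sum, not average and not product


splitter = Splitter(fan_out=2)
first = splitter.receive(1.5)    # only one of two gradients so far
second = splitter.receive(2.5)   # both have arrived
print(first, second)
```

In this sketch the first call returns `None` because the node is still waiting, and the second call returns the sum of the two values. Real autodiff systems typically get the same effect by processing nodes in reverse topological order and accumulating into a gradient buffer.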