The Function Node Rule

Here is the core of backpropagation: how does a function node pass gradient information backward?

Node $x$ computes $x(\mathbf{s})$ from inputs $s_1, \ldots, s_k$. During the forward pass it stores the input values $\mathbf{s}_0$ it saw. During the backward pass, it receives a number along its output wire, the "upstream gradient" $\frac{\partial f(x, \mathbf{t})}{\partial x}(x_0, \mathbf{t}_0)$. Its job: produce a gradient for each input wire $s_i$.

The answer is a product: upstream gradient × local gradient. The local gradient for input $s_i$ is $\frac{\partial x(\mathbf{s})}{\partial s_i}(\mathbf{s}_0)$, computable from the stored $\mathbf{s}_0$. The node multiplies and sends the result backward along wire $s_i$.
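As a minimal sketch of this product, the snippet below models a two-input multiply node in plain Python. The helper names `multiply_forward` and `multiply_backward` and the numbers are hypothetical, chosen only for illustration.

```python
def multiply_forward(s1, s2):
    """Forward pass: compute x = s1 * s2 and keep the inputs for later."""
    stored = (s1, s2)  # this plays the role of the stored s_0
    return s1 * s2, stored


def multiply_backward(upstream, stored):
    """Backward pass: return one gradient per input wire."""
    s1, s2 = stored
    local_s1 = s2  # d(s1 * s2)/d(s1), evaluated at the stored inputs
    local_s2 = s1  # d(s1 * s2)/d(s2), evaluated at the stored inputs
    return upstream * local_s1, upstream * local_s2


x0, stored = multiply_forward(2.0, -1.5)
grad_s1, grad_s2 = multiply_backward(0.5, stored)
print(x0, grad_s1, grad_s2)
```

Note that the backward helper never recomputes the forward value; it only needs the stored inputs and the single number arriving on the output wire.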

This is just the chain rule, applied one node at a time. Each node does a tiny local computation — no node needs to know the structure of the rest of the circuit.
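To illustrate the "one node at a time" idea, here is a small sketch of a two-node circuit, $f = \exp(s_1 \cdot s_2)$, worked backward by hand. The circuit and the input values are assumptions made up for this example.

```python
import math

# Forward pass: each node stores only its own inputs.
s1, s2 = 0.5, 1.2
u = s1 * s2        # multiply node, stores (s1, s2)
f = math.exp(u)    # exp node, stores u

# Backward pass: the output wire starts with df/df = 1.
upstream_f = 1.0

# exp node: uses its stored input u and the number arriving from the right.
grad_u = upstream_f * math.exp(u)

# multiply node: uses (s1, s2) and grad_u; it never inspects the exp node.
grad_s1 = grad_u * s2
grad_s2 = grad_u * s1

# Reference values from differentiating f = exp(s1 * s2) by hand.
expected_s1 = s2 * math.exp(s1 * s2)
expected_s2 = s1 * math.exp(s1 * s2)
print(grad_s1, expected_s1)
print(grad_s2, expected_s2)
```

The multiply node's code would be unchanged if the exp node were swapped for any other downstream function; only the number `grad_u` would differ.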

Formal View

Lemma 15.1 — Function Node Lemma

Let node $x$ have inputs $\mathbf{s}$ and compute $x(\mathbf{s})$. Then:

$$\frac{\partial f(s_i, \mathbf{t})}{\partial s_i}([s_i]_0, \mathbf{t}_0) = \underbrace{\frac{\partial f(x, \mathbf{t})}{\partial x}(x_0, \mathbf{t}_0)}_{\text{upstream gradient}} \cdot \underbrace{\frac{\partial x(\mathbf{s})}{\partial s_i}(\mathbf{s}_0)}_{\text{local gradient}}$$

This follows from one application of the multivariate chain rule. The product has exactly two factors: the upstream gradient (arriving from the right) and the local gradient (computed from the stored $\mathbf{s}_0$).
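One way to build confidence in the lemma is to check it numerically. The sketch below assumes a hypothetical node $x(\mathbf{s}) = s_1 \sin(s_2)$ feeding a hypothetical downstream function $f(x, t) = x^2 t$, and compares the upstream × local product against a central finite difference. All function names and values are invented for this check.

```python
import math


def node_x(s1, s2):
    """The function node under test: x(s) = s1 * sin(s2)."""
    return s1 * math.sin(s2)


def downstream_f(x, t):
    """Everything to the right of the node, as a function of x and another wire t."""
    return x * x * t


s0 = (1.5, 0.7)  # stored inputs of the node
t0 = 2.0         # value on the other wire
x0 = node_x(*s0)

# Upstream gradient: df/dx evaluated at (x0, t0).
upstream = 2.0 * x0 * t0

# Local gradients: dx/ds_i evaluated at the stored s0.
local = (math.sin(s0[1]), s0[0] * math.cos(s0[1]))

# The lemma: gradient on wire s_i = upstream * local_i.
lemma = [upstream * g for g in local]


def central_difference(i, h=1e-6):
    """Numerically differentiate the full circuit with respect to s_i."""
    plus, minus = list(s0), list(s0)
    plus[i] += h
    minus[i] -= h
    f_plus = downstream_f(node_x(*plus), t0)
    f_minus = downstream_f(node_x(*minus), t0)
    return (f_plus - f_minus) / (2 * h)


numeric = [central_difference(i) for i in range(2)]
print(lemma)
print(numeric)
```

The two printed lists should agree to several significant digits; the small remaining gap comes from the finite step size, not from the lemma.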

Why This Matters

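Every differentiable PyTorch operation (matmul, ReLU, softmax, etc.) applies this rule in its backward function: when you call .backward() on a result, autograd walks the recorded graph, and each operation combines the upstream gradient it receives with its own local derivative.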

ReLU backward: local gradient is 1 if input > 0, else 0
Multiply node: gradient to each input = other input × upstream gradient
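Sigmoid backward: local gradient is $\sigma(1 - \sigma)$, where $\sigma$ is the sigmoid's value at the stored input
Matrix multiply backward: for $C = AB$ with upstream gradient $G$, the gradient for $A$ is $G B^\top$ and the gradient for $B$ is $A^\top G$ (upstream times the transpose of the other factor, in the order the shapes require)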
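The four rules above can be sketched in plain Python as follows. This is an illustrative standard-library version, not PyTorch's actual implementation; every function name here is hypothetical, matrices are nested lists, and the ReLU follows the convention stated above of sending 0 when the input is not positive.

```python
import math


def relu_backward(upstream, s):
    """Local gradient is 1 if the stored input is positive, else 0."""
    return upstream * (1.0 if s > 0 else 0.0)


def mul_backward(upstream, s1, s2):
    """Each input receives the other input times the upstream gradient."""
    return upstream * s2, upstream * s1


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_backward(upstream, s):
    """Local gradient is sigma * (1 - sigma) at the stored input."""
    sig = sigmoid(s)
    return upstream * sig * (1.0 - sig)


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_backward(upstream, a, b):
    """For C = A B with upstream G: grad A = G B^T, grad B = A^T G."""
    return matmul(upstream, transpose(b)), matmul(transpose(a), upstream)


A = [[1.0, 2.0], [3.0, 4.0]]            # 2 x 2
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]  # 2 x 3
G = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # upstream gradient, same shape as A B
grad_A, grad_B = matmul_backward(G, A, B)
print(grad_A)
print(grad_B)
```

Each returned gradient has the same shape as the input it belongs to (`grad_A` is 2 x 2, `grad_B` is 2 x 3), which is a quick sanity check on the order of the matrix factors.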

Learning Resources

Backpropagation, step by step

StatQuest with Josh Starmer

Walks through the function node rule with explicit numerical examples.

18 min

The spelled-out intro to neural networks and backpropagation: building micrograd

Andrej Karpathy

Derives the backward rule for add, multiply, tanh, and other operations from scratch.

144 min

Quiz

Question 1

Node $x = s_1 \cdot s_2$ receives upstream gradient $\delta = 5$. Stored: $s_{1,0} = 3$, $s_{2,0} = 4$. What gradient goes on wire $s_1$?

Question 2

For node $x = \sin(s)$, stored $s_0 = \pi/2$, upstream gradient $\delta = 2$. What gradient goes on wire $s$?

Question 3

What two values does a function node multiply together in the backward pass?

Question 4

True or false: a function node with two inputs sends the same gradient to both input wires.

Question 5

Why does the function node need the stored $\mathbf{s}_0$?

Common Mistakes

Multiplying by the forward value $x_0$ instead of by the local derivative $\frac{\partial x}{\partial s_i}$.
Sending the same gradient to all input wires: each wire gets the upstream gradient times its own local partial, and these generally differ.
Confusing the upstream gradient (arriving from the right) with the gradient being sent leftward (what the node produces).
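The first mistake in this list is easy to see numerically. The sketch below assumes a hypothetical tanh node with a made-up upstream gradient of 3, and compares the incorrect product (upstream × forward value) with the correct one (upstream × local derivative) against a finite-difference reference.

```python
import math

s0 = 0.8            # stored input of the tanh node
upstream = 3.0      # number arriving on the output wire
x0 = math.tanh(s0)  # forward value on the output wire

wrong = upstream * x0                 # mistake: multiplies by the forward value
right = upstream * (1.0 - x0 ** 2)    # upstream times d tanh(s)/ds at s0

# Reference: the circuit f(s) = 3 * tanh(s) has df/dx = 3, matching the upstream value.
h = 1e-6
numeric = (upstream * math.tanh(s0 + h) - upstream * math.tanh(s0 - h)) / (2 * h)
print(wrong, right, numeric)
```

Only `right` tracks the finite-difference value; `wrong` is off by a visible margin because the forward value and the local derivative are different quantities.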