15.36 min read

The Forward Pass

The forward pass is simple: plug in $\mathbf{t}_0$, follow data left to right, compute $f_0 := f(\mathbf{t}_0)$.

But there is a crucial additional step: at each function node $x$, store the input values that arrived. If node $x$ received values $s_{1,0}, \ldots, s_{k,0}$, cache those as $\mathbf{s}_0$.

Why? The backward pass needs to compute local derivatives like $\frac{\partial x(\mathbf{s})}{\partial s_i}(\mathbf{s}_0)$. These depend on the actual input values at $\mathbf{t}_0$. Without caching them, you would have to re-run part of the forward pass mid-backward-pass.

Think of it like a long calculation on paper: circle intermediate results as you go, because you might need them later.
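
To make the caching concrete, here is a minimal Python sketch of a single function node. The class `FunctionNode`, its methods, and the example function $x(s_1, s_2) = s_1 \sin(s_2)$ are hypothetical, chosen only for illustration; the one idea taken from the text is that the node stores $\mathbf{s}_0$ when it is evaluated.

```python
import math


class FunctionNode:
    """A function node that caches its inputs during the forward pass.

    `fn` computes the node's value; `partials` holds one function per input,
    each returning a local partial derivative. Names are illustrative only.
    """

    def __init__(self, fn, partials):
        self.fn = fn
        self.partials = partials
        self.cached_inputs = None  # holds s_0 once forward() has run

    def forward(self, *inputs):
        self.cached_inputs = inputs   # store s_0 = (s_{1,0}, ..., s_{k,0})
        return self.fn(*inputs)       # x_0 = x(s_0)

    def local_derivatives(self):
        # Evaluate every partial derivative at the cached point s_0.
        if self.cached_inputs is None:
            raise RuntimeError("forward() must run before local_derivatives()")
        return [p(*self.cached_inputs) for p in self.partials]


# Example node: x(s1, s2) = s1 * sin(s2)
node = FunctionNode(
    fn=lambda s1, s2: s1 * math.sin(s2),
    partials=[
        lambda s1, s2: math.sin(s2),       # dx/ds1
        lambda s1, s2: s1 * math.cos(s2),  # dx/ds2
    ],
)

x0 = node.forward(2.0, 0.0)
grads = node.local_derivatives()
```

In this run the output $x_0$ is zero for any value of $s_1$, so $x_0$ alone cannot tell you that the second local derivative is $s_{1,0} \cos(s_{2,0}) = 2$. The cached inputs can.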

Formal View

Algorithm 15.1 — Forward Pass

1. Initialize input wires to $\mathbf{t}_0$.
2. Process nodes left to right. For each function node $x$: compute $x_0 = x(\mathbf{s}_0)$ and store $\mathbf{s}_0$ at that node.
3. For splitter nodes: copy input to outputs (no storage needed).
4. The final output wire holds $f_0 = f(\mathbf{t}_0)$.
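
The four steps can be sketched in Python roughly as follows. This is one possible rendering under stated assumptions, not a reference implementation: the dictionary-based node format, the wire names, and the function `forward_pass` are all invented for this example. It assumes the nodes are already listed in left-to-right order.

```python
import math


def forward_pass(nodes, t0):
    """Run the forward pass over `nodes`, which are listed left to right.

    Hypothetical node format used by this sketch:
      {"kind": "function", "fn": callable, "in": [wires], "out": wire}
      {"kind": "splitter", "in": wire, "out": [wires]}
    Returns all wire values and the per-node cache of inputs s_0.
    """
    wires = dict(t0)   # step 1: initialize input wires to t_0
    cache = {}         # node index -> stored inputs s_0
    for i, node in enumerate(nodes):  # step 2: process left to right
        if node["kind"] == "function":
            s0 = tuple(wires[w] for w in node["in"])
            cache[i] = s0                         # store s_0 at this node
            wires[node["out"]] = node["fn"](*s0)  # x_0 = x(s_0)
        else:
            # step 3: a splitter copies its input and stores nothing
            for w in node["out"]:
                wires[w] = wires[node["in"]]
    return wires, cache


# f(t1, t2) = t1 * t2 + sin(t1). Because t1 feeds two nodes, it goes
# through a splitter first.
nodes = [
    {"kind": "splitter", "in": "t1", "out": ["t1a", "t1b"]},
    {"kind": "function", "fn": lambda a, b: a * b, "in": ["t1a", "t2"], "out": "p"},
    {"kind": "function", "fn": math.sin, "in": ["t1b"], "out": "q"},
    {"kind": "function", "fn": lambda a, b: a + b, "in": ["p", "q"], "out": "f"},
]

wires, cache = forward_pass(nodes, {"t1": 2.0, "t2": 3.0})
f0 = wires["f"]  # step 4: the final output wire holds f_0
```

After the call, `cache` has an entry for each of the three function nodes and none for the splitter, matching steps 2 and 3.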

Why This Matters

The forward pass costs about the same as a single evaluation of $f$. The extra cost is the memory for the stored intermediate values, which is what makes efficient backpropagation possible.

Neural network inference (forward pass only — no storage needed)
Training requires storing activations for the backward pass
Gradient checkpointing: store fewer activations and recompute selectively to save memory
Stored activations are also useful for debugging training pathologies
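
The memory trade-off behind gradient checkpointing can be illustrated with a toy chain of one-input functions. This sketch is an assumption-laden simplification and does not reflect how any particular framework implements checkpointing; the helper names `forward_with_checkpoints` and `recover_input` and the parameter `every` are hypothetical.

```python
import math


def forward_with_checkpoints(layers, t0, every):
    """Evaluate a chain of one-input functions, storing only the input of
    every `every`-th layer instead of the input of every layer."""
    checkpoints = {}
    value = t0
    for i, layer in enumerate(layers):
        if i % every == 0:
            checkpoints[i] = value  # keep this layer's input
        value = layer(value)
    return value, checkpoints


def recover_input(layers, checkpoints, every, i):
    """Recompute the input of layer `i` from the nearest earlier checkpoint."""
    start = (i // every) * every
    value = checkpoints[start]
    for layer in layers[start:i]:  # re-run only the short segment
        value = layer(value)
    return value


layers = [math.sin, math.exp, math.tanh, math.cos, math.sqrt, math.atan]
f0, checkpoints = forward_with_checkpoints(layers, 0.5, every=3)

# The input of layer 4 was not stored; rebuild it from the checkpoint at layer 3.
s0_layer4 = recover_input(layers, checkpoints, every=3, i=4)
```

With `every=3`, only two of the six layer inputs are kept. Recovering any other input means re-running at most two layers.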

Learning Resources

Forward and backward passes explained

deeplizard

Clear explanation of forward and backward passes with concrete neural network examples.

10 min

The spelled-out intro to neural networks and backpropagation: building micrograd

Andrej Karpathy

Stores forward-pass data as attributes on value objects, exactly as the theory prescribes.

144 min

Quiz

Question 1

What does the forward pass store at each function node?

Question 2

For node $x$ computing $x = s_1 \cdot s_2$, why store $s_{1,0}$ and $s_{2,0}$ (not just $x_0$)?

Question 3

True or false: storing intermediate values during the forward pass is optional, because they can always be recomputed on demand.

Question 4

The forward pass processes nodes in which order?

Common Mistakes

Storing only the output $x_0$ and not the inputs $\mathbf{s}_0$: you need the inputs for local gradient computation.
Running the forward pass right-to-left (that is the backward direction).
Thinking you store symbolic expressions rather than numerical values.