
The Forward Pass

The forward pass is simple: plug in $\mathbf{t}_0$, follow the data left to right, and compute $f_0 := f(\mathbf{t}_0)$.

But there is a crucial additional step: at each function node $x$, store the input values that arrived. If node $x$ received values $s_{1,0}, \ldots, s_{k,0}$, cache those as $\mathbf{s}_0$.

Why? The backward pass needs to compute local derivatives like $\frac{\partial x(\mathbf{s})}{\partial s_i}(\mathbf{s}_0)$. These depend on the actual input values that reached the node when the graph was evaluated at $\mathbf{t}_0$. Without caching them, you would have to re-run part of the forward pass in the middle of the backward pass.

Think of it like a long calculation on paper: circle intermediate results as you go, because you might need them later.
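To make the caching step concrete, here is a minimal sketch of a single multiplication node. The class name `MulNode` and its methods are hypothetical names chosen for illustration, not part of any library. It assumes scalar inputs and shows that the local derivatives can only be evaluated because the inputs were stored.

```python
class MulNode:
    """Hypothetical function node x(s1, s2) = s1 * s2 that caches its inputs."""

    def __init__(self):
        self.cached_inputs = None  # will hold s_0 = (s_{1,0}, s_{2,0})

    def forward(self, s1, s2):
        # Store the inputs that arrived, not just the output.
        self.cached_inputs = (s1, s2)
        return s1 * s2

    def local_derivatives(self):
        # dx/ds1 = s2 and dx/ds2 = s1, both evaluated at the cached inputs.
        s1, s2 = self.cached_inputs
        return (s2, s1)


node = MulNode()
x0 = node.forward(3.0, 4.0)       # x_0 = 12.0
grads = node.local_derivatives()  # (4.0, 3.0)
```

Note that the output $x_0 = 12$ alone would not be enough here: many input pairs multiply to 12, and each gives different local derivatives.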

Formal View

Algorithm 15.1 — Forward Pass
1. Initialize the input wires to $\mathbf{t}_0$.
2. Process nodes left to right. For each function node $x$: compute $x_0 = x(\mathbf{s}_0)$ and store $\mathbf{s}_0$ at that node.
3. For splitter nodes: copy the input to the outputs (no storage needed).
4. The final output wire holds $f_0 = f(\mathbf{t}_0)$.
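The steps of the algorithm can be sketched in Python as follows. This is one possible encoding, not a definitive implementation. The names `FunctionNode`, `SplitterNode`, and `forward_pass`, and the representation of wires as dictionary keys, are assumptions made for this example. The node list is assumed to already be in left-to-right (topological) order.

```python
class FunctionNode:
    """Hypothetical function node: applies fn to its input wires."""

    def __init__(self, fn, inputs, output):
        self.fn = fn
        self.inputs = inputs        # names of the input wires
        self.output = output        # name of the output wire
        self.cached_inputs = None   # s_0, filled in by the forward pass


class SplitterNode:
    """Hypothetical splitter node: copies one wire to several wires."""

    def __init__(self, source, outputs):
        self.source = source
        self.outputs = outputs


def forward_pass(nodes, t0, output_wire):
    wires = dict(t0)                        # step 1: initialize input wires
    for node in nodes:                      # step 2: left to right
        if isinstance(node, SplitterNode):  # step 3: copy, nothing stored
            for name in node.outputs:
                wires[name] = wires[node.source]
        else:
            s0 = tuple(wires[name] for name in node.inputs)
            node.cached_inputs = s0         # store s_0 at the node
            wires[node.output] = node.fn(*s0)
    return wires[output_wire]               # step 4: f_0 = f(t_0)


# Example graph for f(t1, t2) = (t1 + t2) * t1, where t1 is used twice.
nodes = [
    SplitterNode("t1", ["t1_a", "t1_b"]),
    FunctionNode(lambda a, b: a + b, ["t1_a", "t2"], "u"),
    FunctionNode(lambda a, b: a * b, ["u", "t1_b"], "f"),
]
f0 = forward_pass(nodes, {"t1": 2.0, "t2": 3.0}, "f")
```

With $\mathbf{t}_0 = (2, 3)$ the addition node caches $(2, 3)$, the multiplication node caches $(5, 2)$, and the output wire holds $f_0 = 10$.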

Why This Matters

The forward pass is essentially free in computation: it is just an evaluation of $f$. The extra cost is the memory for the stored intermediate values, which is the price of efficient backpropagation.

  • Neural network inference (forward pass only — no storage needed)
  • Training requires storing activations for the backward pass
  • Gradient checkpointing: store fewer activations and recompute selectively to save memory
  • Stored activations are also useful for debugging training pathologies
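The gradient checkpointing idea in the list above can be illustrated with a small sketch. It assumes the simplest possible setting, a chain of single-input functions, and uses hypothetical helper names (`forward_with_checkpoints`, `recompute_input`). Real frameworks implement this differently. The point is only the trade: store every `every`-th input, and recompute the rest from the nearest earlier checkpoint when the backward pass asks for them.

```python
def forward_with_checkpoints(fns, t0, every):
    """Run a chain of functions, storing only every `every`-th node input."""
    checkpoints = {}
    value = t0
    for i, fn in enumerate(fns):
        if i % every == 0:
            checkpoints[i] = value  # input that arrived at node i
        value = fn(value)
    return value, checkpoints


def recompute_input(fns, checkpoints, i, every):
    """Recover the input of node i by replaying from the nearest checkpoint."""
    start = (i // every) * every
    value = checkpoints[start]
    for fn in fns[start:i]:
        value = fn(value)
    return value


# Chain of four nodes: add 1, double, subtract 3, square.
fns = [lambda v: v + 1.0, lambda v: v * 2.0, lambda v: v - 3.0, lambda v: v * v]
f0, checkpoints = forward_with_checkpoints(fns, 1.0, every=2)

# Only the inputs of nodes 0 and 2 were stored; the input of node 3
# is recomputed by replaying node 2 from its checkpoint.
s3 = recompute_input(fns, checkpoints, 3, every=2)
```

A larger `every` stores fewer values but replays more nodes per lookup, which is the memory-for-compute trade the bullet describes.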

Quiz

Question 1

What does the forward pass store at each function node?

Question 2

For node $x$ computing $x = s_1 \cdot s_2$, why store $s_{1,0}$ and $s_{2,0}$ (not just $x_0$)?

Question 3

Storing intermediate values during the forward pass is optional — they can always be recomputed on demand.

Question 4

The forward pass processes nodes in which order?

Common Mistakes

  • Storing only the output $x_0$ and not the inputs $\mathbf{s}_0$: you need the inputs for local gradient computation.
  • Running the forward pass right-to-left (that is the backward direction).
  • Thinking you store symbolic expressions rather than numerical values.