Experiment exp_pebble_game: Pebble Game Optimizer for Sparse Parity¶
Date: 2026-03-06
Status: SUCCESS
Answers: Can execution reordering (pebble game) reduce energy cost without changing the learning algorithm (SGD)?
Hypothesis¶
By modeling one training step as a pebble game on a computation DAG and optimizing the topological execution order, we can reduce total energy cost compared to the standard forward-then-backward ordering. Energy is estimated using a tiered memory model: register (5 pJ), L1 (20 pJ), L2 (100 pJ), HBM (640 pJ) per float access.
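A quick sketch of this cost model (per-access costs taken from the numbers above; `access_cost_pj` is an illustrative helper, not the experiment's actual implementation):

```python
# Per-float access energy in picojoules for each memory tier,
# as stated in the hypothesis.
TIER_COST_PJ = {"register": 5, "L1": 20, "L2": 100, "HBM": 640}

def access_cost_pj(n_floats: int, tier: str) -> int:
    """Energy (pJ) to touch every float of a tensor served from `tier`."""
    return n_floats * TIER_COST_PJ[tier]

# Example: one full read of W1 (20 x 1000 = 20,000 floats) resident
# in L2 costs 2,000,000 pJ = 2 uJ.
w1_read_uj = access_cost_pj(20_000, "L2") / 1e6
```

Under this model a single pass over W1 already accounts for 2 uJ, which is why W1 placement dominates the totals below.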
Config¶
| Parameter | Value |
|---|---|
| n_bits | 20 |
| k_sparse | 3 |
| hidden | 1000 |
| lr | 0.1 |
| wd | 0.01 |
| max_epochs | 30 |
| n_train | 500 |
| n_test | 200 |
| seed | 42 |
| orderings sampled | 5,758 unique |
| L1 capacity | 16,384 floats (64 KB) |
| L2 capacity | 65,536 floats (256 KB) |
Results¶
| Strategy | Energy (uJ) | vs Standard | Accuracy | ARD |
|---|---|---|---|---|
| Standard | 11.00 | baseline | 98% | 45,275 |
| Fused | 10.84 | -1.5% | 57% (BROKEN) | 42,248 |
| Per-layer | 10.84 | -1.5% | 57% (BROKEN) | 43,329 |
| Optimal pebble | 10.76 | -2.2% | 98% | 42,034 |
Energy distribution over 5,758 sampled orderings:
| Statistic | Energy (uJ) |
|---|---|
| Min | 10.76 |
| Max | 11.08 |
| Mean | 10.92 |
| Median | 10.92 |
| Std | 0.10 |
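A set of unique orderings like the one sampled above can be generated by repeatedly drawing random topological orders of the op DAG. A hedged sketch (the 4-op toy DAG is illustrative; the experiment's actual sampler may differ):

```python
import random

def random_topo_order(deps, rng):
    """Sample one topological order of a DAG: repeatedly schedule a
    uniformly random op whose dependencies have all already run.

    `deps` maps op name -> set of op names it depends on.
    """
    remaining = {op: set(d) for op, d in deps.items()}
    order = []
    while remaining:
        ready = sorted(op for op, d in remaining.items() if not d)
        op = rng.choice(ready)
        order.append(op)
        del remaining[op]
        for d in remaining.values():
            d.discard(op)
    return order

# Toy DAG: two independent backward ops between forward and update,
# so exactly two valid orders exist.
deps = {"fwd": set(), "bwd_a": {"fwd"}, "bwd_b": {"fwd"},
        "upd": {"bwd_a", "bwd_b"}}
unique = {tuple(random_topo_order(deps, random.Random(s))) for s in range(50)}
```

Deduplicating the sampled tuples, as in the last line, is how a count like "5,758 unique" would be obtained.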
Analysis¶
What worked¶
- Optimal pebbling reduces energy by 2.2% while preserving accuracy: the best ordering interleaves small-tensor updates (b2, b1) between backward ops, improving cache residency for medium-sized tensors (W2, dW2, h)
- Optimal ordering delays large-tensor materialization: dW1 (20,000 floats) and upd_W1 are pushed to the end, keeping the working set smaller during the middle operations
- The optimal order also achieves the best ARD (42,034), validating that pebble game energy and ARD are correlated
What did not work¶
- Fused and per-layer orderings destroy training accuracy (57% = random chance): these orderings update W2 before computing dh = W2^T @ dy, so the backward pass uses already-updated weights. This is a write-after-read (anti-dependency) hazard on mutable parameters that the basic DAG dependency model fails to capture
- The improvement window is very narrow (3% between best and worst) because W1 (20,000 floats) always lands in L2 (capacity 65,536) regardless of ordering, and W1 accesses dominate total energy
Key insight: DAG dependencies are insufficient for mutable state¶
The pebble game DAG captures data-flow dependencies (which tensors must exist before an op runs) but does not capture write-after-read hazards on shared mutable tensors. The upd_W2 operation modifies W2 in place, but bwd_dh reads W2 for backpropagation. The DAG shows no dependency between them because they are on different branches. A correct model must add "anti-dependencies": if op A reads a tensor that op B writes (and A needs the pre-write value), A must precede B.
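A minimal sketch of that rule, deriving the missing anti-dependency edges from per-op read/write sets (the `(name, reads, writes)` encoding is an assumed representation, not the experiment's actual data structure):

```python
def anti_dependency_edges(ops):
    """Given ops as (name, reads, writes) tuples in program order,
    emit (A, B) edges meaning A must precede B: any op A that reads
    a tensor before op B overwrites it needs the pre-write value."""
    edges = set()
    for i, (writer, _, writes) in enumerate(ops):
        for tensor in writes:
            for reader, reads, _ in ops[:i]:  # earlier readers see the pre-write value
                if tensor in reads:
                    edges.add((reader, writer))
    return edges

# The hazard from this experiment: upd_W2 overwrites W2 in place,
# but bwd_dh must read the original W2 first.
ops = [
    ("bwd_dh", {"W2", "dy"}, {"dh"}),
    ("upd_W2", {"W2", "dW2"}, {"W2"}),
]
edges = anti_dependency_edges(ops)  # forces bwd_dh before upd_W2
```

Adding these edges to the DAG before sampling would make invalid orderings like fused/per-layer unreachable.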
Energy breakdown by tier¶
| Tier | Standard | Optimal | Notes |
|---|---|---|---|
| Register | 0.00 uJ (0.0%) | 0.00 uJ (0.0%) | Only scalar tensors (loss, dy, b2) |
| L1 | 0.30 uJ (2.7%) | 0.36 uJ (3.3%) | h, h_pre, dh, dW2 when recently accessed |
| L2 | 10.70 uJ (97.3%) | 10.40 uJ (96.7%) | W1 (20K floats) dominates |
| HBM | 0.00 uJ (0.0%) | 0.00 uJ (0.0%) | Total working set fits in L2 |
The optimal ordering shifts ~0.3 uJ from L2 to L1 by keeping medium-sized tensors (1,000 floats each) more recently accessed when they are needed.
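This recency effect can be modeled with an LRU stack distance: a tensor is served from the smallest tier whose capacity covers the number of distinct floats touched since its last access. A hedged sketch (an assumed simplification of the experiment's cache model), using the L1/L2 capacities from the Config table:

```python
def tier_for(reuse_distance_floats, l1_cap=16_384, l2_cap=65_536):
    """Tier a tensor is served from, given its LRU stack distance:
    the number of distinct floats touched since its last access."""
    if reuse_distance_floats <= l1_cap:
        return "L1"
    if reuse_distance_floats <= l2_cap:
        return "L2"
    return "HBM"

# A 1,000-float tensor re-read after touching only a few other
# hidden-size tensors is still L1-resident ...
small_gap = tier_for(3 * 1_000)
# ... but after a full pass over W1 (20,000 floats) it has aged to L2.
after_w1 = tier_for(20_000)
```

Interleaving small updates between backward ops keeps the stack distance of the 1,000-float tensors under the L1 capacity, which is where the ~0.3 uJ saving comes from.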
Optimal ordering (the pebble schedule)¶
1. fwd_linear1 (read W1, x, b1 -> write h_pre)
2. fwd_relu (read h_pre -> write h)
3. fwd_linear2 (read W2, h, b2 -> write y_hat)
4. fwd_loss (read y_hat, y -> write loss)
5. bwd_loss (read loss, y_hat, y -> write dy)
6. bwd_dW2 (read dy, h -> write dW2)
7. bwd_db2 (read dy -> write db2)
8. upd_b2 (read b2, db2 -> write b2_new) <-- small update early
9. bwd_dh (read W2, dy -> write dh) <-- W2 still original
10. bwd_relu (read dh, h_pre -> write dh_pre)
11. upd_W2 (read W2, dW2 -> write W2_new) <-- W2 update after bwd_dh
12. bwd_db1 (read dh_pre -> write db1)
13. upd_b1 (read b1, db1 -> write b1_new)
14. bwd_dW1 (read dh_pre, x -> write dW1) <-- large tensor last
15. upd_W1 (read W1, dW1 -> write W1_new) <-- largest op last
The key moves vs standard ordering: (a) upd_b2 is pulled forward to step 8, (b) upd_W2 is delayed until after bwd_dh uses W2, (c) bwd_dW1 and upd_W1 (the two 20K-float operations) are pushed to the very end.
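The effect of move (c) can be checked by tracking live floats across the schedule. A hedged sketch (liveness here is first-appearance-to-last-read, a simplification; sizes follow from the config: W1/dW1/W1_new are 20 x 1000 floats, hidden-size tensors are 1,000, the output head is scalar):

```python
SIZES = {
    "W1": 20_000, "dW1": 20_000, "W1_new": 20_000, "x": 20,
    "b1": 1_000, "db1": 1_000, "b1_new": 1_000,
    "h_pre": 1_000, "h": 1_000, "dh": 1_000, "dh_pre": 1_000,
    "W2": 1_000, "dW2": 1_000, "W2_new": 1_000,
    "b2": 1, "db2": 1, "b2_new": 1, "y_hat": 1, "y": 1, "loss": 1, "dy": 1,
}

# The 15-step optimal schedule above, as (reads, writes) pairs.
OPTIMAL = [
    ({"W1", "x", "b1"}, {"h_pre"}),
    ({"h_pre"}, {"h"}),
    ({"W2", "h", "b2"}, {"y_hat"}),
    ({"y_hat", "y"}, {"loss"}),
    ({"loss", "y_hat", "y"}, {"dy"}),
    ({"dy", "h"}, {"dW2"}),
    ({"dy"}, {"db2"}),
    ({"b2", "db2"}, {"b2_new"}),
    ({"W2", "dy"}, {"dh"}),
    ({"dh", "h_pre"}, {"dh_pre"}),
    ({"W2", "dW2"}, {"W2_new"}),
    ({"dh_pre"}, {"db1"}),
    ({"b1", "db1"}, {"b1_new"}),
    ({"dh_pre", "x"}, {"dW1"}),
    ({"W1", "dW1"}, {"W1_new"}),
]

def peak_working_set(schedule):
    """Peak number of simultaneously live floats, treating a tensor
    as live from its first appearance to its last read."""
    last_read = {}
    for step, (reads, _) in enumerate(schedule):
        for t in reads:
            last_read[t] = step
    live, peak = set(), 0
    for step, (reads, writes) in enumerate(schedule):
        live |= reads | writes
        peak = max(peak, sum(SIZES[t] for t in live))
        live = {t for t in live if last_read.get(t, -1) > step}
    return peak

peak = peak_working_set(OPTIMAL)  # 60,000 floats: W1 + dW1 + W1_new at step 15
```

Under this liveness model the peak (60,000 floats) lands at the final update and stays under the L2 capacity of 65,536 floats, consistent with the zero-HBM row in the tier breakdown; pushing dW1 to the end is what keeps the mid-schedule working set small.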
Open Questions¶
- With larger models (deeper networks), does the improvement from pebble game optimization grow? The working set may exceed L2, putting some tensors in HBM where the cost difference is 6.4x
- Can anti-dependencies (write-after-read hazards on mutable params) be encoded into the DAG automatically to prevent invalid orderings like fused/per-layer?
- Would activation checkpointing (recomputation) trade energy for memory and open up more optimization opportunities?
- For batch training (multiple samples), the pebble game becomes more complex -- does the optimal ordering change?
Files¶
- Experiment: src/sparse_parity/experiments/exp_pebble_game.py
- Results: results/exp_pebble_game/results.json