Experiment exp_pebble_game: Pebble Game Optimizer for Sparse Parity¶
Date: 2026-03-06
Status: SUCCESS
Answers: Can execution reordering (pebble game) reduce energy cost without changing the learning algorithm (SGD)?
Hypothesis¶
By modeling one training step as a pebble game on a computation DAG and optimizing the topological execution order, we can reduce total energy cost compared to the standard forward-then-backward ordering. Energy is estimated using a tiered memory model: register (5 pJ), L1 (20 pJ), L2 (100 pJ), HBM (640 pJ) per float access.
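A quick sketch of this cost model (per-access costs taken from the numbers above; `access_cost_pj` is an illustrative helper, not the experiment's actual implementation):

```python
# Per-float access energy in picojoules for each memory tier,
# as stated in the hypothesis.
TIER_COST_PJ = {"register": 5, "L1": 20, "L2": 100, "HBM": 640}

def access_cost_pj(n_floats: int, tier: str) -> int:
    """Energy (pJ) to touch every float of a tensor served from `tier`."""
    return n_floats * TIER_COST_PJ[tier]

# Example: one full read of W1 (20 x 1000 = 20,000 floats) resident
# in L2 costs 2,000,000 pJ = 2 uJ.
w1_read_uj = access_cost_pj(20_000, "L2") / 1e6
```

Under this model a single pass over W1 already accounts for 2 uJ, which is why W1 placement dominates the totals below.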
Config¶
| Parameter | Value |
|---|---|
| n_bits | 20 |
| k_sparse | 3 |
| hidden | 1000 |
| lr | 0.1 |
| wd | 0.01 |
| max_epochs | 30 |
| n_train | 500 |
| n_test | 200 |
| seed | 42 |
| orderings sampled | 5,758 unique |
| L1 capacity | 16,384 floats (64 KB) |
| L2 capacity | 65,536 floats (256 KB) |
Results¶
| Strategy | Energy (uJ) | vs Standard | Accuracy | ARD |
|---|---|---|---|---|
| Standard | 11.00 | baseline | 98% | 45,275 |
| Fused | 10.84 | -1.5% | 57% (BROKEN) | 42,248 |
| Per-layer | 10.84 | -1.5% | 57% (BROKEN) | 43,329 |
| Optimal pebble | 10.76 | -2.2% | 98% | 42,034 |
Energy distribution over 5,758 sampled orderings:
| Statistic | Energy (uJ) |
|---|---|
| Min | 10.76 |
| Max | 11.08 |
| Mean | 10.92 |
| Median | 10.92 |
| Std | 0.10 |
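A set of unique orderings like the one sampled above can be generated by repeatedly drawing random topological orders of the op DAG. A hedged sketch (the 4-op toy DAG is illustrative; the experiment's actual sampler may differ):

```python
import random

def random_topo_order(deps, rng):
    """Sample one topological order of a DAG: repeatedly schedule a
    uniformly random op whose dependencies have all already run.

    `deps` maps op name -> set of op names it depends on.
    """
    remaining = {op: set(d) for op, d in deps.items()}
    order = []
    while remaining:
        ready = sorted(op for op, d in remaining.items() if not d)
        op = rng.choice(ready)
        order.append(op)
        del remaining[op]
        for d in remaining.values():
            d.discard(op)
    return order

# Toy DAG: two independent backward ops between forward and update,
# so exactly two valid orders exist.
deps = {"fwd": set(), "bwd_a": {"fwd"}, "bwd_b": {"fwd"},
        "upd": {"bwd_a", "bwd_b"}}
unique = {tuple(random_topo_order(deps, random.Random(s))) for s in range(50)}
```

Deduplicating the sampled tuples, as in the last line, is how a count like "5,758 unique" would be obtained.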
Analysis¶
What worked¶
- Optimal pebbling reduces energy by 2.2% while preserving accuracy: the best ordering interleaves small-tensor updates (b2, b1) between backward ops, improving cache residency for medium-sized tensors (W2, dW2, h)
- Optimal ordering delays large-tensor materialization: dW1 (20,000 floats) and upd_W1 are pushed to the end, keeping the working set smaller during the middle operations
- The optimal order also achieves the best ARD (42,034), validating that pebble game energy and ARD are correlated
What did not work¶
- Fused and per-layer orderings destroy training accuracy (57% = random chance): these orderings update W2 before computing dh = W2^T @ dy, so the backward pass uses already-updated weights. This is a write-after-read (anti-dependency) hazard on mutable parameters that the basic DAG dependency model fails to capture
- The improvement window is very narrow (3% between best and worst) because W1 (20,000 floats) always lands in L2 (capacity 65,536) regardless of ordering, and W1 accesses dominate total energy
Key insight: DAG dependencies are insufficient for mutable state¶
The pebble game DAG captures data-flow dependencies (which tensors must exist before an op runs) but does not capture write-after-read hazards on shared mutable tensors. The upd_W2 operation modifies W2 in place, but bwd_dh reads W2 for backpropagation. The DAG shows no dependency between them because they are on different branches. A correct model must add "anti-dependencies": if op A reads a tensor that op B writes (and A needs the pre-write value), A must precede B.
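A minimal sketch of that rule, deriving the missing anti-dependency edges from per-op read/write sets (the `(name, reads, writes)` encoding is an assumed representation, not the experiment's actual data structure):

```python
def anti_dependency_edges(ops):
    """Given ops as (name, reads, writes) tuples in program order,
    emit (A, B) edges meaning A must precede B: any op A that reads
    a tensor before op B overwrites it needs the pre-write value."""
    edges = set()
    for i, (writer, _, writes) in enumerate(ops):
        for tensor in writes:
            for reader, reads, _ in ops[:i]:  # earlier readers see the pre-write value
                if tensor in reads:
                    edges.add((reader, writer))
    return edges

# The hazard from this experiment: upd_W2 overwrites W2 in place,
# but bwd_dh must read the original W2 first.
ops = [
    ("bwd_dh", {"W2", "dy"}, {"dh"}),
    ("upd_W2", {"W2", "dW2"}, {"W2"}),
]
edges = anti_dependency_edges(ops)  # forces bwd_dh before upd_W2
```

Adding these edges to the DAG before sampling would make invalid orderings like fused/per-layer unreachable.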
Energy breakdown by tier¶
| Tier | Standard | Optimal | Notes |
|---|---|---|---|
| Register | 0.00 uJ (0.0%) | 0.00 uJ (0.0%) | Only scalar tensors (loss, dy, b2) |
| L1 | 0.30 uJ (2.7%) | 0.36 uJ (3.3%) | h, h_pre, dh, dW2 when recently accessed |
| L2 | 10.70 uJ (97.3%) | 10.40 uJ (96.7%) | W1 (20K floats) dominates |
| HBM | 0.00 uJ (0.0%) | 0.00 uJ (0.0%) | Total working set fits in L2 |
The optimal ordering shifts ~0.3 uJ from L2 to L1 by keeping medium-sized tensors (1,000 floats each) more recently accessed when they are needed.
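This recency effect can be modeled with an LRU stack distance: a tensor is served from the smallest tier whose capacity covers the number of distinct floats touched since its last access. A hedged sketch (an assumed simplification of the experiment's cache model), using the L1/L2 capacities from the Config table:

```python
def tier_for(reuse_distance_floats, l1_cap=16_384, l2_cap=65_536):
    """Tier a tensor is served from, given its LRU stack distance:
    the number of distinct floats touched since its last access."""
    if reuse_distance_floats <= l1_cap:
        return "L1"
    if reuse_distance_floats <= l2_cap:
        return "L2"
    return "HBM"

# A 1,000-float tensor re-read after touching only a few other
# hidden-size tensors is still L1-resident ...
small_gap = tier_for(3 * 1_000)
# ... but after a full pass over W1 (20,000 floats) it has aged to L2.
after_w1 = tier_for(20_000)
```

Interleaving small updates between backward ops keeps the stack distance of the 1,000-float tensors under the L1 capacity, which is where the ~0.3 uJ saving comes from.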
Optimal ordering (the pebble schedule)¶
1. fwd_linear1 (read W1, x, b1 -> write h_pre)
2. fwd_relu (read h_pre -> write h)
3. fwd_linear2 (read W2, h, b2 -> write y_hat)
4. fwd_loss (read y_hat, y -> write loss)
5. bwd_loss (read loss, y_hat, y -> write dy)
6. bwd_dW2 (read dy, h -> write dW2)
7. bwd_db2 (read dy -> write db2)
8. upd_b2 (read b2, db2 -> write b2_new) <-- small update early
9. bwd_dh (read W2, dy -> write dh) <-- W2 still original
10. bwd_relu (read dh, h_pre -> write dh_pre)
11. upd_W2 (read W2, dW2 -> write W2_new) <-- W2 update after bwd_dh
12. bwd_db1 (read dh_pre -> write db1)
13. upd_b1 (read b1, db1 -> write b1_new)
14. bwd_dW1 (read dh_pre, x -> write dW1) <-- large tensor last
15. upd_W1 (read W1, dW1 -> write W1_new) <-- largest op last
The key moves vs standard ordering: (a) upd_b2 is pulled forward to step 8, (b) upd_W2 is delayed until after bwd_dh uses W2, (c) bwd_dW1 and upd_W1 (the two 20K-float operations) are pushed to the very end.
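The effect of move (c) can be checked by tracking live floats across the schedule. A hedged sketch (liveness here is first-appearance-to-last-read, a simplification; sizes follow from the config: W1/dW1/W1_new are 20 x 1000 floats, hidden-size tensors are 1,000, the output head is scalar):

```python
SIZES = {
    "W1": 20_000, "dW1": 20_000, "W1_new": 20_000, "x": 20,
    "b1": 1_000, "db1": 1_000, "b1_new": 1_000,
    "h_pre": 1_000, "h": 1_000, "dh": 1_000, "dh_pre": 1_000,
    "W2": 1_000, "dW2": 1_000, "W2_new": 1_000,
    "b2": 1, "db2": 1, "b2_new": 1, "y_hat": 1, "y": 1, "loss": 1, "dy": 1,
}

# The 15-step optimal schedule above, as (reads, writes) pairs.
OPTIMAL = [
    ({"W1", "x", "b1"}, {"h_pre"}),
    ({"h_pre"}, {"h"}),
    ({"W2", "h", "b2"}, {"y_hat"}),
    ({"y_hat", "y"}, {"loss"}),
    ({"loss", "y_hat", "y"}, {"dy"}),
    ({"dy", "h"}, {"dW2"}),
    ({"dy"}, {"db2"}),
    ({"b2", "db2"}, {"b2_new"}),
    ({"W2", "dy"}, {"dh"}),
    ({"dh", "h_pre"}, {"dh_pre"}),
    ({"W2", "dW2"}, {"W2_new"}),
    ({"dh_pre"}, {"db1"}),
    ({"b1", "db1"}, {"b1_new"}),
    ({"dh_pre", "x"}, {"dW1"}),
    ({"W1", "dW1"}, {"W1_new"}),
]

def peak_working_set(schedule):
    """Peak number of simultaneously live floats, treating a tensor
    as live from its first appearance to its last read."""
    last_read = {}
    for step, (reads, _) in enumerate(schedule):
        for t in reads:
            last_read[t] = step
    live, peak = set(), 0
    for step, (reads, writes) in enumerate(schedule):
        live |= reads | writes
        peak = max(peak, sum(SIZES[t] for t in live))
        live = {t for t in live if last_read.get(t, -1) > step}
    return peak

peak = peak_working_set(OPTIMAL)  # 60,000 floats: W1 + dW1 + W1_new at step 15
```

Under this liveness model the peak (60,000 floats) lands at the final update and stays under the L2 capacity of 65,536 floats, consistent with the zero-HBM row in the tier breakdown; pushing dW1 to the end is what keeps the mid-schedule working set small.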
Open Questions¶
- With larger models (deeper networks), does the improvement from pebble game optimization grow? The working set may exceed L2, putting some tensors in HBM where the cost difference is 6.4x
- Can anti-dependencies (write-after-read hazards on mutable params) be encoded into the DAG automatically to prevent invalid orderings like fused/per-layer?
- Would activation checkpointing (recomputation) trade energy for memory and open up more optimization opportunities?
- For batch training (multiple samples), the pebble game becomes more complex -- does the optimal ordering change?
Files¶
- Experiment: src/sparse_parity/experiments/exp_pebble_game.py
- Results: results/exp_pebble_game/results.json