Experiment C: Per-Layer Forward-Backward on 20-bit Sparse Parity¶
Date: 2026-03-04 Status: COMPLETE Key question: Does per-layer forward-backward converge on 20-bit (k=3)? What ARD improvement vs standard backprop?
Setup¶
| Parameter | Value |
|---|---|
| n_bits | 20 |
| k_sparse | 3 |
| hidden | 1000 |
| LR | 0.1 |
| WD | 0.01 |
| n_train | 500 |
| n_test | 200 |
| Training | Single-sample SGD |
| Max epochs | 200 |
| Seed | 42 |
Secret indices: [0, 3, 8]
Results¶
| Method | Best Test Acc | Solve Epoch (>90%) | Wall Time | Weighted ARD |
|---|---|---|---|---|
| Standard backprop | 99.5% | 6 | 19.1s | 35,920 |
| Per-layer fwd-bwd | 99.5% | 6 | 19.2s | 34,564 |
ARD improvement: 3.8% (per-layer vs standard)
Findings¶
-
Per-layer converges on 20-bit. Both methods reach 99.5% test accuracy by epoch 9, with >90% at epoch 6. The convergence trajectories are identical at every epoch.
-
Identical convergence dynamics. Both methods produce the same train/test accuracy curves and weight movement norms at every epoch. With single-sample SGD, the per-layer reordering only affects parameter update order within each sample, and the mathematical effect is nearly identical at moderate learning rates.
-
ARD improvement is 3.8%, down from 9.1% on 3-bit. The per-layer method uses fewer reads (15 vs 24) and writes (13 vs 18) per training step. The improvement shrinks at scale because:
- The W1 buffer (hidden x n_bits = 20,000 floats) dominates total memory traffic
- Per-layer saves intermediate buffers (dW2, db2, dh, dh_pre, dout) by fusing backward + update
-
But these savings are a smaller fraction of total traffic when W1 is large
-
Why the ARD gap shrinks. At n_bits=3 (hidden=1000), W1 is 3,000 floats. Intermediate buffers (~4,000 floats for dh, dh_pre, dW2, etc.) are comparable in size to W1, so eliminating them matters. At n_bits=20, W1 is 20,000 floats, dwarfing the intermediate savings.
Implications¶
- Per-layer forward-backward is a safe optimization: it does not hurt convergence on the 20-bit task
- The 3.8% ARD improvement is real but modest, and the benefit diminishes as model size grows
- For larger models, bigger ARD wins require approaches that reduce W1 traffic (e.g., tiling, gradient accumulation in registers, or avoiding full W1 reads)
Files¶
- Experiment:
src/sparse_parity/experiments/exp_c_perlayer_20bit.py - Results:
results/exp_c_perlayer_20bit/results.json