Experiment WD_SWEEP: Weight Decay Sweep
Date: 2026-03-04
Status: SUCCESS
Answers: Open Question #5, "Does higher WD (0.1, 1.0) accelerate grokking on 20-bit?"
Hypothesis
If we increase weight decay beyond 0.01, grokking will accelerate because stronger regularization encourages simpler (sparser) solutions faster.
Config
| Parameter | Value |
|---|---|
| n_bits | 20 |
| k_sparse | 3 |
| hidden | 200 |
| lr | 0.1 |
| wd | sweep: 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0 |
| batch_size | 32 |
| max_epochs | 200 |
| n_train | 1000 |
| seeds | 42-46 (5 seeds per WD) |
| method | standard (numpy SGD, hinge loss) |
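
The sweep grid implied by this config can be sketched as follows. This is a minimal illustration, not the contents of exp_wd_sweep.py: the names `CONFIG`, `WD_VALUES`, `SEEDS`, and `sweep_grid` are hypothetical, and only the parameter values come from the table above.

```python
from itertools import product

# Fixed parameters, copied from the Config table.
CONFIG = {
    "n_bits": 20,
    "k_sparse": 3,
    "hidden": 200,
    "lr": 0.1,
    "batch_size": 32,
    "max_epochs": 200,
    "n_train": 1000,
}
WD_VALUES = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0]
SEEDS = list(range(42, 47))  # 42-46 inclusive, 5 seeds per WD


def sweep_grid():
    """Yield one run config per (wd, seed) pair (hypothetical helper)."""
    for wd, seed in product(WD_VALUES, SEEDS):
        yield {**CONFIG, "wd": wd, "seed": seed}


runs = list(sweep_grid())
print(len(runs))  # 7 WD values x 5 seeds = 35 runs
```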
Results
| Metric | Value |
|---|---|
| Best WD | 0.01 (39 avg epochs, 100% success) |
| Runner-up WD | 0.05 (45.8 avg epochs, 100% success) |
| Working range | [0.01, 0.05] only |
| Failure modes | WD<0.01: no convergence in 200 epochs; WD>=0.1: regularization kills learning |
Summary Table
| WD | Avg Epochs | Avg Time (s) | Success Rate |
|---|---|---|---|
| 0.001 | FAIL | 0.314 | 0% |
| 0.01 | 39.0 | 0.108 | 100% |
| 0.05 | 45.8 | 0.124 | 100% |
| 0.1 | FAIL | 1.124 | 0% |
| 0.5 | FAIL | 0.834 | 0% |
| 1.0 | FAIL | 0.660 | 0% |
| 2.0 | FAIL | 0.744 | 0% |
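
A row of this table can be derived from per-seed outcomes roughly as below. This is a sketch under assumptions: the record format (an epoch count when a seed solves, `None` when it hits max_epochs) and the helper name `summarize` are hypothetical, and the layout of results.json is not shown here. Averaging epochs over solved seeds only is also an assumption; the data above cannot distinguish it from other conventions because every WD value is either 0% or 100%.

```python
def summarize(epochs_per_seed):
    """Return (avg_epochs, success_rate) for one WD value.

    epochs_per_seed: list holding the epoch at which each seed solved the
    task, or None if that seed did not solve it within max_epochs.
    """
    solved = [e for e in epochs_per_seed if e is not None]
    success_rate = len(solved) / len(epochs_per_seed)
    # Average only over solved seeds; None plays the role of "FAIL".
    avg_epochs = sum(solved) / len(solved) if solved else None
    return avg_epochs, success_rate


# Made-up inputs for illustration only, NOT the experiment's per-seed data.
print(summarize([30, 40, 50]))  # (40.0, 1.0)
print(summarize([None, None]))  # (None, 0.0)
```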
Analysis
What worked
- WD=0.01 remains best: 100% success, fastest average (39 epochs, 0.108s)
- WD=0.05 also works: slightly slower (45.8 epochs) but reliable
- The working range [0.01, 0.05] is narrow but reliable
What didn't work
- WD=0.001 (too weak): weights grow largely unchecked, no phase transition in 200 epochs
- WD>=0.1 (too strong): regularization penalty dominates the loss, weights get shrunk too aggressively for the network to learn the parity function
- No WD value beats the existing default of 0.01
Surprise
- The working WD range is narrow: only a 5x range (0.01-0.05) out of a 2000x sweep. WD is tightly coupled to LR. The effective regularization is LR × WD, so at LR=0.1 the working range for LR × WD is [0.001, 0.005]. WD=0.001 fails because LR × WD = 0.0001 is too weak; WD=0.1 fails because LR × WD = 0.01 is too strong.
- WD=0.001 failing is notable: without enough regularization, the phase transition doesn't happen within 200 epochs. At this epoch budget, a minimum amount of weight decay is required for grokking in this regime.
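
The LR × WD arithmetic above can be checked with a short sketch. It assumes the update folds an L2 penalty into the gradient, `w <- w - lr * (grad + wd * w)`, which is the form under which LR × WD is the per-step shrink on the weights. The experiment's actual update rule is not shown here, so `sgd_step` is a hypothetical stand-in.

```python
LR = 0.1
WD_VALUES = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0]


def sgd_step(w, grad, lr, wd):
    """One SGD step with L2 weight decay (assumed form, not the source's code).

    Algebraically equal to w * (1 - lr * wd) - lr * grad, so each step
    multiplies the weights by (1 - lr * wd) before the gradient move.
    """
    return w - lr * (grad + wd * w)


for wd in WD_VALUES:
    eff = LR * wd  # effective per-step regularization strength
    print(f"wd={wd:<6} lr*wd={eff:.4f}  shrink factor={1 - eff:.4f}")
```

With LR=0.1, the two working values WD=0.01 and WD=0.05 correspond to shrink factors of 0.999 and 0.995 per step.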
Open Questions (for next experiment)
- Does the working WD range shift with LR? For example, LR=0.05 might work with WD=0.1 (keeping LR × WD = 0.005)
- Does WD=0.001 eventually solve if given more epochs (500+)? If so, that would suggest WD controls grokking speed, not capability
- Can a WD schedule (start high, decay) accelerate the phase transition?
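
Two of these follow-ups can be set up with small helpers, sketched below. Both functions are hypothetical and untested against the task. `wd_for_lr` picks the WD that holds LR × WD at a target value, and `linear_wd_schedule` is one possible "start high, decay" schedule. The endpoints 0.05 and 0.01 are simply the two working values from this sweep, chosen for illustration.

```python
def wd_for_lr(lr, target_product=0.005):
    """Weight decay that keeps lr * wd equal to target_product."""
    return target_product / lr


def linear_wd_schedule(epoch, wd_start, wd_end, decay_epochs):
    """Move linearly from wd_start to wd_end over decay_epochs, then hold."""
    frac = min(epoch / decay_epochs, 1.0)
    return wd_start + frac * (wd_end - wd_start)


# Candidate (lr, wd) pairs with the same product as lr=0.1, wd=0.05.
for lr in (0.05, 0.1, 0.2):
    print(f"lr={lr}: wd={wd_for_lr(lr):.3f}")

# WD at a few epochs for a 0.05 -> 0.01 schedule over 50 epochs.
for epoch in (0, 25, 50, 100):
    print(epoch, round(linear_wd_schedule(epoch, 0.05, 0.01, 50), 4))
```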
Files
- Experiment: src/sparse_parity/experiments/exp_wd_sweep.py
- Results: results/exp_wd_sweep/results.json