Experiment sign_sgd: Sign SGD for k=5 Sparse Parity¶
Date: 2026-03-04 Status: SUCCESS (with caveats) Answers: Open Question #1, "Can Sign SGD solve k=5?"
Hypothesis¶
If we replace gradient descent with sign(gradient) descent, then k=5 sparse parity will become solvable because Sign SGD needs n^{k-1} samples instead of n^k (Kou et al. 2024).
Config¶
| Parameter | Value |
|---|---|
| n_bits | 20 (also 30) |
| k_sparse | 3 and 5 |
| hidden | 200 (k=3), 500 (k=5) |
| lr | 0.01 (sign), 0.1 (standard) |
| wd | 0.01 |
| batch_size | 32 |
| max_epochs | 200-500 |
| n_train | 1000-50000 |
| seed | 42, 43, 44 |
| method | sign-sgd vs standard |
Results¶
| Metric | Value |
|---|---|
| Best test accuracy (sign, k=5, 5K) | 100% |
| Epochs to >90% (sign, k=5, 5K) | 7 (vs 14 standard) |
| Best test accuracy (std, k=5, 5K) | 100% |
| Wall time (sign, k=5, 5K) | 0.44s |
| Wall time (std, k=5, 5K) | 0.81s |
Summary Table¶
| Config | Method | n_train | Best Acc | Ep->90% | Time(s) | Solved |
|---|---|---|---|---|---|---|
| n=20,k=3 | standard | 1000 | 100% | 34 | 0.68 | 3/3 |
| n=20,k=3 | sign (lr=0.01) | 1000 | 99.7% | 26 | 0.42 | 3/3 |
| n=20,k=5 | standard | 5000 | 100% | 14 | 0.81 | 3/3 |
| n=20,k=5 | sign (lr=0.01) | 5000 | 100% | 7 | 0.44 | 3/3 |
| n=20,k=5 | sign (lr=0.01) | 20000 | 100% | 2 | 0.81 | 3/3 |
| n=20,k=5 | sign (lr=0.001) | 5000 | 98.1% | 145 | 14.06 | 3/3 |
| n=30,k=3 | standard | 2000 | 100% | 22 | 0.18 | 3/3 |
| n=30,k=3 | sign (lr=0.01) | 2000 | 100% | 15 | 0.36 | 3/3 |
Analysis¶
What worked¶
- Sign SGD solves k=5 reliably (100% across 3 seeds)
- Sign SGD converges 2x faster to 90% on k=5 (7 vs 14 epochs with n_train=5000)
- With 20K samples, Sign SGD reaches 90% in just 2 epochs
- For k=3, Sign SGD reaches 90% slightly faster (26 vs 34 epochs on n=20; 15 vs 22 on n=30)
What didn't work¶
- Sign SGD with lr=0.001 is very slow (145 epochs to 90% on k=5 with 5K samples)
- Sign SGD oscillates near the top on k=3, reaches 99% but struggles to hit 100% (due to fixed step size)
- lr=0.001 on k=3 fails entirely (78% max in 200 epochs)
Surprise¶
Standard SGD also solves k=5 with enough data. The earlier exp_d finding that "k=5 is impractical" was wrong: it used n_train=200 (= 10*n_bits). With n_train=5000, standard SGD solves n=20/k=5 at 100% in 14 epochs. The bottleneck was training data, not the optimization algorithm.
This means the n^k sample complexity bound is pessimistic in practice. With n_train=5000 << n^k=3,200,000, standard SGD still solves it.
Sign SGD's advantage is convergence speed (2x fewer epochs to 90%) and data efficiency (reaches 90% in 2 epochs with 20K samples vs much more with standard SGD at lower lr).
Open Questions (for next experiment)¶
- What is the actual minimum n_train for k=5 with standard SGD vs sign SGD? Binary search to find the threshold.
- Does Sign SGD help at k=7 or k=9 where the gap should be larger?
- Can learning rate warmup or decay fix Sign SGD's oscillation problem near 100%?
- What does Sign SGD's ARD look like? The sign() operation itself is cheap, but does the convergence speed offset help energy?
Files¶
- Experiment:
src/sparse_parity/experiments/exp_sign_sgd.py - Results:
results/exp_sign_sgd/results.json