# Experiment: Push SGD Under 10ms on Sparse Parity
**Date:** 2026-03-14 **Status:** LOSS (SGD cannot reach 10ms on n=20/k=3) **Issue:** #4
## Hypothesis
SGD can solve n=20/k=3 sparse parity in under 10ms by shrinking the model (hidden=32-64), reducing training samples, or using aggressive learning rates.
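To make the task concrete, a minimal data generator for sparse parity (a sketch only; `make_sparse_parity`, the seed, and the sampling details are assumptions, not the experiment harness itself):

```python
import numpy as np

def make_sparse_parity(n=20, k=3, n_samples=1000, seed=0):
    """Generate a sparse-parity dataset: the label is the XOR of k secret bits."""
    rng = np.random.default_rng(seed)
    secret = rng.choice(n, size=k, replace=False)   # indices of the k relevant bits
    X = rng.integers(0, 2, size=(n_samples, n))     # uniform random n-bit inputs
    y = X[:, secret].sum(axis=1) % 2                # parity of the secret bits
    return X, y, secret

X, y, secret = make_sparse_parity()
```

The other n−k bits are pure noise: no label information survives outside the k-bit interaction.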
## Results
### Hidden size sweep (lr=0.5, max_epochs=50)
| Hidden | n_train | Accuracy | Time | Epochs |
|---|---|---|---|---|
| 64 | 200 | 52% | 108ms | 50 (failed) |
| 64 | 500 | 100% | 113ms | 21 |
| 64 | 1000 | 100% | 95ms | 8 |
| 100 | 500 | 100% | 71ms | 18 |
| 100 | 1000 | 100% | 68ms | 9 |
| 200 | 500 | 100% | 73ms | 19 |
| 200 | 1000 | 100% | 72ms | 7 |
### Learning rate sweep (hidden=200, n_train=1000)
| LR | Accuracy | Time | Epochs |
|---|---|---|---|
| 0.1 | 100% | 142ms | ~40 |
| 0.2 | 100% | 104ms | 19 |
| 0.5 | 100% | 94ms | 7 |
| 1.0 | 100% | 74ms | 8 |
| 2.0 | 100% | 78ms | 11 |
### Other attempts
| Config | Accuracy | Time | Notes |
|---|---|---|---|
| hidden=32, n_train=200 | 51% | 193ms | Too small, can't learn |
| batch=128 | 99.5% | 349ms | Larger batch hurts |
| batch=1000 (full) | 53.5% | 129ms | Full batch diverges |
| n=10/k=3 | 100% | 58ms | Easier problem, still slow |
## Finding
SGD could not solve n=20/k=3 sparse parity in under 60ms in any configuration tested.
The floor is approximately 7 grokking epochs at 8-12ms per epoch. The phase transition (where the network suddenly learns the parity function) requires a minimum number of weight updates that cannot be reduced by:

- Shrinking the model (hidden=64 needs more epochs, which cancels out the per-epoch savings)
- Increasing the learning rate (lr=0.5-1.0 reduces to 7-8 epochs; can't go lower)
- Reducing training samples (n_train=200 fails entirely for small models)
- Increasing the batch size (hurts or fails)
## Why this matters
LeCun's Spark 7 experiments worked because digit recognition has first-order structure: each pixel contributes independently to the output, so SGD can learn it quickly.
Parity has zero first-order structure. The signal exists only in the k-th order interaction of the secret bits. SGD must discover this through grokking, which takes a minimum number of gradient steps regardless of model configuration.
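The zero-first-order claim can be checked directly. Enumerating the full input cube for a smaller instance (n=10, k=3, with a hypothetical secret set) shows that every single bit's covariance with the label is exactly zero:

```python
import itertools
import numpy as np

n, secret = 10, [0, 1, 2]               # hypothetical secret bits for illustration
X = np.array(list(itertools.product([0, 1], repeat=n)))  # all 2^n inputs
y = X[:, secret].sum(axis=1) % 2        # parity labels

# Covariance of each individual bit with the label, over the full cube.
cov = ((X - X.mean(axis=0)) * (y - y.mean())[:, None]).mean(axis=0)
print(np.abs(cov).max())                # 0: no single bit carries any signal
```

Fixing any one secret bit leaves the parity of the remaining two uniform, so even the secret bits are individually uncorrelated with the label; only the joint k-bit statistic is informative.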
The 10ms budget ("runnable on 1980s hardware in 1 hour") correctly filters SGD out of the viable methods for sparse parity. The algebraic methods (GF(2) at 0.5ms, KM at 1-4ms) fit. SGD doesn't. This is not a failure of SGD optimization. It is a structural mismatch between the algorithm and the problem.
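For reference, the GF(2) route is ordinary Gaussian elimination over bits: noiseless parity samples form a linear system X·w = y (mod 2) whose solution's support is the secret set. A minimal sketch (`solve_gf2` is illustrative, not the benchmarked implementation):

```python
import numpy as np

def solve_gf2(A, b):
    """Solve A x = b over GF(2) by Gaussian elimination; returns one solution or None."""
    A, b = A.copy() % 2, b.copy() % 2
    m, n = A.shape
    pivots, row = [], 0
    for col in range(n):
        pivot = next((r for r in range(row, m) if A[r, col]), None)
        if pivot is None:
            continue                      # no pivot in this column; variable is free
        A[[row, pivot]] = A[[pivot, row]] # swap pivot row into place
        b[[row, pivot]] = b[[pivot, row]]
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]            # eliminate the column everywhere else
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    if any(b[r] and not A[r].any() for r in range(row, m)):
        return None                       # inconsistent system
    x = np.zeros(n, dtype=A.dtype)
    for r, col in enumerate(pivots):      # free variables stay 0
        x[col] = b[r]
    return x

rng = np.random.default_rng(0)
secret = [2, 7, 11]                       # hypothetical secret indices
X = rng.integers(0, 2, size=(100, 20))
y = X[:, secret].sum(axis=1) % 2          # noiseless parity labels
w = solve_gf2(X, y)                       # a parity vector consistent with all samples
```

Elimination on a 100×20 bit matrix is a few thousand XORs, which is why this route lands well under the 10ms budget.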
## Best SGD config found
hidden=200, n_train=1000, lr=0.5, batch=32: 100% accuracy in 7 epochs / 72ms. This is 2x faster than the default config (142ms) but still 7x over the 10ms target.
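A sketch of the kind of training loop behind that run: a 2-layer ReLU MLP with hidden=200 trained by plain minibatch SGD on a binary cross-entropy loss. The initialization scale, seed, loss, and logit clipping are assumptions; this is not the benchmarked code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, hidden, n_train = 20, 3, 200, 1000
lr, batch = 0.5, 32
secret = rng.choice(n, size=k, replace=False)
X = rng.integers(0, 2, size=(n_train, n)).astype(np.float64)
y = X[:, secret].sum(axis=1) % 2

W1 = rng.normal(0, 1 / np.sqrt(n), (n, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 1 / np.sqrt(hidden), hidden)
b2 = 0.0

def forward(Xb):
    h = np.maximum(Xb @ W1 + b1, 0.0)      # ReLU hidden layer
    z = np.clip(h @ W2 + b2, -30, 30)      # clip logits for a stable sigmoid
    return h, 1 / (1 + np.exp(-z))

for epoch in range(50):
    perm = rng.permutation(n_train)
    for i in range(0, n_train, batch):
        idx = perm[i:i + batch]
        Xb, yb = X[idx], y[idx]
        h, p = forward(Xb)
        g = (p - yb) / len(idx)            # dBCE/dlogit, averaged over the minibatch
        gW2, gb2 = h.T @ g, g.sum()
        gh = np.outer(g, W2) * (h > 0)     # backprop through the ReLU
        gW1, gb1 = Xb.T @ gh, gh.sum(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1
    acc = ((forward(X)[1] > 0.5) == y).mean()
    if acc == 1.0:
        break                              # stop at the grokking transition
```

Even at 100% accuracy the epoch count, not per-epoch cost, sets the wall-clock floor, which is the structural limit described above.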