Skip to content

Experiment: Push SGD Under 10ms on Sparse Parity

Date: 2026-03-14 Status: LOSS (SGD cannot reach 10ms on n=20/k=3) Issue: #4

Hypothesis

SGD can solve n=20/k=3 sparse parity in under 10ms by shrinking the model (hidden=32-64), reducing training samples, or using aggressive learning rates.

Results

Hidden size sweep (lr=0.5, max_epochs=50)

Hidden n_train Accuracy Time Epochs
64 200 52% 108ms 50 (failed)
64 500 100% 113ms 21
64 1000 100% 95ms 8
100 500 100% 71ms 18
100 1000 100% 68ms 9
200 500 100% 73ms 19
200 1000 100% 72ms 7

Learning rate sweep (hidden=200, n_train=1000)

LR Accuracy Time Epochs
0.1 100% 142ms ~40
0.2 100% 104ms 19
0.5 100% 94ms 7
1.0 100% 74ms 8
2.0 100% 78ms 11

Other attempts

Config Accuracy Time Notes
hidden=32, n_train=200 51% 193ms Too small, can't learn
batch=128 99.5% 349ms Larger batch hurts
batch=1000 (full) 53.5% 129ms Full batch diverges
n=10/k=3 100% 58ms Easier problem, still slow

Finding

SGD cannot solve n=20/k=3 sparse parity in under 60ms with any configuration.

The floor is approximately 7 grokking epochs at 8-12ms per epoch. The phase transition (where the network suddenly learns the parity function) requires a minimum number of weight updates that cannot be reduced by: - Shrinking the model (hidden=64 needs more epochs, cancels out) - Increasing learning rate (LR=0.5-1.0 reduces to 7-8 epochs, can't go lower) - Reducing training samples (n_train=200 fails entirely for small models) - Larger batch size (hurts or fails)

Why this matters

LeCun's Spark 7 experiments worked because digit recognition has first-order structure. Each pixel contributes independently to the output. SGD can learn this quickly.

Parity has zero first-order structure. The signal exists only in the k-th order interaction of the secret bits. SGD must discover this through grokking, which takes a minimum number of gradient steps regardless of model configuration.

The 10ms budget ("runnable on 1980s hardware in 1 hour") correctly filters SGD out of the viable methods for sparse parity. The algebraic methods (GF(2) at 0.5ms, KM at 1-4ms) fit. SGD doesn't. This is not a failure of SGD optimization. It is a structural mismatch between the algorithm and the problem.

Best SGD config found

hidden=200, n_train=1000, lr=0.5, batch=32: 100% accuracy in 7 epochs / 72ms. This is 2x faster than the default config (142ms) but still 7x over the 10ms target.