Experiment exp_egd: Egalitarian Gradient Descent¶

Date: 2026-03-16 Status: PARTIAL Answers: TODO.md "SGD Under 10ms" / EGD hypothesis

Hypothesis¶

If we replace gradient singular values with 1 (EGD, arXiv:2510.04930), then the grokking plateau shortens because all gradient directions evolve at equal speed, and we can push toward sub-10ms wall time.

Config¶

Parameter	SGD Baseline	EGD
n_bits	20	20
k_sparse	3	3
hidden	200	200
lr	0.1	0.1
wd	0.01	0.01
batch_size	32	32
max_epochs	200	200
n_train	1000	1000
seeds	42-46	42-46

Results¶

Part 1: Grokking Elimination (CPU, numpy)¶

EGD at lr=0.1 is the clear winner. Lower lr values slow convergence.

Method	Avg Acc	Ep to 90%	Ep to Solve	Time (CPU)	Solved
SGD (lr=0.1)	100%	33	40	0.112s	5/5
EGD (lr=0.1)	100%	14	21	0.297s	5/5
EGD (lr=0.05)	99.8%	27	38	0.552s	5/5
EGD (lr=0.01)	96.8%	129	---	1.280s	4/5
EGD (lr=0.005)	75.6%	120	---	1.529s	1/5

EGD at lr=0.1 reaches 90% in 14 epochs vs SGD's 33 (2.4x faster in epochs). Solves in 21 vs 40 (1.9x faster).

CPU wall time is 2.7x worse due to SVD overhead (0.12ms per SVD on 200x20 matrix, ~31 batches/epoch = 3.7ms/epoch overhead).

Part 2: Sub-10ms Push (CPU, small configs)¶

Small hidden (50) fails for both EGD and SGD at n_train=200. At n_train=500 with hidden=50, SGD solves 4/5 seeds (avg 144 epochs, 0.166s) while EGD solves 2/5 (avg 136 epochs, 0.393s). Neither approach breaks 10ms.

The capacity/data bottleneck at small sizes dominates -- optimizer choice matters less than having enough hidden units and training data.

Config	Method	Avg ms	Ep to 90%	Ep to Solve	Solved
Parity h=200	SGD	1,068	36	42	5/5
Parity h=200	EGD	1,207	14	19	5/5
Parity h=50	SGD	2,012	139	164	5/5
Parity h=50	EGD	3,140	69	106	5/5

On GPU, EGD is 12% slower in wall time despite 2x fewer epochs (parity h=200). SVD overhead per batch is not amortized by the epoch reduction.

For h=50, EGD is 56% slower -- smaller matrices make SVD overhead proportionally larger.

Part 4: Sparse Sum (GPU)¶

Config	Method	Avg ms	Ep to 90%	Ep to Solve	Solved
Sum h=200	SGD	3,853	---	---	0/5
Sum h=200	EGD	1,484	10	24	5/5

SGD at lr=0.1 diverges on sum (MSE loss, targets in [-3, 3]). The gradient magnitude from MSE scales with target range, making lr=0.1 too aggressive.

EGD handles this because SVD normalization removes gradient magnitude information. All directions get unit-norm updates regardless of target scale.

Note: SGD would solve sum with a lower lr (e.g., 0.01) or target normalization. The failure is a hyperparameter issue, not a fundamental limitation. But EGD's robustness to learning rate / target scale is a real property.

Key Table¶

Metric	SGD	EGD	Ratio
Parity: epochs to 90% (GPU)	36	14	2.6x fewer
Parity: epochs to solve (GPU)	42	19	2.2x fewer
Parity: GPU wall time (h=200)	1,068ms	1,207ms	0.88x (slower)
Sum: solved (GPU, lr=0.1)	0/5	5/5	EGD robust

Analysis¶

What worked¶

EGD shortens the grokking plateau by 2-2.5x on parity. Equalizing gradient singular values helps the network find the sparse feature directions faster.
EGD is robust to gradient scale. On sum, where MSE gradients are large, EGD's normalization prevents divergence that kills SGD at the same lr.
lr=0.1 works for both SGD and EGD on parity (same lr, no tuning needed).

What didn't work¶

Sub-10ms is not achievable. SVD overhead per batch (~0.12ms CPU, similar GPU) outweighs the epoch savings. Even with 2x fewer epochs, wall time is 12% worse.
Small hidden (50) is capacity-limited. Both methods struggle. The optimizer cannot compensate for insufficient model capacity.
Lower EGD learning rates (0.01, 0.005) are too slow -- normalizing gradients to unit norm means the lr directly controls step size, and small lr means slow convergence.

Surprise¶

SGD fails on sparse sum at lr=0.1 (0/5 seeds) while EGD solves it (5/5). EGD's gradient normalization makes it robust to loss/target scale, independent of its convergence speed benefit.

Open Questions (for next experiment)¶

Does EGD + curriculum learning compound? Curriculum gave 14.6x on n=50, EGD gives 2x on convergence. If multiplicative, that's ~30x.
Can a cheaper approximation to SVD (e.g., power iteration for top singular vectors only) reduce the overhead while keeping the convergence benefit?
What happens with EGD on k=5? The grokking plateau is much longer there.
SGD on sum with lr=0.01: does it then match EGD's convergence speed?

Files¶

CPU experiment: src/sparse_parity/experiments/exp_egd.py
GPU experiment: bin/gpu_egd.py
CPU results: results/exp_egd/results.json
GPU results: results/exp_egd/gpu_results.json