Skip to content

Experiment exp_egd: Egalitarian Gradient Descent

Date: 2026-03-16 Status: PARTIAL Answers: TODO.md "SGD Under 10ms" / EGD hypothesis

Hypothesis

If we replace gradient singular values with 1 (EGD, arXiv:2510.04930), then the grokking plateau shortens because all gradient directions evolve at equal speed, and we can push toward sub-10ms wall time.

Config

Parameter SGD Baseline EGD
n_bits 20 20
k_sparse 3 3
hidden 200 200
lr 0.1 0.1
wd 0.01 0.01
batch_size 32 32
max_epochs 200 200
n_train 1000 1000
seeds 42-46 42-46

Results

Part 1: Grokking Elimination (CPU, numpy)

EGD at lr=0.1 is the clear winner. Lower lr values slow convergence.

Method Avg Acc Ep to 90% Ep to Solve Time (CPU) Solved
SGD (lr=0.1) 100% 33 40 0.112s 5/5
EGD (lr=0.1) 100% 14 21 0.297s 5/5
EGD (lr=0.05) 99.8% 27 38 0.552s 5/5
EGD (lr=0.01) 96.8% 129 --- 1.280s 4/5
EGD (lr=0.005) 75.6% 120 --- 1.529s 1/5

EGD at lr=0.1 reaches 90% in 14 epochs vs SGD's 33 (2.4x faster in epochs). Solves in 21 vs 40 (1.9x faster).

CPU wall time is 2.7x worse due to SVD overhead (0.12ms per SVD on 200x20 matrix, ~31 batches/epoch = 3.7ms/epoch overhead).

Part 2: Sub-10ms Push (CPU, small configs)

Small hidden (50) fails for both EGD and SGD at n_train=200. At n_train=500 with hidden=50, SGD solves 4/5 seeds (avg 144 epochs, 0.166s) while EGD solves 2/5 (avg 136 epochs, 0.393s). Neither approach breaks 10ms.

The capacity/data bottleneck at small sizes dominates -- optimizer choice matters less than having enough hidden units and training data.

Part 3: GPU Timing (Modal L4)

Config Method Avg ms Ep to 90% Ep to Solve Solved
Parity h=200 SGD 1,068 36 42 5/5
Parity h=200 EGD 1,207 14 19 5/5
Parity h=50 SGD 2,012 139 164 5/5
Parity h=50 EGD 3,140 69 106 5/5

On GPU, EGD is 12% slower in wall time despite 2x fewer epochs (parity h=200). SVD overhead per batch is not amortized by the epoch reduction.

For h=50, EGD is 56% slower -- smaller matrices make SVD overhead proportionally larger.

Part 4: Sparse Sum (GPU)

Config Method Avg ms Ep to 90% Ep to Solve Solved
Sum h=200 SGD 3,853 --- --- 0/5
Sum h=200 EGD 1,484 10 24 5/5

SGD at lr=0.1 diverges on sum (MSE loss, targets in [-3, 3]). The gradient magnitude from MSE scales with target range, making lr=0.1 too aggressive.

EGD handles this because SVD normalization removes gradient magnitude information. All directions get unit-norm updates regardless of target scale.

Note: SGD would solve sum with a lower lr (e.g., 0.01) or target normalization. The failure is a hyperparameter issue, not a fundamental limitation. But EGD's robustness to learning rate / target scale is a real property.

Key Table

Metric SGD EGD Ratio
Parity: epochs to 90% (GPU) 36 14 2.6x fewer
Parity: epochs to solve (GPU) 42 19 2.2x fewer
Parity: GPU wall time (h=200) 1,068ms 1,207ms 0.88x (slower)
Sum: solved (GPU, lr=0.1) 0/5 5/5 EGD robust

Analysis

What worked

  • EGD shortens the grokking plateau by 2-2.5x on parity. Equalizing gradient singular values helps the network find the sparse feature directions faster.
  • EGD is robust to gradient scale. On sum, where MSE gradients are large, EGD's normalization prevents divergence that kills SGD at the same lr.
  • lr=0.1 works for both SGD and EGD on parity (same lr, no tuning needed).

What didn't work

  • Sub-10ms is not achievable. SVD overhead per batch (~0.12ms CPU, similar GPU) outweighs the epoch savings. Even with 2x fewer epochs, wall time is 12% worse.
  • Small hidden (50) is capacity-limited. Both methods struggle. The optimizer cannot compensate for insufficient model capacity.
  • Lower EGD learning rates (0.01, 0.005) are too slow -- normalizing gradients to unit norm means the lr directly controls step size, and small lr means slow convergence.

Surprise

SGD fails on sparse sum at lr=0.1 (0/5 seeds) while EGD solves it (5/5). EGD's gradient normalization makes it robust to loss/target scale, independent of its convergence speed benefit.

Open Questions (for next experiment)

  • Does EGD + curriculum learning compound? Curriculum gave 14.6x on n=50, EGD gives 2x on convergence. If multiplicative, that's ~30x.
  • Can a cheaper approximation to SVD (e.g., power iteration for top singular vectors only) reduce the overhead while keeping the convergence benefit?
  • What happens with EGD on k=5? The grokking plateau is much longer there.
  • SGD on sum with lr=0.01: does it then match EGD's convergence speed?

Files

  • CPU experiment: src/sparse_parity/experiments/exp_egd.py
  • GPU experiment: bin/gpu_egd.py
  • CPU results: results/exp_egd/results.json
  • GPU results: results/exp_egd/gpu_results.json