
Experiment sign_sgd: Sign SGD for k=5 Sparse Parity

Date: 2026-03-04
Status: SUCCESS (with caveats)
Answers: Open Question #1, "Can Sign SGD solve k=5?"

Hypothesis

If we replace gradient descent with sign(gradient) descent, then k=5 sparse parity will become solvable because Sign SGD needs n^{k-1} samples instead of n^k (Kou et al. 2024).
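The two update rules differ only in how the step is formed. A minimal sketch (plain NumPy, not the experiment's actual implementation):

```python
import numpy as np

def sgd_step(params, grads, lr=0.1):
    """Standard SGD: step size scales with gradient magnitude."""
    return [p - lr * g for p, g in zip(params, grads)]

def sign_sgd_step(params, grads, lr=0.01):
    """Sign SGD: fixed-size step along the sign of each gradient entry."""
    return [p - lr * np.sign(g) for p, g in zip(params, grads)]
```

Because every coordinate moves by exactly ±lr per step, Sign SGD ignores gradient magnitudes entirely, which is what both speeds up early progress and causes the fixed-step oscillation discussed below.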

Config

| Parameter | Value |
| --- | --- |
| n_bits | 20 (also 30) |
| k_sparse | 3 and 5 |
| hidden | 200 (k=3), 500 (k=5) |
| lr | 0.01 (sign), 0.1 (standard) |
| wd | 0.01 |
| batch_size | 32 |
| max_epochs | 200-500 |
| n_train | 1000-50000 |
| seed | 42, 43, 44 |
| method | sign-sgd vs standard |
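For reference, k-sparse parity data can be generated as below. Which k bits are relevant is fixed to the first k here as an assumption for illustration; the experiment script may sample them randomly.

```python
import numpy as np

def make_sparse_parity(n_samples, n_bits=20, k_sparse=5, seed=42):
    """Random ±1 inputs; label = parity (product) of k relevant bits.

    Uses the first k bits as the relevant set (an illustrative choice,
    not necessarily what exp_sign_sgd.py does).
    """
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_samples, n_bits))
    y = np.prod(X[:, :k_sparse], axis=1)  # ±1 parity label
    return X, y
```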

Results

| Metric | Value |
| --- | --- |
| Best test accuracy (sign, k=5, 5K) | 100% |
| Epochs to >90% (sign, k=5, 5K) | 7 (vs 14 standard) |
| Best test accuracy (std, k=5, 5K) | 100% |
| Wall time (sign, k=5, 5K) | 0.44 s |
| Wall time (std, k=5, 5K) | 0.81 s |

Summary Table

| Config | Method | n_train | Best Acc | Ep->90% | Time (s) | Solved |
| --- | --- | --- | --- | --- | --- | --- |
| n=20, k=3 | standard | 1000 | 100% | 34 | 0.68 | 3/3 |
| n=20, k=3 | sign (lr=0.01) | 1000 | 99.7% | 26 | 0.42 | 3/3 |
| n=20, k=5 | standard | 5000 | 100% | 14 | 0.81 | 3/3 |
| n=20, k=5 | sign (lr=0.01) | 5000 | 100% | 7 | 0.44 | 3/3 |
| n=20, k=5 | sign (lr=0.01) | 20000 | 100% | 2 | 0.81 | 3/3 |
| n=20, k=5 | sign (lr=0.001) | 5000 | 98.1% | 145 | 14.06 | 3/3 |
| n=30, k=3 | standard | 2000 | 100% | 22 | 0.18 | 3/3 |
| n=30, k=3 | sign (lr=0.01) | 2000 | 100% | 15 | 0.36 | 3/3 |

Analysis

What worked

  • Sign SGD solves k=5 reliably (100% across 3 seeds)
  • Sign SGD converges 2x faster to 90% on k=5 (7 vs 14 epochs with n_train=5000)
  • With 20K samples, Sign SGD reaches 90% in just 2 epochs
  • For k=3, Sign SGD reaches 90% slightly faster (26 vs 34 epochs on n=20; 15 vs 22 on n=30)

What didn't work

  • Sign SGD with lr=0.001 is very slow (145 epochs to 90% on k=5 with 5K samples)
  • Sign SGD oscillates near convergence on k=3: it reaches 99.7% but struggles to hit 100%, likely because the fixed step size cannot shrink near the optimum
  • lr=0.001 on k=3 fails entirely (78% max in 200 epochs)
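One candidate fix for the fixed-step oscillation is a decaying learning-rate schedule, e.g. cosine decay. A sketch (untested in this experiment; `lr_max`/`lr_min` values are assumptions):

```python
import math

def cosine_lr(step, total_steps, lr_max=0.01, lr_min=1e-4):
    """Cosine decay: shrinks Sign SGD's fixed step as training progresses,
    which may damp the oscillation seen near 100% accuracy."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The schedule starts at lr_max (the fast-converging 0.01 setting) and decays toward lr_min, so the large early steps are kept while late steps become small enough to settle.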

Surprise

Standard SGD also solves k=5 with enough data. The earlier exp_d finding that "k=5 is impractical" was wrong: it used n_train=200 (= 10*n_bits). With n_train=5000, standard SGD solves n=20/k=5 at 100% in 14 epochs. The bottleneck was training data, not the optimization algorithm.

This suggests the n^k sample-complexity bound is pessimistic in practice: with n_train = 5000, far below n^k = 20^5 = 3,200,000, standard SGD still solves it.

Sign SGD's remaining advantage is convergence speed: 2x fewer epochs to 90% at n_train=5000, and just 2 epochs with 20K samples.

Open Questions (for next experiment)

  • What is the actual minimum n_train for k=5 with standard SGD vs sign SGD? Binary search to find the threshold.
  • Does Sign SGD help at k=7 or k=9 where the gap should be larger?
  • Can learning rate warmup or decay fix Sign SGD's oscillation problem near 100%?
  • What does Sign SGD's ARD look like? The sign() operation itself is cheap, but does the faster convergence help offset energy use?
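The minimum-n_train question above could be answered with a standard lower-bound binary search, assuming solvability is monotone in training-set size. Here `solves` is a hypothetical callback that trains on n samples and reports whether all seeds reach >90%:

```python
def min_n_train(solves, lo=200, hi=50_000):
    """Smallest n in [lo, hi] with solves(n) True, assuming monotonicity.

    `solves` is a placeholder for "train on n samples, check >90% test
    accuracy across seeds". Returns hi if no n in the range solves it.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if solves(mid):
            hi = mid  # threshold is at mid or below
        else:
            lo = mid + 1  # threshold is above mid
    return lo
```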

Files

  • Experiment: src/sparse_parity/experiments/exp_sign_sgd.py
  • Results: results/exp_sign_sgd/results.json