Experiment 1: Fix Hyperparameters (Barak et al. 2022)

Date: 2026-03-03
Status: SUCCESS -- 99.0% test accuracy on 20-bit sparse parity (k=3)

Hypothesis

Matching Barak et al. 2022's hyperparameters (LR=0.1, batch_size=32, more epochs) will trigger the phase transition on 20-bit sparse parity (k=3), breaking past the ~54% ceiling.

Result

Hypothesis confirmed. The model achieved 99.0% test accuracy at epoch 52 (832 SGD steps), solving 20-bit sparse parity cleanly.

Configuration

Parameter	Old (baseline)	New (this experiment)
n_bits	20	20
k_sparse	3	3
hidden	2000	1000
n_train	200	500
n_test	200	200
lr	0.5	0.1
wd	0.01	0.01
batch_size	1 (online)	32 (mini-batch)
max_epochs	50	200 (solved at 52)
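The table can also be kept as data so that the changed parameters are easy to list. The following is a minimal sketch and not the experiment script itself: `RunConfig`, `BASELINE`, `EXP1`, and `diff_configs` are hypothetical names, and only the values come from the table above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    """Hyperparameters from the configuration table (field names mirror the table)."""
    n_bits: int
    k_sparse: int
    hidden: int
    n_train: int
    n_test: int
    lr: float
    wd: float
    batch_size: int
    max_epochs: int


# Values copied from the "Old (baseline)" column.
BASELINE = RunConfig(n_bits=20, k_sparse=3, hidden=2000, n_train=200, n_test=200,
                     lr=0.5, wd=0.01, batch_size=1, max_epochs=50)

# Values copied from the "New (this experiment)" column.
EXP1 = RunConfig(n_bits=20, k_sparse=3, hidden=1000, n_train=500, n_test=200,
                 lr=0.1, wd=0.01, batch_size=32, max_epochs=200)


def diff_configs(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    old_d, new_d = asdict(old), asdict(new)
    return {k: (old_d[k], new_d[k]) for k in old_d if old_d[k] != new_d[k]}


for name, (old, new) in diff_configs(BASELINE, EXP1).items():
    print(f"{name}: {old} -> {new}")
```

Listing the diff this way makes it visible that five parameters changed at once (hidden, n_train, lr, batch_size, max_epochs), which matters when attributing the improvement to any single one.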

Observations

Phase transition / grokking pattern

The training curve follows the grokking pattern:

Epochs 1-20: Train accuracy rises to ~80%, test accuracy stays at chance (~50%). The model memorizes.
Epochs 20-40: Train accuracy continues climbing, test accuracy begins to move (59% at epoch 30, 75% at epoch 40).
Epochs 40-52: Phase transition. Test accuracy jumps from 75% to 99% in 12 epochs.

Hidden progress tracking

The L1 weight movement ||w_t - w_0||_1 grew steadily throughout training:

Epoch 1: 241
Epoch 30: 3,109 (test acc still ~59%)
Epoch 43: 3,830 (test acc crosses 90%)
Epoch 52: 4,149 (solved at 99%)

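This is consistent with the "hidden progress" phenomenon from Barak et al.: SGD moves weights well before test accuracy responds. By epoch 30 the weight movement had already reached 3,109, about 75% of its final value of 4,149, while test accuracy was still only ~59%. In this run, the weight movement metric acted as a leading indicator.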
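The metric itself is simple to compute. The sketch below assumes the parameters are stored as a list of flat Python lists (one per layer) and that every parameter is included in the sum; the source does not say which parameters the logged values cover, so treat both as assumptions. `l1_weight_movement` is a hypothetical name.

```python
def l1_weight_movement(params, params_init):
    """Sum of |w_t - w_0| over all parameters, i.e. ||w_t - w_0||_1.

    Both arguments are lists of flat lists with matching shapes
    (an assumed storage layout, not necessarily the script's).
    """
    total = 0.0
    for layer, layer_init in zip(params, params_init):
        for w, w0 in zip(layer, layer_init):
            total += abs(w - w0)
    return total


# Toy example: two "layers", three weights in total.
w0 = [[0.5, -0.25], [1.0]]   # snapshot taken at initialization
wt = [[1.5, -0.75], [1.0]]   # weights after some training
print(l1_weight_movement(wt, w0))  # |1.0| + |-0.5| + |0.0| = 1.5
```

The only bookkeeping this needs is a deep copy of the weights taken before the first update, so the metric can be logged once per epoch alongside train and test accuracy.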

What fixed it

Two changes mattered most:

Mini-batch SGD (batch_size=32): Single-sample online SGD produces noisy gradient estimates. Averaging over 32 samples provides a cleaner signal, which matters for sparse parity because only 3 of the 20 bits are relevant.
Lower learning rate (0.1 vs 0.5): LR=0.5 with noisy single-sample gradients causes overshooting. LR=0.1 with mini-batches gives stable convergence.

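More training data also helped (n_train=500 vs 200): more samples reduce overfitting to noise bits. With 500 samples and batch_size=32, each epoch has 16 update steps (15 full batches plus a final batch of 20). Note that the hidden width also changed (2000 to 1000), so its effect is not isolated in this experiment.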
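The mini-batch update described above can be sketched in pure Python as follows. This is an illustration under assumptions, not the code in the experiment script: `iter_minibatches` and `sgd_step` are hypothetical names, the weights are a flat list, and weight decay is applied as a coupled L2 term, which the source does not specify.

```python
import random


def iter_minibatches(n_samples, batch_size, rng):
    """Yield lists of shuffled sample indices, batch_size at a time."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


def sgd_step(weights, per_sample_grads, lr, wd):
    """Update weights in place using the gradient averaged over the batch.

    per_sample_grads holds one gradient (a flat list) per sample in the batch.
    Weight decay is added as wd * w (an assumed form).
    """
    batch = len(per_sample_grads)
    for j in range(len(weights)):
        avg_grad = sum(g[j] for g in per_sample_grads) / batch
        weights[j] -= lr * (avg_grad + wd * weights[j])


rng = random.Random(0)
batches = list(iter_minibatches(500, 32, rng))
print(len(batches), len(batches[0]), len(batches[-1]))  # 16 32 20
```

With batch_size=1 the average in `sgd_step` reduces to a single sample's gradient, which is the online setting used in the baseline.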

Performance

Total training time: ~111 seconds (pure Python, no NumPy)
Steps to solve: 832 (52 epochs x 16 batches/epoch)
This is well within the n^O(k) scaling from Barak et al.: taking n^k with n=20 and k=3 gives 20^3 = 8,000 steps as a rough reference.
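The step count and the reference figure can be checked with a few lines of arithmetic. The variable names are illustrative, and treating n^k as the reference value takes the constant hidden in the O(k) exponent to be 1, which is an assumption.

```python
import math

n_bits, k_sparse = 20, 3
n_train, batch_size = 500, 32
solved_epoch = 52

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(n_train / batch_size)   # 16
steps_to_solve = solved_epoch * steps_per_epoch     # 832
reference_scale = n_bits ** k_sparse                # 8000

print(steps_per_epoch, steps_to_solve, reference_scale)
print(f"fraction of reference used: {steps_to_solve / reference_scale:.3f}")
```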

Secret indices

The randomly selected parity bits were [0, 3, 8] out of 20 bits. The model successfully learned to ignore the 17 noise bits.
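For reference, a sparse parity dataset with these secret indices can be generated as in the sketch below. It assumes bits are encoded as -1/+1 so that the label is the product of the secret bits; the source does not state the encoding used by the experiment script, and `make_sparse_parity` is a hypothetical name.

```python
import random


def make_sparse_parity(n_samples, n_bits, secret, rng):
    """Return (xs, ys): random +/-1 inputs and the parity of the secret bits.

    With a +/-1 encoding (an assumption), the parity label is the product
    of the bits at the secret indices; all other bits are pure noise.
    """
    xs, ys = [], []
    for _ in range(n_samples):
        x = [rng.choice((-1, 1)) for _ in range(n_bits)]
        y = 1
        for i in secret:
            y *= x[i]
        xs.append(x)
        ys.append(y)
    return xs, ys


secret = [0, 3, 8]            # indices reported for this run
rng = random.Random(0)        # arbitrary seed, not the one used in the experiment
xs, ys = make_sparse_parity(500, 20, secret, rng)
print(xs[0], ys[0])
```

Flipping any of the 17 bits outside `secret` leaves the label unchanged, which is what "ignoring the noise bits" means for a model that has solved the task.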

Next Steps

Experiment 2: Sweep weight decay to see if it can accelerate the phase transition
Experiment 3: Try Sign SGD to see if it matches the SQ lower bound
Experiment 4: Try GrokFast to see if amplifying slow gradients speeds up convergence

Files

Experiment script: src/sparse_parity/experiments/exp1_fix_hyperparams.py
Results JSON: results/exp1_20260303_221628/results.json