Experiment Curriculum: Curriculum Learning for Scaling Sparse Parity
Date: 2026-03-04
Status: SUCCESS
Answers: Open question #3, "Can curriculum learning help at scale?"
Hypothesis
If we train on small n first (where grokking is fast) and then expand W1 to larger n, the network will transfer its learned feature detector and solve the larger problem with far fewer total epochs than direct training.
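A minimal sketch of the W1 expansion step this hypothesis relies on (the helper name and shapes are illustrative, not the exact experiment code): trained rows are kept verbatim and new input rows start near zero so they do not disturb the learned solution.

```python
import numpy as np

def expand_W1(W1, n_new, rng, scale=1e-3):
    """Grow the input dimension of W1 from n_old to n_new.

    Existing rows (the trained feature detector) are copied unchanged;
    the new input rows are initialized near zero so they don't
    interfere with the learned solution.
    """
    n_old, hidden = W1.shape
    assert n_new >= n_old
    W1_new = rng.normal(0.0, scale, size=(n_new, hidden))
    W1_new[:n_old, :] = W1  # transfer trained weights verbatim
    return W1_new

rng = np.random.default_rng(42)
W1 = rng.normal(0.0, 0.1, size=(10, 200))   # trained at n=10
W1_big = expand_W1(W1, 20, rng)             # expanded to n=20
```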
Config
| Parameter  | Value                             |
|------------|-----------------------------------|
| hidden     | 200                               |
| lr         | 0.1                               |
| wd         | 0.01                              |
| batch_size | 32                                |
| n_train    | 1000                              |
| n_test     | 200                               |
| seed       | 42                                |
| method     | numpy mini-batch SGD + curriculum |
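For context, sparse parity data at these settings can be generated as follows (a sketch; the helper name is an assumption, and the secret bits are drawn randomly rather than taken from the experiment):

```python
import numpy as np

def make_sparse_parity(n, secret, n_samples, rng):
    """Label each n-bit input with the XOR (parity) of its k secret bits."""
    X = rng.integers(0, 2, size=(n_samples, n))
    y = X[:, secret].sum(axis=1) % 2
    return X.astype(np.float64), y.astype(np.float64)

rng = np.random.default_rng(42)
secret = rng.choice(20, size=3, replace=False)   # the k=3 secret bits
X_train, y_train = make_sparse_parity(20, secret, 1000, rng)
X_test, y_test = make_sparse_parity(20, secret, 200, rng)
```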
Results
Test 1: n-curriculum [10 -> 20], k=3
| Method            | Best Acc | Total Epochs      | Wall Time   |
|-------------------|----------|-------------------|-------------|
| Direct n=20       | 98.0%    | 33                | 0.11s       |
| Curriculum 10->20 | 100.0%   | 19                | 0.07s       |
| Speedup           | —        | 1.7x fewer epochs | 1.6x faster |
Test 2: n-curriculum [10 -> 30 -> 50], k=3
| Method                | Best Acc | Total Epochs       | Wall Time    |
|-----------------------|----------|--------------------|--------------|
| Direct n=50           | 95.5%    | 292                | 1.03s        |
| Curriculum 10->30->50 | 98.5%    | 20                 | 0.06s        |
| Speedup               | —        | 14.6x fewer epochs | 17.2x faster |
Test 3: k-curriculum n=20, [k=2 -> k=3 -> k=5]
| Method               | Best Acc | Total Epochs      | Wall Time   |
|----------------------|----------|-------------------|-------------|
| Direct n=20/k=5      | 96.5%    | 232               | 0.64s       |
| Curriculum k=2->3->5 | 95.0%    | 157               | 0.46s       |
| Speedup              | —        | 1.5x fewer epochs | 1.4x faster |
Summary Table
| Method            | Target   | Acc    | Epochs | Time  | vs Direct |
|-------------------|----------|--------|--------|-------|-----------|
| Direct            | n=20/k=3 | 98.0%  | 33     | 0.11s | baseline  |
| n-curr 10->20     | n=20/k=3 | 100.0% | 19     | 0.07s | 1.7x      |
| Direct            | n=50/k=3 | 95.5%  | 292    | 1.03s | baseline  |
| n-curr 10->30->50 | n=50/k=3 | 98.5%  | 20     | 0.06s | 14.6x     |
| Direct            | n=20/k=5 | 96.5%  | 232    | 0.64s | baseline  |
| k-curr 2->3->5    | n=20/k=5 | 95.0%  | 157    | 0.46s | 1.5x      |
Analysis
What worked
- n-curriculum is effective: 10->30->50 solves n=50/k=3 in 20 total epochs vs 292 for direct training, a 14.6x improvement.
- Transfer is near-instant: After expanding W1 from n=10 to n=20, the network achieves 100% test accuracy in epoch 1. The trained feature detector on the 3 secret bits transfers because the new columns are initialized small and don't interfere.
- n=50 solved via curriculum: Direct training on n=50/k=3 previously failed at 54% in exp_d (200 epochs). Here with 500 epochs it reaches 95.5%. Curriculum solves it in 20 epochs total, bypassing the grokking plateau.
- k-curriculum also helps: 1.5x speedup for k=5, less dramatic than n-curriculum.
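The n-curriculum loop described above can be sketched end-to-end as follows. This is a simplified stand-in, not the experiment code: it uses full-batch gradient descent instead of mini-batch SGD, and the phase schedule, secret bits, and function names are illustrative.

```python
import numpy as np

def train_phase(W1, W2, X, y, epochs, lr=0.1, wd=0.01):
    """One curriculum phase: full-batch GD on a 1-hidden-layer tanh net (sketch)."""
    for _ in range(epochs):
        h = np.tanh(X @ W1)                      # hidden activations, (N, hidden)
        p = 1.0 / (1.0 + np.exp(-(h @ W2)))      # sigmoid probabilities, (N,)
        err = p - y                              # dL/dlogit for BCE loss
        gW2 = h.T @ err / len(y) + wd * W2
        dh = np.outer(err, W2) * (1.0 - h ** 2)  # backprop through tanh
        gW1 = X.T @ dh / len(y) + wd * W1
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2

rng = np.random.default_rng(42)
hidden = 200
schedule = [10, 30, 50]              # n-curriculum: grow the input dimension
secret = np.array([1, 5, 8])         # k=3 secret bits (all < min(schedule))
W1 = rng.normal(0.0, 0.1, (schedule[0], hidden))
W2 = rng.normal(0.0, 0.1, hidden)
for n in schedule:
    pad = n - W1.shape[0]
    if pad:  # expansion: new rows start near zero so the detector transfers
        W1 = np.vstack([W1, rng.normal(0.0, 1e-3, (pad, hidden))])
    X = rng.integers(0, 2, (1000, n)).astype(float)
    y = (X[:, secret].sum(axis=1) % 2).astype(float)
    W1, W2 = train_phase(W1, W2, X, y, epochs=20)
```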
What didn't work
- k-curriculum has limited transfer: Going from k=2 to k=5 doesn't transfer cleanly because the parity function changes structure (from 2-way to 5-way XOR). The network must still learn the new interaction pattern.
- The final k=5 phase still accounted for 140 of the 157 total epochs, i.e. most of the work. The k=2 and k=3 warmup helped but didn't short-circuit k=5 learning.
Surprise
The n-curriculum expansion is almost free. When going from n=10 to n=20 to n=50, each expansion phase solves in 1 epoch. The network has already learned "look at bits 1, 5, 8 and compute their product." Adding irrelevant input columns (initialized with near-zero weights) doesn't break this. The entire cost is the initial n=10 training (18 epochs), which is trivially fast.
n-curriculum neutralizes the n^k scaling wall for k=3. The hard part (finding the secret bits) is done at small n where it is cheap. Expansion is free.
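The "scaling wall" here is combinatorial: the number of candidate k-subsets of secret bits grows as C(n, k) ≈ n^k / k!, so disambiguating the secret at n=10 is two orders of magnitude cheaper than at n=50:

```python
from math import comb

# Candidate k=3 secret-bit subsets the network must implicitly rule out.
print(comb(10, 3))   # 120 subsets at n=10
print(comb(50, 3))   # 19600 subsets at n=50
```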
Open Questions (for next experiment)
- Can n-curriculum solve n=100 or n=200 with k=3? The transfer should still work since we're just adding more noise columns.
- Does n-curriculum help when k is large (k=5, k=7)? The initial n=10/k=5 training itself might be slow.
- Can we combine n-curriculum + k-curriculum? Start with n=10/k=2, scale up both.
- What is the ARD profile of curriculum training? Fewer epochs means fewer total memory accesses, but each expansion phase re-reads W1 at the new size.
Files
- Experiment: src/sparse_parity/experiments/exp_curriculum.py
- Results: results/exp_curriculum/results.json