Skip to content

Experiment Curriculum: Curriculum Learning for Scaling Sparse Parity

Date: 2026-03-04 Status: SUCCESS Answers: Open question #3, "Can curriculum learning help at scale?"

Hypothesis

If we train on small n first (where grokking is fast) and then expand W1 to larger n, the network will transfer its learned feature detector and solve the larger problem with far fewer total epochs than direct training.

Config

Parameter Value
hidden 200
lr 0.1
wd 0.01
batch_size 32
n_train 1000
n_test 200
seed 42
method numpy mini-batch SGD + curriculum

Results

Test 1: n-curriculum [10 -> 20], k=3

Method Best Acc Total Epochs Wall Time
Direct n=20 98.0% 33 0.11s
Curriculum 10->20 100.0% 19 0.07s
Speedup 1.7x fewer epochs 1.6x faster

Test 2: n-curriculum [10 -> 30 -> 50], k=3

Method Best Acc Total Epochs Wall Time
Direct n=50 95.5% 292 1.03s
Curriculum 10->30->50 98.5% 20 0.06s
Speedup 14.6x fewer epochs 17.2x faster

Test 3: k-curriculum n=20, [k=2 -> k=3 -> k=5]

Method Best Acc Total Epochs Wall Time
Direct n=20/k=5 96.5% 232 0.64s
Curriculum k=2->3->5 95.0% 157 0.46s
Speedup 1.5x fewer epochs 1.4x faster

Summary Table

Method Target Acc Epochs Time vs Direct
Direct n=20/k=3 98.0% 33 0.11s baseline
n-curr 10->20 n=20/k=3 100.0% 19 0.07s 1.7x
Direct n=50/k=3 95.5% 292 1.03s baseline
n-curr 10->30->50 n=50/k=3 98.5% 20 0.06s 14.6x
Direct n=20/k=5 96.5% 232 0.64s baseline
k-curr 2->3->5 n=20/k=5 95.0% 157 0.46s 1.5x

Analysis

What worked

  • n-curriculum is effective: 10->30->50 solves n=50/k=3 in 20 total epochs vs 292 for direct training, a 14.6x improvement.
  • Transfer is near-instant: After expanding W1 from n=10 to n=20, the network achieves 100% test accuracy in epoch 1. The trained feature detector on the 3 secret bits transfers because the new columns are initialized small and don't interfere.
  • n=50 solved via curriculum: Direct training on n=50/k=3 previously failed at 54% in exp_d (200 epochs). Here with 500 epochs it reaches 95.5%. Curriculum solves it in 20 epochs total, bypassing the grokking plateau.
  • k-curriculum also helps: 1.5x speedup for k=5, less dramatic than n-curriculum.

What didn't work

  • k-curriculum has limited transfer: Going from k=2 to k=5 doesn't transfer cleanly because the parity function changes structure (from 2-way to 5-way XOR). The network must still learn the new interaction pattern.
  • k-curriculum for k=5 still took 140 epochs in the final phase, most of the work. The k=2 and k=3 warmup helped but didn't short-circuit k=5 learning.

Surprise

The n-curriculum expansion is almost free. When going from n=10 to n=20 to n=50, each expansion phase solves in 1 epoch. The network has already learned "look at bits 1, 5, 8 and compute their product." Adding irrelevant input columns (initialized with near-zero weights) doesn't break this. The entire cost is the initial n=10 training (18 epochs), which is trivially fast.

n-curriculum neutralizes the n^k scaling wall for k=3. The hard part (finding the secret bits) is done at small n where it is cheap. Expansion is free.

Open Questions (for next experiment)

  • Can n-curriculum solve n=100 or n=200 with k=3? The transfer should still work since we're just adding more noise columns.
  • Does n-curriculum help when k is large (k=5, k=7)? The initial n=10/k=5 training itself might be slow.
  • Can we combine n-curriculum + k-curriculum? Start with n=10/k=2, scale up both.
  • What is the ARD profile of curriculum training? Fewer epochs means fewer total memory accesses, but each expansion phase re-reads W1 at the new size.

Files

  • Experiment: src/sparse_parity/experiments/exp_curriculum.py
  • Results: results/exp_curriculum/results.json