
Experiment exp_perlayer_batch: Per-layer + batching combined

Date: 2026-03-04
Status: SUCCESS
Answers: Open Question #4, "Does per-layer + batching combine?"

Hypothesis

If we combine per-layer forward-backward with mini-batch SGD (batch=32), we'll get the convergence reliability of batching with the ARD benefit of per-layer, without harming convergence speed.
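The mechanism under test can be sketched as follows. This is a minimal illustration under assumptions (2-layer ReLU MLP, squared-error loss, hypothetical names), not the actual implementation in `exp_perlayer_batch.py`: update W1 first, then re-forward with the updated W1 before computing W2's gradient.

```python
import numpy as np

rng = np.random.default_rng(42)

# Dimensions matching the config below: 20 input bits, 200 hidden units, batch=32.
n_bits, hidden, batch = 20, 200, 32
lr, wd = 0.1, 0.01
W1 = rng.normal(0.0, 0.1, (hidden, n_bits))
W2 = rng.normal(0.0, 0.1, (1, hidden))

def relu(x):
    return np.maximum(x, 0.0)

def perlayer_step(x, y):
    """One per-layer forward-backward step on a mini-batch.

    W1 is updated from gradients computed with the old weights; the
    forward pass is then re-run with the updated W1 before W2's
    gradient is computed (the source of the extra wall-time cost)."""
    global W1, W2
    # Forward + backward for W1 with the old weights.
    h = relu(W1 @ x.T)                       # (hidden, batch)
    err = (W2 @ h).ravel() - y               # squared-error gradient, (batch,)
    dh = (W2.T @ err[None, :]) * (h > 0)     # backprop through ReLU
    W1 -= lr * ((dh @ x) / len(y) + wd * W1)
    # Re-forward with the updated W1, then update W2.
    h = relu(W1 @ x.T)
    err = (W2 @ h).ravel() - y
    W2 -= lr * ((err[None, :] @ h.T) / len(y) + wd * W2)

x = rng.integers(0, 2, (batch, n_bits)).astype(float)
y = x[:, :3].sum(axis=1) % 2                 # 3-sparse parity label (illustrative)
perlayer_step(x, y)
```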

Config

| Parameter  | Value |
| ---------- | ----- |
| n_bits     | 20 |
| k_sparse   | 3 |
| hidden     | 200 |
| lr         | 0.1 |
| wd         | 0.01 |
| batch_size | 1 (single) / 32 (batch) |
| max_epochs | 200 |
| n_train    | 1000 |
| seeds      | 42, 43, 44, 45, 46 |
| method     | {standard, perlayer} x {single, batch} |
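For reference, the task this config describes is k-sparse parity. A hypothetical generator (the real one lives in the experiment module; the choice of which k_sparse bits are relevant is an assumption here):

```python
import numpy as np

def make_sparse_parity(n_train=1000, n_bits=20, k_sparse=3, seed=42):
    """Generate a k-sparse parity dataset matching the config above.

    The label is the parity (XOR) of the first k_sparse bits; the
    actual experiment may select the relevant bits differently."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, (n_train, n_bits))
    y = X[:, :k_sparse].sum(axis=1) % 2
    return X.astype(np.float32), y.astype(np.float32)

X, y = make_sparse_parity()
```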

Results

| Metric | Value |
| ------ | ----- |
| All variants solve? | Yes, 5/5 seeds each |
| Best convergence speed | single-sample (5.2 epochs avg) |
| Best wall time | standard+single (0.058s) |
| Per-layer + batch epoch count | 40.6 avg (vs 41.4 standard+batch) |
| Per-layer + batch wall time | 0.665s (3.7x slower than standard+batch due to re-forward) |

Summary Table

| Variant | Solved | Avg Epochs | Avg Wall Time |
| ------- | ------ | ---------- | ------------- |
| standard+single | 5/5 | 5.2  | 0.058s |
| standard+batch  | 5/5 | 41.4 | 0.177s |
| perlayer+single | 5/5 | 5.2  | 0.221s |
| perlayer+batch  | 5/5 | 40.6 | 0.665s |

Analysis

What worked

  • Per-layer + batch converges reliably (5/5 seeds, 100% test accuracy)
  • Per-layer + batch needs slightly fewer epochs than standard+batch (40.6 vs 41.4), a ~2% improvement
  • Per-layer does not hurt convergence in either single or batch mode
  • Single-sample variants converge in far fewer epochs than batched ones (5.2 vs ~41)

What didn't work

  • Per-layer + batch is 3.7x slower in wall time than standard+batch (0.665s vs 0.177s) because it requires a re-forward pass through layer 2 after updating W1
  • Batching always needs ~8x more epochs than single-sample for this problem size
  • The per-layer epoch-count advantage over standard is small: ~2% with batching, and zero for single-sample (identical 5.2 epochs)

Surprise

  • Single-sample SGD is much better than batch for this problem: 5.2 vs 41 epochs. Single-sample SGD's frequent updates find the phase transition faster. Batching's value (from exp1) was about training stability, not speed.
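One way to read this surprise is in update counts rather than epochs. Simple arithmetic from the numbers reported above (updates per epoch = n_train / batch_size):

```python
n_train = 1000

# Updates needed to solve = (updates per epoch) * (avg epochs to solve).
single_updates = (n_train / 1) * 5.2    # single-sample: 5200 updates
batch_updates = (n_train / 32) * 41.4   # batch=32: ~1294 updates
```

So batching needs ~8x more epochs but ~4x fewer individual gradient updates: each batched update makes more progress, there are just far fewer of them per epoch.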

Open Questions (for next experiment)

  • Does per-layer + batch have better ARD than standard+batch? The re-forward pass adds compute but keeps W1 in cache during its update-then-reuse cycle. Need cache-simulation MemTracker to measure this properly.
  • Can we reduce the epoch gap between single-sample and batch by using smaller batches (e.g., batch=4 or batch=8)?
  • At larger scale (n=30+), does the per-layer batch advantage grow?
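For the batch-size question, a sweep grid could be generated like this. The `train_mlp` entry point is a hypothetical stand-in; only the config grid itself is shown here:

```python
from itertools import product

seeds = [42, 43, 44, 45, 46]
batch_sizes = [1, 4, 8, 32]  # includes the intermediate sizes proposed above

# One config dict per (batch_size, seed) run, reusing the experiment's hyperparameters.
configs = [
    dict(batch_size=bs, seed=s, n_bits=20, k_sparse=3,
         hidden=200, lr=0.1, wd=0.01, max_epochs=200, n_train=1000)
    for bs, s in product(batch_sizes, seeds)
]
# 4 batch sizes x 5 seeds = 20 runs
```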

Files

  • Experiment: src/sparse_parity/experiments/exp_perlayer_batch.py
  • Results: results/exp_perlayer_batch/results.json