Experiment B: Mini-Batch ARD vs Single-Sample ARD¶

Question¶

How much does mini-batch SGD improve energy efficiency (as measured by ARD) compared to single-sample SGD?

Setup¶

Network: 20-bit input, 1000 hidden units (ReLU), 1 output (hinge loss)
Comparison: 32 consecutive single-sample SGD steps vs one batch-32 mini-batch step
Both process the same 32 samples from the same initial weights
MemTracker instruments every read/write at the buffer level

Result: ARD is the wrong metric for batch comparison¶

Batch-32 has 17x HIGHER weighted ARD than single-sample (547,881 vs 31,500).

The reason is instructive.

Why batch ARD is higher¶

The MemTracker uses a flat clock that advances by buffer size on every access. In the batch case:

Parameters (W1=20,000 floats, W2=1,000 floats) are read ONCE at batch start
Then 32 samples worth of per-sample activations/gradients are interleaved (~60K floats of clock advance per sample)
Parameters are read again only during gradient accumulation and the final update

This means the reuse distance for W1 in the batch case is ~1M floats (the entire batch of temporaries sits between reads), while in single-sample it's ~34K (just one sample's temporaries).

The MemTracker's ARD model penalizes holding parameters in cache across the whole batch, which is what makes batching efficient on real hardware with large caches.

What batch DOES improve: total parameter traffic¶

Metric	Single-Sample (32 steps)	Batch-32 (1 step)
Total floats accessed	2,455,931	2,132,040
W1 reads	49	34
W1 writes	32	2
Parameter writes (all)	128	8

Single-sample reads and writes W1 (20K floats) on every sample. Batch reads W1 once at start + once at update = 2 reads total for the forward pass + update. This is a 16x reduction in parameter write traffic and a large read reduction.

Batch size sweep (floats/sample)¶

Batch Size	Floats/Sample	Relative to BS=1
1	48,046	1.00x
4	85,560	1.78x
8	75,688	1.58x
16	64,123	1.33x
32	66,626	1.39x
64	62,907	1.31x

The floats/sample is higher for batches because gradient accumulators (acc_dW1 = 20K floats) must be read and written for every contributing sample. But this is purely accumulator overhead -- parameters themselves are accessed far fewer times.

Analysis: Two competing effects¶

Parameter reuse (good for energy): In batch mode, W1/W2/b1/b2 are loaded once and reused across 32 forward passes. On real hardware with sufficient cache, this eliminates 31/32 = 97% of parameter loads from DRAM.
Accumulator overhead (bad for ARD metric): Each sample's gradients must be accumulated into shared buffers (acc_dW1, acc_db1, etc.), adding read-modify-write cycles that inflate the clock and push parameter reuse distances up.

Per-sample temporaries also fragment locality: each sample creates its own h_pre_i, h_i, dh_pre_i etc. These unique buffers never get reused, adding to total traffic without benefiting from caching.

Conclusion¶

Batch size is a lever for energy, but ARD (as currently defined) does not capture it well.

The current MemTracker measures reuse distance in a flat address-time space. It correctly identifies that parameter buffers are accessed far apart in the batch case. But on real hardware, if the cache is large enough to hold W1 (20K floats = 80KB), the parameters stay resident and the distance is irrelevant: all cache hits.

Recommendations¶

Add a cache-simulation mode to MemTracker: Given a cache size C, count how many accesses are hits vs misses. For C >= 80KB (W1 fits), batch should show dramatic hit-rate improvement.
Track parameter traffic separately: The metric "parameter bytes loaded from DRAM per sample" would directly capture the batch benefit: ~48K/sample for BS=1 vs ~2K/sample for BS=32 (amortized).
Batch size is an energy lever: Even without fixing the metric, the raw numbers show 16x fewer parameter writes and fewer parameter reads. For a memory-bandwidth-bound system, this is a proportional energy saving.
Optimal batch size for this model: The floats/sample metric plateaus around BS=16-32. Larger batches see diminishing returns because the accumulator overhead scales linearly while the parameter savings are already near-maximum.

Files¶

Experiment: src/sparse_parity/experiments/exp_b_batch_ard.py
Results: results/exp_b_batch_ard/results.json