
Experiment Cache-ARD: Cache-Aware Memory Tracking

Date: 2026-03-04
Status: SUCCESS
Answers: Open question #2, "What does ARD look like with a cache model?"

Hypothesis

If we add LRU cache simulation to MemTracker, then batch-32 will show dramatically higher hit rates than single-sample, because parameters stay resident in cache across the batch.

Config

| Parameter | Value |
|---|---|
| n_bits | 20 |
| k_sparse | 3 |
| hidden | 200 and 1000 |
| lr | 0.1 |
| wd | 0.01 |
| batch_size | 32 |
| max_epochs | 1 (single step) |
| n_train | 500 |
| seed | 42 |
| method | standard (single-sample vs batch) |

Results

| Metric | Value |
|---|---|
| Best test accuracy | N/A (single step, not training to convergence) |
| Epochs to >90% | N/A |
| Wall time | <1s per comparison |
| CacheTracker implemented | YES |
| Finding | L2 cache eliminates ALL misses for both methods |

Summary Table

| Hidden | Cache | W1 fits? | Method | Hit Rate | Eff. ARD | Misses | Total Floats |
|---|---|---|---|---|---|---|---|
| 200 | L1 (32KB) | YES | single-sample | 100% | 0 | 0 | 492,731 |
| 200 | L1 (32KB) | YES | batch-32 | 73% | 112,187 | 140 | 364,278 |
| 200 | L2 (256KB) | YES | single-sample | 100% | 0 | 0 | 492,731 |
| 200 | L2 (256KB) | YES | batch-32 | 100% | 0 | 0 | 364,278 |
| 1000 | L1 (32KB) | NO | single-sample | 91% | 33,848 | 49 | 2,455,931 |
| 1000 | L1 (32KB) | NO | batch-32 | 69% | 612,424 | 188 | 2,132,040 |
| 1000 | L2 (256KB) | YES | single-sample | 100% | 0 | 0 | 2,455,931 |
| 1000 | L2 (256KB) | YES | batch-32 | 100% | 0 | 0 | 2,132,040 |

Analysis

What worked

  • CacheTracker correctly extends MemTracker with LRU simulation
  • At L2 (256KB), both methods achieve 100% hit rate. The entire working set fits, confirming exp_b's intuition
  • Batch-32 accesses 13% fewer total floats than 32 single-sample steps (2.13M vs 2.46M for hidden=1000), confirming exp_b's parameter-traffic reduction
  • The cache model gives a binary answer: if your cache fits W1, reuse distance is irrelevant
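The LRU simulation underlying CacheTracker can be sketched with an `OrderedDict`. This is a hypothetical minimal version, not the actual `src/sparse_parity/cache_tracker.py` API: the class name, float-granularity accesses, and capacity-in-floats convention are all assumptions.

```python
from collections import OrderedDict

class LRUCacheSim:
    """Minimal LRU cache simulator (hypothetical sketch).
    Capacity is counted in floats; each access touches one float address."""

    def __init__(self, capacity_floats):
        self.capacity = capacity_floats
        self.cache = OrderedDict()  # address -> None, ordered by recency
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        if addr in self.cache:
            self.hits += 1
            self.cache.move_to_end(addr)        # mark most-recently-used
        else:
            self.misses += 1
            self.cache[addr] = None
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least-recently-used

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Tiny demo: a 4-float cache with an access pattern that exceeds capacity
sim = LRUCacheSim(4)
for addr in [0, 1, 2, 3, 0, 1, 4, 0]:
    sim.access(addr)
print(sim.hits, sim.misses)  # 3 5
```

The same `access` loop, fed the trace of parameter and temporary addresses a training step produces, yields the hit rates in the summary table.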

What didn't work

  • Batch does NOT have better L1 cache behavior than single-sample; it is worse:
      • hidden=200, L1: batch 73% vs single-sample 100%
      • hidden=1000, L1: batch 69% vs single-sample 91%
  • The per-sample temporaries in batch (h_pre_0...h_pre_31, h_0...h_31, dh_0, etc.) thrash the L1 cache, evicting parameters that single-sample keeps resident

Surprise

Single-sample is more cache-friendly than batch at L1. Each single-sample step has a small working set (~hidden*3 + n_bits floats for temporaries) that fits alongside W1 in L1. Batch creates 32 sets of these temporaries, blowing out the cache. The assumption "batch reuses parameters" only holds when the cache is large enough to hold both parameters and all per-sample temporaries.
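The blow-out follows directly from the numbers in the text. Assuming 4-byte floats and the ~hidden*3 + n_bits per-sample temporary count quoted above (both assumptions of this sketch), a back-of-envelope tally for hidden=200 shows single-sample fitting in a 32KB L1 while batch-32 does not:

```python
# Back-of-envelope working-set check (assumes 4-byte floats and the
# "hidden*3 + n_bits" per-sample temporary count from the text above).
L1_FLOATS = 32 * 1024 // 4  # 32KB L1 -> 8192 floats

def working_set(hidden, n_bits, batch):
    w1 = n_bits * hidden                    # first-layer weights
    temps = batch * (hidden * 3 + n_bits)   # per-sample temporaries
    return w1 + temps

for batch in (1, 32):
    ws = working_set(hidden=200, n_bits=20, batch=batch)
    print(batch, ws, ws <= L1_FLOATS)
# batch=1:  4,620 floats  -> fits alongside W1 in L1
# batch=32: 23,840 floats -> ~3x L1 capacity
```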

The real batch advantage is total traffic, not locality. Batch-32 does 16x fewer parameter writes (from exp_b) and 13% fewer total float accesses. On a memory-bandwidth-bound system, this is the actual energy saving, not cache hit rate.
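A quick sanity check of the 13% figure from the summary table (hidden=1000):

```python
# Total float accesses for hidden=1000, from the summary table
single = 2_455_931  # 32 single-sample steps
batch = 2_132_040   # one batch-32 step
reduction = 1 - batch / single
print(f"{reduction:.1%}")  # ~13% fewer total float accesses
```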

Open Questions (for next experiment)

  • Can we get the best of both worlds? Tiled batching: process sub-batches of 4-8 that fit in L1 with their temporaries, accumulate gradients, then do one parameter update
  • What's the L1-optimal batch size? Somewhere between 1 and 32 there's a sweet spot where temporaries still fit alongside W1
  • Does per-layer + batching change the cache picture? Per-layer processes one layer at a time, potentially fitting more in L1
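The tiled-batching idea in the first question can be sketched as below. This is a hypothetical outline, not an implemented experiment: `tiled_batch_step` and `grad_fn` are invented names, and averaging sub-batch gradients is an assumption about how they would be combined.

```python
def tiled_batch_step(w, samples, grad_fn, tile=8, lr=0.1):
    """Hypothetical sketch of tiled batching: process sub-batches small
    enough that W1 plus one tile's temporaries fit in L1, accumulate the
    gradient, then apply ONE parameter update at the end (keeping batch's
    parameter-write saving while shrinking the resident working set).
    grad_fn(w, tile_samples) -> per-parameter gradient list (assumed)."""
    acc = [0.0] * len(w)
    n_tiles = 0
    for start in range(0, len(samples), tile):
        g = grad_fn(w, samples[start:start + tile])
        acc = [a + gi for a, gi in zip(acc, g)]
        n_tiles += 1
    # Single update: parameters are written once, as in full batch-32
    return [wi - lr * ai / n_tiles for wi, ai in zip(w, acc)]

# Toy check: constant gradient of 1.0 per parameter, 32 samples in 4 tiles
w = [0.0, 0.0]
step = tiled_batch_step(w, list(range(32)), lambda w, s: [1.0, 1.0], tile=8)
print(step)  # each parameter moves by -lr * 1.0 = -0.1
```

If the cache picture above holds, this would combine batch's traffic reduction with single-sample's L1 residency; the sweet-spot tile size is exactly the second open question.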

Files

  • CacheTracker: src/sparse_parity/cache_tracker.py
  • Experiment: src/sparse_parity/experiments/exp_cache_ard.py
  • Results: results/exp_cache_ard/results.json