Experiment exp_tiled_w1: Tiled W1 Updates for Sparse Parity
Date: 2026-03-06
Status: FAILED (ARD increased instead of decreased)
Hypothesis
W1 (input-to-hidden weights) accounts for 75% of all float reads and therefore dominates ARD. W1 is 20x1000 = 20,000 floats = 80KB, which exceeds the 64KB L1 cache. Splitting W1 into tiles along the hidden dimension (e.g., T=250 -> 5,000 floats = about 20KB per tile) and running each tile's forward/backward/update before moving to the next should keep each tile in L1, reducing reuse distance.
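The size arithmetic behind the hypothesis can be checked with a minimal sketch. It assumes 4-byte (float32) storage and 1 KB = 1024 bytes; the helper names `w1_tile_kib` and `tile_report` are hypothetical and do not come from the experiment code.

```python
BYTES_PER_FLOAT = 4        # assumption: W1 is stored as float32
L1_BYTES = 64 * 1024       # L1 size quoted in the hypothesis


def w1_tile_kib(n_bits: int, tile_width: int) -> float:
    """Size in KiB of a W1 slice covering `tile_width` hidden units."""
    return n_bits * tile_width * BYTES_PER_FLOAT / 1024


def tile_report(n_bits: int, hidden: int, tile_sizes):
    """Return (tile_width, n_tiles, size_kib, fits_in_l1) for each tile width."""
    rows = []
    for t in tile_sizes:
        kib = w1_tile_kib(n_bits, t)
        rows.append((t, hidden // t, round(kib, 1), kib * 1024 <= L1_BYTES))
    return rows


# The last entry (tile width == hidden) is the untiled W1.
for row in tile_report(20, 1000, [50, 100, 250, 500, 1000]):
    print(row)
```

Under these assumptions, every tile width in the config fits in 64KB, while the untiled W1 (about 78 KiB) does not.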
Config
| Parameter | Value |
|---|---|
| n_bits | 20 |
| k_sparse | 3 |
| hidden (ARD) | 1000 |
| hidden (accuracy) | 500 |
| tile_sizes | 50, 100, 250, 500 |
| lr | 0.1 |
| wd | 0.01 |
| n_train | 500 |
| n_test | 200 |
| seeds | 42, 43, 44 |
Results
ARD Measurement (hidden=1000, 1 step)
| Method | Avg ARD | ARD Change | Tile KB | N Tiles |
|---|---|---|---|---|
| Baseline | 30,759 | --- | N/A | 1 |
| Tiled T=50 | 34,492 | +12.1% | 3.9 | 20 |
| Tiled T=100 | 34,236 | +11.3% | 7.8 | 10 |
| Tiled T=250 | 33,690 | +9.5% | 19.5 | 4 |
| Tiled T=500 | 32,853 | +6.8% | 39.1 | 2 |
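The "ARD Change" column can be recomputed from the "Avg ARD" column. The short check below uses only the values in the table; the variable names are illustrative.

```python
BASELINE_ARD = 30_759
TILED_ARD = {50: 34_492, 100: 34_236, 250: 33_690, 500: 32_853}  # tile size -> Avg ARD


def pct_change(value: float, reference: float) -> float:
    """Relative change of `value` with respect to `reference`, in percent."""
    return (value - reference) / reference * 100


changes = {t: round(pct_change(ard, BASELINE_ARD), 1) for t, ard in TILED_ARD.items()}
print(changes)

# Check that ARD falls strictly as the tile grows toward the full hidden width.
ards_by_tile = [TILED_ARD[t] for t in sorted(TILED_ARD)]
monotone = all(a > b for a, b in zip(ards_by_tile, ards_by_tile[1:]))
print(monotone)
```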
Accuracy Verification (hidden=500, max_epochs=50)
| Method | Avg Acc | Avg Epochs |
|---|---|---|
| Baseline | 99.8% | 23 |
| Tiled T=50 | 99.8% | 23 |
| Tiled T=100 | 99.8% | 23 |
| Tiled T=250 | 99.8% | 23 |
| Tiled T=500 | 99.8% | 23 |
Analysis
What happened
Tiling W1 increased ARD by 6.8%-12.1% instead of decreasing it. The core problem is structural:
- Forward pass cannot be fully tiled: The output layer needs the full `h` vector (all hidden units), so every tile must finish its forward pass before the output layer can run. Tile 0's W1 slice is read in forward, then the remaining tiles (tiles 1-19 at T=50) run their forward passes, then the output layer runs, and only then does backward start. The W1_tile0 forward-to-backward distance is longer than in baseline because the tiled forward is no more compact: it is the same total computation, just reorganized.
- Additional `x` reads: In standard backprop, `x` is read once in forward and once in backward (2 reads total). In tiled mode, `x` is read once per tile in forward and once per tile in backward (2*N_tiles reads). The extra reads add to the total floats accessed and increase distances for other buffers.
- The MemTracker measures software-level reuse distance: It counts the float accesses that intervene between the write and the read of a buffer. Tiling does not reduce the amount of computation between W1_tile_k's forward read and its backward read, because the output layer's forward and backward passes still intervene. Tiling can only help at the hardware cache level (L1/L2), which the MemTracker does not simulate.
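The schedule described in these three points can be replayed with a toy reuse-distance counter. This is a minimal sketch, not the project's MemTracker: the `ToyTracker` class, the `one_step` schedule, and the distance definition (floats accessed since the buffer was last touched) are all assumptions made for illustration, so its absolute numbers are not comparable to the ARD values reported in the Results tables.

```python
from collections import Counter


class ToyTracker:
    """Counts floats touched between consecutive accesses to a named buffer."""

    def __init__(self):
        self.clock = 0              # total floats accessed so far
        self.last_end = {}          # buffer name -> clock value after its last access
        self.read_dist = {}         # buffer name -> reuse distances observed on reads
        self.read_count = Counter() # buffer name -> number of read operations

    def read(self, name, size):
        self.read_count[name] += 1
        if name in self.last_end:
            self.read_dist.setdefault(name, []).append(self.clock - self.last_end[name])
        self._advance(name, size)

    def write(self, name, size):
        self._advance(name, size)

    def _advance(self, name, size):
        self.clock += size
        self.last_end[name] = self.clock


def one_step(n_bits, hidden, tile):
    """Replay one forward/backward/update step with W1 split into hidden // tile slices."""
    t = ToyTracker()
    n_tiles = hidden // tile
    for k in range(n_tiles):            # hidden-layer forward, tile by tile
        t.read("x", n_bits)
        t.read(f"W1[{k}]", n_bits * tile)
        t.write(f"h[{k}]", tile)
    for k in range(n_tiles):            # output layer needs every slice of h
        t.read(f"h[{k}]", tile)
    t.read("W2", hidden)
    t.write("out", 1)
    t.read("out", 1)                    # output-layer backward
    for k in range(n_tiles):
        t.read(f"h[{k}]", tile)
    t.read("W2", hidden)
    for k in range(n_tiles):
        t.write(f"dh[{k}]", tile)
    for k in range(n_tiles):            # hidden-layer backward and W1 update
        t.read(f"dh[{k}]", tile)
        t.read("x", n_bits)
        t.read(f"W1[{k}]", n_bits * tile)
        t.write(f"W1[{k}]", n_bits * tile)
    return t


baseline = one_step(20, 1000, tile=1000)   # a single tile is the untiled schedule
tiled = one_step(20, 1000, tile=250)
print(baseline.read_count["x"], tiled.read_count["x"])
print(baseline.read_dist["W1[0]"], tiled.read_dist["W1[0]"])
```

In this toy schedule, `x` is read 2 times in the baseline and 8 times with four tiles, and the forward-to-backward distance for the first W1 slice grows from 7,022 to 21,332 floats, because the other tiles' forward passes now sit between its two reads.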
What did work
- Accuracy is perfectly maintained: Tiling does not change the mathematics of backprop. All tile sizes match the baseline's accuracy (99.8%) and convergence speed (23 epochs on average).
- Smaller tiles trend toward higher ARD: More tiles mean more overhead from repeated `x` reads and more buffer fragmentation.
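The claim that tiling leaves the mathematics unchanged can be illustrated in pure Python. This is a sketch under stated assumptions: W1 is stored with one row per hidden unit (the transpose of the 20x1000 shape quoted in the hypothesis), inputs are encoded as ±1, weight decay is omitted, and all function names are hypothetical.

```python
import random


def hidden_preact(W1_rows, x):
    """Pre-activation of each hidden unit: one dot product per row of W1."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W1_rows]


def hidden_preact_tiled(W1_rows, x, tile):
    """Same computation, processed `tile` hidden units at a time."""
    out = []
    for start in range(0, len(W1_rows), tile):
        out.extend(hidden_preact(W1_rows[start:start + tile], x))
    return out


def sgd_update(W1_rows, x, dh, lr):
    """Plain SGD step on W1, using the outer-product gradient dh[j] * x[i]."""
    return [[w - lr * g * xi for w, xi in zip(row, x)] for row, g in zip(W1_rows, dh)]


def sgd_update_tiled(W1_rows, x, dh, lr, tile):
    """Same update, applied one tile of hidden units at a time."""
    out = []
    for start in range(0, len(W1_rows), tile):
        stop = start + tile
        out.extend(sgd_update(W1_rows[start:stop], x, dh[start:stop], lr))
    return out


rng = random.Random(42)
n_bits, hidden = 20, 500
x = [rng.choice((-1.0, 1.0)) for _ in range(n_bits)]
W1 = [[rng.gauss(0, 0.1) for _ in range(n_bits)] for _ in range(hidden)]
dh = [rng.gauss(0, 1) for _ in range(hidden)]   # stand-in for the backpropagated gradient

tile_sizes = (50, 100, 250, 500)
same_forward = all(hidden_preact_tiled(W1, x, t) == hidden_preact(W1, x) for t in tile_sizes)
same_update = all(sgd_update_tiled(W1, x, dh, 0.1, t) == sgd_update(W1, x, dh, 0.1)
                  for t in tile_sizes)
print(same_forward, same_update)
```

Each hidden unit's dot product and weight update depends only on its own row of W1, so grouping rows into tiles changes the order of work across units but not any individual result.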
Key insight
The MemTracker ARD metric measures software-level reuse distance (intervening float accesses between producer and consumer). Tiling is a hardware-level cache optimization that works by ensuring a W1 tile fits in L1 so it remains cached between forward and backward. These are fundamentally different metrics:
- Software ARD: determined by the total computation between forward and backward. Tiling cannot reduce this.
- Hardware cache hit rate: determined by whether a buffer fits in L1/L2. Tiling helps here because a 20KB tile fits in 64KB L1, while the full 80KB W1 does not.
To measure the hardware benefit, one would need the CacheTracker (LRU cache simulation) rather than the MemTracker.
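The "fits in L1 or not" distinction can be seen in a toy LRU simulation. This is a minimal sketch and not the project's CacheTracker: it assumes a fully associative cache with one float per entry and 4-byte floats, and it only models reading the same buffer twice back to back. It does not model the other work that sits between forward and backward in the experiment's schedule. The functions `lru_hits` and `two_pass_hit_rate` are hypothetical.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Number of hits when the address `trace` runs through an LRU cache of `capacity` entries."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[addr] = True
    return hits


def two_pass_hit_rate(buffer_floats, capacity_floats):
    """Hit rate of the second of two sequential passes over one buffer."""
    trace = list(range(buffer_floats)) * 2
    # The first pass starts from an empty cache, so every hit belongs to pass two.
    return lru_hits(trace, capacity_floats) / buffer_floats


L1_FLOATS = 64 * 1024 // 4                 # 64KB of float32 entries (assumption)
print(two_pass_hit_rate(5_000, L1_FLOATS))   # one T=250 tile of W1
print(two_pass_hit_rate(20_000, L1_FLOATS))  # the full W1
```

Under these assumptions, the second pass over a 5,000-float tile hits every time, while the second pass over the full 20,000-float W1 never hits, because a sequential scan larger than the cache evicts each entry before it is reused.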
Open Questions
- Would CacheTracker (from `cache_tracker.py`) show the expected L1 hit rate improvement for tiled W1?
- Can we design a "fully tiled" approach where each tile's backward runs immediately after its forward, before processing the next tile? This would require approximating the output layer or using a different loss formulation.
- Is there a tiling strategy that also tiles the output layer (W2) to keep h_tile in cache?
Files
- Experiment: `src/sparse_parity/experiments/exp_tiled_w1.py`
- Results: `results/exp_tiled_w1/results.json`