# Experiment: GPU vs CPU for Sparse Parity Methods

Date: 2026-03-14 · Status: FINDING · Issue: #6
## Question
Does running sparse parity methods on GPU (PyTorch CUDA) produce useful energy or performance data? Sub-question of "Is ARD or DMC the better energy proxy?"
## What was performed
Reimplemented three sparse parity methods (GF(2), SGD, KM) in PyTorch CUDA. Ran them on an NVIDIA L4 via Modal Labs, matching the numpy harness config: n=20, k=3, hidden=200, lr=0.1, wd=0.01, batch=32, hinge loss, seed=42.
Ran 5 times to measure variance. Compared against CPU numpy baselines from bin/reproduce-all.
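The variance measurement amounts to a small timing loop. A minimal sketch (hypothetical names; the actual harness is `bin/gpu_energy.py`, and on GPU each stop timestamp would additionally need a `torch.cuda.synchronize()` so queued kernels are included):

```python
import statistics
import time

def time_method(fn, runs=5):
    """Call fn `runs` times and report per-run wall-clock time in ms.
    On GPU, torch.cuda.synchronize() would be needed before each stop
    timestamp so queued kernels are drained before measuring."""
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1e3)
    return {"mean": statistics.mean(times_ms),
            "std": statistics.stdev(times_ms),
            "runs": times_ms}

# Stand-in workload; the real harness times the GF(2), SGD, and KM solves
stats = time_method(lambda: sum(i * i for i in range(10_000)))
print(f"{stats['mean']:.2f}ms +/- {stats['std']:.2f}ms over {len(stats['runs'])} runs")
```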
## What was produced

### GPU times (5 runs on NVIDIA L4 via Modal)
| Run | GF(2) | SGD (37 epochs) | KM |
|---|---|---|---|
| 1 | 1.7ms | 1014ms | 663ms |
| 2 | 2.1ms | 1676ms | 1016ms |
| 3 | 2.0ms | 1367ms | 844ms |
| 4 | 2.3ms | 1603ms | 899ms |
| 5 | 2.0ms | 1571ms | 921ms |
| Mean | 2.0ms | 1446ms | 869ms |
| Std | 0.2ms | 254ms | 127ms |
All 5 runs reached 100% accuracy; SGD converged in 37 epochs every time.
### GPU vs CPU comparison
| Method | CPU (numpy) | GPU mean | GPU/CPU ratio |
|---|---|---|---|
| GF(2) | 0.5ms | 2.0ms | 4x slower |
| SGD | 142ms | 1446ms | 10x slower |
| KM | 1.1ms | 869ms | 790x slower |
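As a sanity check, the GPU means and GPU/CPU ratios can be recomputed from the per-run values above (the std row is skipped here: recomputing it from the rounded per-run times does not reproduce the table exactly, so it was presumably computed from unrounded timings):

```python
import statistics

# Per-run GPU times in ms, copied from the table above
gpu_ms = {
    "GF(2)": [1.7, 2.1, 2.0, 2.3, 2.0],
    "SGD": [1014, 1676, 1367, 1603, 1571],
    "KM": [663, 1016, 844, 899, 921],
}
# CPU (numpy) baselines in ms, from the comparison table
cpu_ms = {"GF(2)": 0.5, "SGD": 142, "KM": 1.1}

for name, runs in gpu_ms.items():
    mean = statistics.mean(runs)
    ratio = mean / cpu_ms[name]
    print(f"{name}: mean={mean:.1f}ms, GPU/CPU={ratio:.0f}x")
```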
## Cost
$0.002-0.003 per run. Total for 5 runs: ~$0.012.
## Can it be reproduced?

```shell
# GPU (requires Modal account)
pip install modal
modal token set
modal run bin/gpu_energy.py

# CPU baseline
PYTHONPATH=src python3 bin/reproduce-all
```
## Finding

GPU is 4-790x slower than CPU for sparse parity at n=20/k=3, depending on the method, and consistently so across all 5 runs.
- GF(2) is sequential row reduction (XOR): each pivot step depends on the previous one, so it can't be parallelized and effectively runs on CPU even inside a GPU container. The 4x overhead comes from container/PyTorch setup.
- SGD runs the same 37 epochs, but each epoch is 10x slower: the 200x20 weight matrix is too small to amortize CUDA kernel launch overhead.
- KM is 790x slower: it runs 20 small independent operations, each launching a CUDA kernel for a tiny tensor, so launch overhead dominates.
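To make the GF(2) point concrete, here is a minimal numpy sketch of Gauss-Jordan elimination over GF(2) (an illustration, not the repo's implementation): the pivot chosen at each step depends on the outcome of the previous step, which is the sequential dependency that resists GPU parallelism.

```python
import numpy as np

def gf2_solve(A, b):
    """Gauss-Jordan elimination over GF(2) using XOR row updates.
    Each pivot step depends on the result of the previous one, so the
    outer loop is inherently sequential."""
    A, b = A.astype(np.uint8).copy(), b.astype(np.uint8).copy()
    m, n = A.shape
    row, pivots = 0, []
    for col in range(n):
        hits = np.flatnonzero(A[row:, col])
        if hits.size == 0:
            continue  # no pivot available in this column
        p = row + hits[0]
        A[[row, p]], b[[row, p]] = A[[p, row]], b[[p, row]]  # swap pivot up
        clear = A[:, col].astype(bool)
        clear[row] = False
        A[clear] ^= A[row]  # XOR pivot row into every other row with a 1 here
        b[clear] ^= b[row]
        pivots.append(col)
        row += 1
    x = np.zeros(n, dtype=np.uint8)
    for r, col in enumerate(pivots):
        x[col] = b[r]
    return x

# Recover a planted k=3-sparse parity over n=20 bits from random examples
rng = np.random.default_rng(42)
n, k = 20, 3
secret = np.zeros(n, dtype=np.uint8)
secret[rng.choice(n, size=k, replace=False)] = 1
X = rng.integers(0, 2, size=(4 * n, n), dtype=np.uint8)
y = (X @ secret) % 2
assert np.array_equal(gf2_solve(X, y), secret)
```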
The ARD vs DMC proxy comparison is still unanswered. These workloads don't stress the GPU memory subsystem, so measuring GPU energy here tells you nothing about memory access patterns. That question needs nanoGPT-scale workloads.
What's useful: The bin/gpu_energy.py pipeline works (PyTorch on Modal, matching Yaroslav's gpu_toy.py pattern). When the group moves to nanoGPT, this script is the starting point. At sparse parity scale, use CPU wall-clock time.
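One transferable lesson from the KM result is that per-call dispatch overhead on tiny tensors is amortized by batching independent operations into a single call. A CPU-side numpy analogy (illustrative only, not the actual KM kernels):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
mats = rng.standard_normal((20, 32, 32))  # 20 tiny independent problems
vecs = rng.standard_normal((20, 32))

# 20 separate matvecs: one dispatch per tiny operation (the KM pattern)
t0 = time.perf_counter()
for _ in range(200):
    out_loop = np.stack([m @ v for m, v in zip(mats, vecs)])
loop_s = time.perf_counter() - t0

# One batched call covering all 20 problems: dispatch overhead paid once
t0 = time.perf_counter()
for _ in range(200):
    out_batch = np.einsum("bij,bj->bi", mats, vecs)
batch_s = time.perf_counter() - t0

assert np.allclose(out_loop, out_batch)
print(f"looped: {loop_s * 1e3:.1f}ms, batched: {batch_s * 1e3:.1f}ms")
```

On a GPU the gap is far larger, since each separate call also pays a kernel launch; the same batching idea is what a faster KM implementation would need.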
## Files

- Script: `bin/gpu_energy.py`
- This document: `findings/exp_proxy_comparison.md`
- Reproduce: `modal run bin/gpu_energy.py`