Experiment exp_binary_weights: Binary Weights for Sparse Parity¶

Date: 2026-03-06 Status: PARTIAL SUCCESS (n=3 solved, n=20 failed) Approach: #13 -- BinaryConnect (Courbariaux et al. 2015)

Hypothesis¶

If we train a 2-layer network with binary weights (+1/-1) and sign activation, sparse parity should be solvable because parity is fundamentally a binary operation. Binary weights reduce inference memory by 13x and energy per step by 1.5x. The straight-through estimator (STE) passes gradients through the sign function unchanged for training.

Config¶

Parameter	Value
n_bits	3, 20
k_sparse	3
hidden	100 (n=3), 200 and 1000 (n=20)
lr	0.01, 0.1
wd	0.01
batch_size	32
max_epochs	200
n_train	100 (n=3), 1000 (n=20)
n_test	100 (n=3), 500 (n=20)
seeds	42, 43, 44
method	BinaryConnect (sign weights + STE) vs float32 baseline

Results¶

Config	Method	Avg Acc	Solved	Avg Epochs	Avg Time
n=3, k=3, h=100	binary	100%	3/3	1	0.00s
n=3, k=3, h=100	float32	100%	3/3	68	0.02s
n=20, k=3, h=200, lr=0.1	binary	55.5%	0/3	200	1.51s
n=20, k=3, h=200, lr=0.1	float32	100%	3/3	39	0.16s
n=20, k=3, h=1000, lr=0.01	binary	59.7%	0/3	200	5.29s
n=20, k=3, h=1000, lr=0.01	float32	57.6%	0/3	200	1.90s
n=20, k=3, h=1000, lr=0.1	binary	54.9%	0/3	200	5.01s
n=20, k=3, h=1000, lr=0.1	float32	100%	3/3	51	0.39s

Energy and Memory Comparison (per-step, first mini-batch)¶

Config	Binary ARD	Float32 ARD	Binary Energy Proxy	Float32 Energy Proxy	Ratio
n=3, k=3	4,667	7,679	150M	242M	1.6x
n=20, k=3, h=200	13,726	19,103	987M	1,602M	1.6x
n=20, k=3, h=1000	65,811	91,467	23.9B	36.5B	1.5x

Config	Float32 Inference	Binary Inference	Reduction
n=3, k=3	2.0 KB	0.4 KB	4.4x
n=20, k=3 (h=1000)	85.9 KB	6.5 KB	13.3x

Analysis¶

What worked¶

n=3/k=3 solved in 1 epoch by binary weights vs 68 epochs for float32. When all input bits are parity bits, the sign activation aligns perfectly with the structure: binary weights + sign activation can represent 3-bit parity exactly.
Energy per step is 1.5-1.6x lower for binary due to smaller weight reads (counted as 1/32 of float reads).
Inference memory is 13x smaller for n=20/k=3 with h=1000 (6.5 KB vs 85.9 KB).

What did not work¶

Binary weights completely fail on n=20/k=3: ~55% accuracy (near random chance) across all learning rates and hidden sizes. The float32 baseline solves the same configs at 100%.
The straight-through estimator is too crude: The sign function has zero derivative everywhere except at zero. STE approximates this as identity (gradient = 1), which is a poor approximation. For n=20, the network needs to learn which 3 of 20 input bits matter -- this requires precise gradient information that STE destroys.
Sign activation kills the gradient signal: Unlike ReLU (which passes gradients for positive inputs), sign activation with STE passes all gradients unchanged. This removes the selective gating that helps ReLU networks learn feature selection in sparse parity.

Root cause of failure¶

The fundamental issue is that BinaryConnect was designed for classification tasks where float32 networks already work and binarization is a compression technique. For sparse parity, the learning problem itself (feature selection) requires gradient precision. The STE approximation is adequate for MNIST/CIFAR where the loss landscape is smooth, but sparse parity has a sharp phase transition that requires precise gradient accumulation to navigate.

The n=3/k=3 success is misleading -- when all bits are relevant, there is no feature selection problem. The network just needs to learn the parity function, which binary weights represent exactly.

Surprise¶

Binary weights solve n=3/k=3 in a single epoch -- 80x faster than float32. This confirms that parity IS fundamentally a binary operation, but only when the feature selection problem is trivial.
The float32 baseline with h=1000, lr=0.01 also fails (57.6%), showing that the learning rate matters more than the weight precision for sparse parity convergence.

Open Questions (for next experiment)¶

Can a hybrid approach work? Use float32 for the first layer (feature selection) and binary for the second layer (parity computation).
Would a better gradient estimator (e.g., clipped STE, or the Gumbel-softmax trick) help binary weights learn feature selection?
Can binary weights succeed if we pre-select the relevant features (e.g., via LASSO or mutual information) and only need to learn the parity function on the selected features?

Files¶

Experiment: src/sparse_parity/experiments/exp_binary_weights.py
Results: results/exp_binary_weights/results.json