Experiment exp_equilibrium_prop: Equilibrium Propagation for Sparse Parity
Date: 2026-03-06
Status: FAILED
Approach: #9 -- Scellier & Bengio (2017) Equilibrium Propagation
Hypothesis
Equilibrium propagation can solve sparse parity without backpropagation by using two forward relaxation phases (free and clamped). The weight update rule dW = (1/beta) * (s_clamped - s_free) approximates backprop gradients in the limit of small beta. Since EP uses only forward passes, it should have fundamentally different ARD characteristics than backprop.
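The two-phase scheme above can be sketched as follows. This is a minimal NumPy illustration, not the experiment's actual implementation: the single-hidden-layer shapes, the leaky fixed-point iteration, and the simplified outer-product form of the (1/beta) * (clamped - free) update are all assumptions made for the sketch.

```python
# Minimal Equilibrium Propagation sketch (hypothetical, not the code in
# exp_equilibrium_prop.py): free relaxation, weakly clamped relaxation,
# then a contrastive update scaled by 1/beta.
import numpy as np

rng = np.random.default_rng(42)
n_in, n_hid, n_out = 3, 100, 1
W1 = rng.normal(0.0, 0.1, (n_in, n_hid))
W2 = rng.normal(0.0, 0.1, (n_hid, n_out))

def relax(x, y_target=None, beta=0.0, steps=30, dt=0.5):
    """Iterate the state dynamics toward a fixed point.

    Free phase: y_target is None (output floats freely).
    Clamped phase: beta > 0 nudges the output toward the target."""
    h = np.zeros(n_hid)
    y = np.zeros(n_out)
    for _ in range(steps):
        h += dt * (np.tanh(x @ W1 + y @ W2.T) - h)
        dy = np.tanh(h @ W2) - y
        if y_target is not None:
            dy += beta * (y_target - y)  # weak clamping term
        y += dt * dy
    return h, y

def ep_step(x, y_target, beta=0.2, lr=0.1):
    """One training step: two relaxations, one contrastive update."""
    global W1, W2
    h0, y0 = relax(x)                        # free phase
    hb, yb = relax(x, y_target, beta=beta)   # clamped phase
    # dW = (lr/beta) * (clamped correlations - free correlations)
    W1 += lr / beta * (np.outer(x, hb) - np.outer(x, h0))
    W2 += lr / beta * (np.outer(hb, yb) - np.outer(h0, y0))
    return y0  # free-phase prediction

x = np.array([1.0, -1.0, 1.0])
y = np.array([1.0])  # parity target in {-1, +1}
pred = ep_step(x, y)
```

Note that no gradient of a loss is ever computed: both phases are forward relaxations, and learning comes entirely from the difference between the two fixed points.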
Config
| Parameter | n=3/k=3 | n=20/k=3 |
|---|---|---|
| n_bits | 3 | 20 |
| k_sparse | 3 | 3 |
| hidden | 100 | 1000 |
| n_train | 50 | 500 |
| n_test | 50 | 200 |
| lr | 0.1 | 0.05 |
| beta | 0.2 | 0.2 |
| free_steps | 30 | 30 |
| clamp_steps | 30 | 30 |
| step_size | 0.5 | 0.5 |
| max_epochs | 200 | 100 |
| seeds | 42, 43, 44 | 42, 43, 44 |
Results
| Config | Seed | Best Train | Best Test | Converge Epoch | Time | Weighted ARD |
|---|---|---|---|---|---|---|
| n=3, k=3 | 42 | 0.500 | 0.440 | - | 8.3s | 18,121 |
| n=3, k=3 | 43 | 0.620 | 0.820 | - | 8.3s | 18,121 |
| n=3, k=3 | 44 | 0.620 | 0.740 | - | 8.3s | 18,121 |
| n=20, k=3 | 42 | 0.528 | 0.525 | - | 91.9s | 711,003 |
| n=20, k=3 | 43 | 0.588 | 0.555 | - | 95.1s | 711,003 |
| n=20, k=3 | 44 | 0.768 | 0.745 | - | 94.4s | 711,003 |
Averages
| Config | Avg Train | Avg Test | Avg Time | Avg ARD |
|---|---|---|---|---|
| n=3, k=3 | 0.580 | 0.667 | 8.3s | 18,121 |
| n=20, k=3 | 0.628 | 0.608 | 93.8s | 711,003 |
Analysis
What worked
- The algorithm runs and produces gradient updates -- the EP framework is correctly implemented with free and clamped phases, and weights do update.
- Some seeds show partial learning -- seed 43 on n=3/k=3 reached 82% test accuracy, and seed 44 on n=20/k=3 reached 74.5% test accuracy, both well above the 50% chance baseline.
- ARD is well-defined -- the two-phase relaxation process has a clear memory access pattern: 401 reads, 134 writes per training step (for n=3/k=3 with 30 relaxation steps per phase).
What didn't work
- Failed to converge on any config -- no seed reached 90% test accuracy on either n=3/k=3 or n=20/k=3. The network gets stuck in local minima where the output saturates to a constant prediction.
- Tanh saturation trap -- the network quickly saturates (cost hits a plateau like 1.0 or 0.76 and stays there for hundreds of epochs). Once the output node saturates, the EP gradient signal vanishes because d(tanh)/d(pre) approaches 0.
- Very slow per epoch -- each epoch requires 2 * n_steps forward relaxation iterations per sample. For n=20/k=3 with 500 training samples and 30 steps per phase, that is 30,000 forward computations per epoch, taking ~0.9s per epoch. Backprop baseline solves this in ~5 epochs (0.12s total).
- No grokking observed -- unlike SGD which can grok after many epochs, EP shows no sign of delayed generalization. The loss plateau is a hard wall.
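The saturation trap described above is easy to verify numerically: d(tanh)/d(pre) = 1 - tanh(pre)^2 collapses toward zero as the pre-activation grows, so a saturated output barely moves between the free and clamped fixed points and the (s_clamped - s_free) signal vanishes.

```python
# Tanh saturation: the derivative 1 - tanh(pre)^2 vanishes for large
# pre-activations, killing the contrastive learning signal.
import numpy as np

pre = np.array([0.0, 1.0, 3.0, 6.0])
grad = 1.0 - np.tanh(pre) ** 2
# grad ~ [1.0, 0.42, 0.0099, 2.5e-5]: a unit with pre-activation 6
# responds ~40,000x more weakly to clamping than an unsaturated one.
```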
Surprise
- EP struggles even on n=3/k=3 (full parity) -- this is the easiest possible config where all bits matter, yet EP cannot reliably solve it. SGD solves this trivially in 1-3 epochs. This suggests the EP relaxation dynamics are fundamentally mismatched with the parity function's XOR-like structure.
- Seed-dependent partial learning -- seed 44 on n=20/k=3 reached 76.8% train accuracy while seed 42 was stuck at 52.8%. The initial random weights determine whether the network finds any useful features at all, suggesting a very rough loss landscape.
- Compute cost is extreme -- n=20/k=3 took 93.8s average per seed for 100 epochs (~281s total across 3 seeds). The SGD baseline solves it in 0.12s, a ~780x per-seed slowdown (~2,300x for the full 3-seed sweep) for a worse result.
Comparison with Backprop and Forward-Forward
| Method | n=3/k=3 Test | n=20/k=3 Test | Per-step ARD (n=3) |
|---|---|---|---|
| SGD (backprop) | 1.000 | 1.000 | ~4,000 (est.) |
| Forward-Forward | 0.500-0.660 | 0.500-0.515 | ~2,900 |
| Equilibrium Prop | 0.440-0.820 | 0.525-0.745 | 18,121 |
EP has higher ARD than both backprop and Forward-Forward because the iterative relaxation requires repeatedly reading the full weight matrices (30 times per phase, 60 total per training step).
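Where the 60 reads come from can be checked with a coarse count. This counts one full weight-matrix pass per relaxation step, which is only a proxy for the finer-grained per-element ARD metric used in this log, so the ratio matters more than the absolute numbers.

```python
# Coarse count of full weight-matrix passes per training step
# (proxy for the log's finer-grained ARD metric, not the same units).
phases = 2        # free + clamped
relax_steps = 30  # iterations per phase

ep_passes = phases * relax_steps  # = 60 full passes over the weights
bp_passes = 2                     # backprop: ~1 forward + 1 backward pass
ratio = ep_passes / bp_passes     # = 30.0x more weight-matrix traffic
```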
Open Questions
- Would a continuous Hopfield network formulation with modern Hopfield energy work better?
- Can the saturation problem be fixed with layer normalization or different activation functions?
- Would a contrastive Hebbian learning variant (a predecessor of EP) perform differently?
- Is the fundamental issue that parity requires precise cancellation, which iterative relaxation cannot achieve?
Files
- Experiment:
src/sparse_parity/experiments/exp_equilibrium_prop.py
- Results:
results/exp_equilibrium_prop/results.json