Experiment exp_equilibrium_prop: Equilibrium Propagation for Sparse Parity

Date: 2026-03-06
Status: FAILED
Approach: #9 -- Scellier & Bengio (2017) Equilibrium Propagation

Hypothesis

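Equilibrium propagation (EP) can solve sparse parity without backpropagation by using two relaxation phases: a free phase, and a clamped phase in which the output is nudged toward the target with strength beta. The weight update rule dW = (1/beta) * (s_clamped - s_free) approximates backprop gradients in the limit of small beta; in Scellier & Bengio's formulation, s here denotes the product of pre- and post-synaptic activations at the end of each phase. Since EP uses only forward relaxation passes and no backward pass, it should have fundamentally different ARD characteristics than backprop.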
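The two phases and the contrastive update can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in exp_equilibrium_prop.py: the one-hidden-layer energy, the tanh activation, the helper names (rho, relax, ep_step), the weight scale, and the tiny 8-unit hidden layer are all hypothetical choices made for readability. Only beta, step_size, the step counts, and lr are taken from the n=3/k=3 config.

```python
import numpy as np

def rho(s):
    # Activation (assumed tanh, since the analysis discusses tanh saturation).
    return np.tanh(s)

def rho_prime(s):
    return 1.0 - np.tanh(s) ** 2

def relax(x, h, y, W1, W2, target=None, beta=0.0, steps=30, step_size=0.5):
    """Euler relaxation of the state (h, y) down the gradient of
    E = 0.5|h|^2 + 0.5|y|^2 - rho(h).W1 x - rho(y).W2 rho(h),
    plus beta * 0.5|y - target|^2 when a target is given."""
    for _ in range(steps):
        dh = -h + rho_prime(h) * (W1 @ x + W2.T @ rho(y))
        dy = -y + rho_prime(y) * (W2 @ rho(h))
        if target is not None:
            dy = dy + beta * (target - y)  # weak clamping toward the target
        h = h + step_size * dh
        y = y + step_size * dy
    return h, y

def ep_step(x, target, W1, W2, lr=0.1, beta=0.2):
    """One EP training step; updates W1 and W2 in place."""
    h0 = np.zeros(W1.shape[0])
    y0 = np.zeros(W2.shape[0])
    h_free, y_free = relax(x, h0, y0, W1, W2)                        # free phase
    h_cl, y_cl = relax(x, h_free, y_free, W1, W2, target, beta)      # clamped phase
    # Contrastive update: (1/beta) * (clamped correlations - free correlations).
    W1 += (lr / beta) * (np.outer(rho(h_cl), x) - np.outer(rho(h_free), x))
    W2 += (lr / beta) * (np.outer(rho(y_cl), rho(h_cl)) - np.outer(rho(y_free), rho(h_free)))
    return y_free, y_cl

# Toy usage on a single 3-bit example (hypothetical sizes and initialization).
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(8, 3))
W2 = rng.normal(scale=0.1, size=(1, 8))
x = np.array([1.0, -1.0, 1.0])
target = np.array([np.prod(x)])  # parity label in +/-1 encoding
W2_before = W2.copy()
y_free, y_clamped = ep_step(x, target, W1, W2)
```

The clamped phase starts from the free equilibrium, so the difference between the two states reflects only the nudge toward the target.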

Config

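| Parameter | n=3/k=3 | n=20/k=3 |
|---|---|---|
| n_bits | 3 | 20 |
| k_sparse | 3 | 3 |
| hidden | 100 | 1000 |
| n_train | 50 | 500 |
| n_test | 50 | 200 |
| lr | 0.1 | 0.05 |
| beta | 0.2 | 0.2 |
| free_steps | 30 | 30 |
| clamp_steps | 30 | 30 |
| step_size | 0.5 | 0.5 |
| max_epochs | 200 | 100 |
| seeds | 42, 43, 44 | 42, 43, 44 |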
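For readers unfamiliar with the task, the sketch below shows one plausible way to generate data matching n_bits, k_sparse, and n_train. The function name make_sparse_parity, the +/-1 encoding, and the way the secret bit positions are drawn are assumptions for illustration; the experiment's actual data pipeline may differ.

```python
import numpy as np

def make_sparse_parity(n_bits, k_sparse, n_samples, seed=42):
    """Hypothetical generator: inputs are +/-1 vectors, the label is the
    product (parity) of k_sparse secret positions."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1.0, 1.0], size=(n_samples, n_bits))
    secret = np.sort(rng.choice(n_bits, size=k_sparse, replace=False))
    y = np.prod(x[:, secret], axis=1)
    return x, y, secret

# n=20/k=3 config: 500 training samples, only 3 of the 20 bits matter.
x_train, y_train, secret = make_sparse_parity(n_bits=20, k_sparse=3, n_samples=500)

# n=3/k=3 config: every bit is in the secret set (full parity).
x_small, y_small, secret_small = make_sparse_parity(n_bits=3, k_sparse=3, n_samples=50)
```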

Results

| Config | Seed | Best Train | Best Test | Converge Epoch | Time | Weighted ARD |
|---|---|---|---|---|---|---|
| n=3, k=3 | 42 | 0.500 | 0.440 | - | 8.3s | 18,121 |
| n=3, k=3 | 43 | 0.620 | 0.820 | - | 8.3s | 18,121 |
| n=3, k=3 | 44 | 0.620 | 0.740 | - | 8.3s | 18,121 |
| n=20, k=3 | 42 | 0.528 | 0.525 | - | 91.9s | 711,003 |
| n=20, k=3 | 43 | 0.588 | 0.555 | - | 95.1s | 711,003 |
| n=20, k=3 | 44 | 0.768 | 0.745 | - | 94.4s | 711,003 |

Averages

| Config | Avg Train | Avg Test | Avg Time | Avg ARD |
|---|---|---|---|---|
| n=3, k=3 | 0.580 | 0.667 | 8.3s | 18,121 |
| n=20, k=3 | 0.628 | 0.608 | 93.8s | 711,003 |
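The averages can be recomputed from the per-seed rows with a few lines of Python. The values below are copied from the Results table; the dictionary layout is just an illustrative choice and does not reflect the structure of results.json.

```python
from statistics import mean

# Per-seed values for seeds 42, 43, 44, copied from the Results table.
results = {
    "n=3, k=3": {
        "train": [0.500, 0.620, 0.620],
        "test": [0.440, 0.820, 0.740],
        "time_s": [8.3, 8.3, 8.3],
    },
    "n=20, k=3": {
        "train": [0.528, 0.588, 0.768],
        "test": [0.525, 0.555, 0.745],
        "time_s": [91.9, 95.1, 94.4],
    },
}

averages = {
    cfg: {name: mean(values) for name, values in cols.items()}
    for cfg, cols in results.items()
}

for cfg, avg in averages.items():
    print(f"{cfg}: train={avg['train']:.3f} test={avg['test']:.3f} time={avg['time_s']:.1f}s")
```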

Analysis

What worked

  • The algorithm runs and produces gradient updates -- the EP framework is correctly implemented with free and clamped phases, and weights do update.
  • Some seeds show partial learning -- seed 43 on n=3/k=3 reached 82% test accuracy, and seed 44 on n=20/k=3 reached 74.5% test accuracy, both well above the 50% chance baseline.
  • ARD is well-defined -- the two-phase relaxation process has a clear memory access pattern: 401 reads, 134 writes per training step (for n=3/k=3 with 30 relaxation steps per phase).

What didn't work

  • Failed to converge on any config -- no seed reached 90% test accuracy on either n=3/k=3 or n=20/k=3. The network gets stuck in local minima where the output saturates to a constant prediction.
  • Tanh saturation trap -- the network quickly saturates (cost hits a plateau such as 1.0 or 0.76 and stays there for the rest of the run, which is capped at 100-200 epochs). Once the output node saturates, the EP gradient signal vanishes because d(tanh)/d(pre) approaches 0.
  • Very slow per epoch -- each epoch requires free_steps + clamp_steps = 60 relaxation iterations per sample. For n=20/k=3 with 500 training samples, that is 60 * 500 = 30,000 relaxation iterations per epoch, taking ~0.9s per epoch. The backprop baseline solves this config in ~5 epochs (0.12s total).
  • No grokking observed -- unlike SGD, which can grok after many epochs, EP shows no sign of delayed generalization within the epoch budget. The loss plateau never broke in any run.
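The saturation effect described above follows directly from the derivative of tanh, 1 - tanh(pre)^2, which shrinks rapidly as the pre-activation grows. The short check below illustrates this; the sample pre-activation values are arbitrary and not taken from the experiment.

```python
import numpy as np

# Arbitrary pre-activation magnitudes, from unsaturated to deeply saturated.
pre = np.array([0.0, 1.0, 2.0, 4.0, 8.0])

# d(tanh)/d(pre) = 1 - tanh(pre)^2
grad = 1.0 - np.tanh(pre) ** 2

for p, g in zip(pre, grad):
    print(f"pre={p:4.1f}  tanh={np.tanh(p):.6f}  d(tanh)/d(pre)={g:.2e}")
```

Any update that is multiplied by this factor at the output node becomes negligible once the pre-activation is a few units away from zero, which is consistent with the plateaus reported here.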

Surprise

  • EP struggles even on n=3/k=3 (full parity) -- this is the easiest possible config where all bits matter, yet EP cannot reliably solve it. SGD solves this trivially in 1-3 epochs. This suggests the EP relaxation dynamics are fundamentally mismatched with the parity function's XOR-like structure.
  • Seed-dependent partial learning -- seed 44 on n=20/k=3 reached 76.8% train accuracy while seed 42 was stuck at 52.8%. The initial random weights determine whether the network finds any useful features at all, suggesting a very rough loss landscape.
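  • Compute cost is extreme -- n=20/k=3 took 93.8s average per seed for 100 epochs (total ~281s across 3 seeds). The SGD baseline solves it in 0.12s. If 0.12s is the time for a single SGD run, that is a ~780x slowdown per seed (93.8s / 0.12s), or ~2,300x when the 281s three-seed total is set against one SGD run, for a worse result.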
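The cost figures above can be reproduced with simple arithmetic. The sketch assumes that 0.12s is the wall time of a single SGD run, which the report does not state explicitly; variable names are illustrative.

```python
# Values from the n=20/k=3 config and results.
free_steps, clamp_steps = 30, 30
n_train = 500
max_epochs = 100
ep_times_s = [91.9, 95.1, 94.4]   # seeds 42, 43, 44
sgd_time_s = 0.12                 # assumed to be one SGD run

relax_iters_per_epoch = (free_steps + clamp_steps) * n_train
ep_avg_s = sum(ep_times_s) / len(ep_times_s)
ep_total_s = sum(ep_times_s)
seconds_per_epoch = ep_avg_s / max_epochs

slowdown_per_seed = ep_avg_s / sgd_time_s       # one EP seed vs one SGD run
slowdown_three_seeds = ep_total_s / sgd_time_s  # three EP seeds vs one SGD run

print(f"relaxation iterations per epoch: {relax_iters_per_epoch}")
print(f"seconds per epoch: {seconds_per_epoch:.2f}")
print(f"slowdown per seed: {slowdown_per_seed:.0f}x")
print(f"slowdown, 3-seed total vs one SGD run: {slowdown_three_seeds:.0f}x")
```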

Comparison with Backprop and Forward-Forward

| Method | n=3/k=3 Test | n=20/k=3 Test | Per-step ARD (n=3) |
|---|---|---|---|
| SGD (backprop) | 1.000 | 1.000 | ~4,000 (est.) |
| Forward-Forward | 0.500-0.660 | 0.500-0.515 | ~2,900 |
| Equilibrium Prop | 0.440-0.820 | 0.525-0.745 | 18,121 |

EP has higher ARD than both backprop and Forward-Forward because the iterative relaxation requires repeatedly reading the full weight matrices (30 times per phase, 60 total per training step).

Open Questions

  • Would a continuous Hopfield network formulation with modern Hopfield energy work better?
  • Can the saturation problem be fixed with layer normalization or different activation functions?
  • Would a contrastive Hebbian learning variant (a predecessor of EP) perform differently?
  • Is the fundamental issue that parity requires precise cancellation, which iterative relaxation cannot achieve?

Files

  • Experiment: src/sparse_parity/experiments/exp_equilibrium_prop.py
  • Results: results/exp_equilibrium_prop/results.json