Skip to content

Experiment exp_mutual_info: Mutual Information Estimation for Sparse Parity

Date: 2026-03-06 Status: SUCCESS Answers: Blank slate approach -- can MI between parity products and labels identify the secret subset?

Hypothesis

For sparse parity with secret S, MI(y; prod(x_S)) should equal log(2) ~ 0.693 nats (1 bit) for the true subset, and ~0 for all wrong subsets. Unlike Fourier/Walsh-Hadamard (which computes exact correlation), MI detects any nonlinear relationship via a 2x2 contingency table. For binary parity, both approaches should work, but MI generalizes to non-binary settings.

Config

Parameter Value
n_bits 20, 50
k_sparse 3, 5
method exhaustive MI over all C(n,k) subsets
MI estimator 2x2 contingency table (exact for binary {-1,+1})
n_samples 500
seeds 42, 43, 44

Results

Config C(n,k) Avg MI (true) Avg MI (wrong) Avg MI gap Avg time Correct
n=20, k=3 1,140 0.6925 0.0010 0.6799 0.033s 3/3
n=50, k=3 19,600 0.6922 0.0010 0.6731 0.569s 3/3
n=20, k=5 15,504 0.6915 0.0010 0.6756 0.453s 3/3

Key Table

Method n=20/k=3 n=50/k=3 n=20/k=5
MI (this exp) 0.033s, 100% 0.569s, 100% 0.453s, 100%
Fourier (exp_fourier) 0.009s, 100% 0.13s, 100% 1.2s, 100%
Random search (exp_evolutionary) 0.011s, 100% 0.142s, 100% 0.426s, 100%
SGD (baseline) 0.12s, 100% FAIL (54%) 14 epochs (n=5000)
SGD + curriculum -- 20 epochs, >90% --

Analysis

What worked

  • MI identifies the true subset with 100% accuracy on all configs and all seeds. The true subset's MI is always within 0.002 nats of the theoretical maximum log(2) = 0.693 nats.
  • The MI gap is enormous: best MI ~ 0.69 vs next-best ~ 0.01-0.02. The signal-to-noise ratio is ~40-70x, making identification trivial.
  • MI solves n=50/k=3 in 0.57s, a config that SGD fails on (54% accuracy) without curriculum learning.
  • Sample complexity is low: 20 samples suffice for n=20/k=3 (3/3 correct). The MI gap grows with more samples (0.35 at 20 samples, 0.68 at 500 samples) but even small gaps are enough for correct identification.

What didn't work

  • MI is ~3.7x slower than Fourier on n=20/k=3 (0.033s vs 0.009s) and ~4.4x slower on n=50/k=3 (0.569s vs 0.13s). The contingency table computation (looping over 2x2 cells, computing boolean masks) is heavier than Fourier's single mean(y * parity) operation.
  • MI is faster than Fourier on n=20/k=5 (0.453s vs 1.2s). This appears to be due to implementation differences rather than algorithmic advantage; both are O(C(n,k) * n_samples).
  • ARD is identical to Fourier (1,147,375 vs 17,976 for SGD). Both methods stream over the entire dataset for each subset with no weight reuse.

Surprise

  • MI and Fourier give identical ARD (1,147,375). This makes sense: both iterate over all C(n,k) subsets and read x and y for each one. The memory access pattern is the same streaming pattern regardless of whether you compute correlation or MI.
  • Wrong subsets have MI extremely close to 0 (mean ~0.001 nats). For independent binary variables with 500 samples, the finite-sample MI bias is about 1/(2 * n_samples * ln(2)) ~ 0.001, which matches perfectly. MI correctly reports near-zero for independent variables.
  • MI provides no advantage over Fourier for binary parity. Both detect the same signal (perfect correlation/dependence between parity product and label). MI's generality (detecting non-linear relationships) is unnecessary here because the relationship IS the linear product.

Open Questions

  • Does MI outperform Fourier on a modified parity problem where the label is a nonlinear function of the parity (e.g., y = sign(prod(x_S) + noise))? MI should be more robust to such corruptions.
  • Can kNN-based MI estimators (Kraskov) work on continuous-valued features without the discretization step?
  • For large k, can MI be computed on partial subsets to prune the search space?

Files

  • Experiment: src/sparse_parity/experiments/exp_mutual_info.py
  • Results: results/exp_mutual_info/results.json