Date: 2026-03-06
Status: SUCCESS
Answers: Blank slate approach -- can MI between parity products and labels identify the secret subset?
Hypothesis
For sparse parity with secret S, MI(y; prod(x_S)) should equal log(2) ~ 0.693 nats (1 bit) for the true subset, and ~0 for all wrong subsets. Unlike Fourier/Walsh-Hadamard (which computes exact correlation), MI detects any nonlinear relationship via a 2x2 contingency table. For binary parity, both approaches should work, but MI generalizes to non-binary settings.
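The 2x2 contingency-table MI estimator can be sketched in a few lines. This is an illustrative reimplementation, not the experiment's actual code; `mi_2x2` and the secret subset `S` are made-up names:

```python
import numpy as np

def mi_2x2(a, b):
    """Plug-in mutual information (nats) between two {-1,+1} arrays,
    estimated from their 2x2 contingency table."""
    mi = 0.0
    for va in (-1, 1):
        pa = np.mean(a == va)
        for vb in (-1, 1):
            pb = np.mean(b == vb)
            pab = np.mean((a == va) & (b == vb))  # joint cell probability
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

rng = np.random.default_rng(42)
x = rng.choice([-1, 1], size=(500, 20))
S = [2, 7, 11]                                   # hypothetical secret subset
y = np.prod(x[:, S], axis=1)                     # noiseless parity labels

print(mi_2x2(y, np.prod(x[:, S], axis=1)))       # ~ log(2) = 0.693 nats
print(mi_2x2(y, np.prod(x[:, [0, 1, 3]], axis=1)))  # ~ 0 for a wrong subset
```

For the noiseless true subset the parity product equals `y` exactly, so the estimate hits the empirical entropy of `y`, which is within sampling error of log(2).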
Config
| Parameter | Value |
|---|---|
| n_bits | 20, 50 |
| k_sparse | 3, 5 |
| method | exhaustive MI over all C(n,k) subsets |
| MI estimator | 2x2 contingency table (exact for binary {-1,+1}) |
| n_samples | 500 |
| seeds | 42, 43, 44 |
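The exhaustive search over all C(n,k) subsets amounts to one MI score per candidate. A minimal sketch under the config above (function names are illustrative, not the experiment's API):

```python
import itertools
import numpy as np

def mi_2x2(a, b):
    """Plug-in MI (nats) from the 2x2 contingency table of two {-1,+1} arrays."""
    mi = 0.0
    for va in (-1, 1):
        pa = np.mean(a == va)
        for vb in (-1, 1):
            pb = np.mean(b == vb)
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

def find_secret_subset(x, y, k):
    """Score every size-k subset by MI(y; parity product); return the best."""
    n = x.shape[1]
    score = lambda S: mi_2x2(y, np.prod(x[:, S], axis=1))
    return max(itertools.combinations(range(n), k), key=score)

rng = np.random.default_rng(42)
x = rng.choice([-1, 1], size=(500, 20))   # n_bits=20, n_samples=500
secret = (3, 8, 14)                       # illustrative secret subset
y = np.prod(x[:, secret], axis=1)         # noiseless parity labels
print(find_secret_subset(x, y, 3))        # recovers (3, 8, 14)
```

For n=20, k=3 this scans C(20,3) = 1,140 subsets, matching the row counts in the results table below.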
Results
| Config | C(n,k) | Avg MI (true) | Avg MI (wrong) | Avg MI gap | Avg time | Correct |
|---|---|---|---|---|---|---|
| n=20, k=3 | 1,140 | 0.6925 | 0.0010 | 0.6799 | 0.033s | 3/3 |
| n=50, k=3 | 19,600 | 0.6922 | 0.0010 | 0.6731 | 0.569s | 3/3 |
| n=20, k=5 | 15,504 | 0.6915 | 0.0010 | 0.6756 | 0.453s | 3/3 |
Key Table
| Method | n=20/k=3 | n=50/k=3 | n=20/k=5 |
|---|---|---|---|
| MI (this exp) | 0.033s, 100% | 0.569s, 100% | 0.453s, 100% |
| Fourier (exp_fourier) | 0.009s, 100% | 0.13s, 100% | 1.2s, 100% |
| Random search (exp_evolutionary) | 0.011s, 100% | 0.142s, 100% | 0.426s, 100% |
| SGD (baseline) | 0.12s, 100% | FAIL (54%) | 14 epochs (n=5000) |
| SGD + curriculum | -- | 20 epochs, >90% | -- |
Analysis
What worked
- MI identifies the true subset with 100% accuracy on all configs and all seeds. The true subset's MI is always within 0.002 nats of the theoretical maximum log(2) = 0.693 nats.
- The MI gap is enormous: best MI ~ 0.69 vs next-best ~ 0.01-0.02. The signal-to-noise ratio is ~40-70x, making identification trivial.
- MI solves n=50/k=3 in 0.57s, a config that SGD fails on (54% accuracy) without curriculum learning.
- Sample complexity is low: 20 samples suffice for n=20/k=3 (3/3 correct). The MI gap grows with more samples (0.35 at 20 samples, 0.68 at 500 samples) but even small gaps are enough for correct identification.
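The sample-complexity behavior can be reproduced with a small sweep; this is a sketch with an illustrative secret subset and sample sizes, not the experiment's exact sweep, so the printed gaps will vary with the seed:

```python
import itertools
import numpy as np

def mi_2x2(a, b):
    """Plug-in MI (nats) from the 2x2 contingency table of two {-1,+1} arrays."""
    mi = 0.0
    for va in (-1, 1):
        pa = np.mean(a == va)
        for vb in (-1, 1):
            pb = np.mean(b == vb)
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

rng = np.random.default_rng(0)
n, k, S_true = 20, 3, (2, 7, 11)          # illustrative config and secret
x_all = rng.choice([-1, 1], size=(500, n))
y_all = np.prod(x_all[:, S_true], axis=1)

for m in (20, 100, 500):                  # subsample sizes to sweep
    x, y = x_all[:m], y_all[:m]
    scores = {S: mi_2x2(y, np.prod(x[:, S], axis=1))
              for S in itertools.combinations(range(n), k)}
    best = max(scores, key=scores.get)
    runner_up = max(v for S, v in scores.items() if S != S_true)
    print(m, best == S_true, round(scores[S_true] - runner_up, 3))
```

At small sample sizes the wrong-subset maximum rises (more spurious empirical correlation), so the gap shrinks even when identification still succeeds.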
What didn't work
- MI is ~3.7x slower than Fourier on n=20/k=3 (0.033s vs 0.009s) and ~4.4x slower on n=50/k=3 (0.569s vs 0.13s). The contingency table computation (looping over 2x2 cells, computing boolean masks) is heavier than Fourier's single mean(y * parity) operation.
- MI is faster than Fourier on n=20/k=5 (0.453s vs 1.2s). This appears to be due to implementation differences rather than algorithmic advantage; both are O(C(n,k) * n_samples).
- MI's ARD is identical to Fourier's (1,147,375, vs 17,976 for SGD). Both methods stream over the entire dataset for each subset with no weight reuse.
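For contrast with the contingency-table loop, the Fourier score mentioned above reduces to a single vectorized mean per subset. A sketch under the same {-1,+1} encoding (`fourier_score` is an illustrative name):

```python
import numpy as np

def fourier_score(x, y, S):
    """|Empirical Fourier coefficient| of y on subset S: one mean(y * parity),
    versus the four masked means needed by the 2x2 contingency-table MI."""
    return abs(np.mean(y * np.prod(x[:, S], axis=1)))

rng = np.random.default_rng(1)
x = rng.choice([-1, 1], size=(500, 10))
y = np.prod(x[:, [1, 4, 6]], axis=1)      # noiseless parity labels

print(fourier_score(x, y, [1, 4, 6]))     # 1.0: y * parity is identically +1
print(fourier_score(x, y, [0, 2, 3]))     # near 0 for a wrong subset
```

Both scans are O(C(n,k) * n_samples); the constant factor (one mean vs. four masked means) accounts for the timing gap.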
Surprise
- MI and Fourier give identical ARD (1,147,375). This makes sense: both iterate over all C(n,k) subsets and read x and y for each one. The memory access pattern is the same streaming pattern regardless of whether you compute correlation or MI.
- Wrong subsets have MI extremely close to 0 (mean ~0.001 nats). For independent binary variables with 500 samples, the plug-in estimator's finite-sample bias is about (|X|-1)(|Y|-1)/(2 * n_samples) = 1/1000 = 0.001 nats (the Miller-Madow correction for a 2x2 table), which matches exactly. MI correctly reports near-zero for independent variables.
- MI provides no advantage over Fourier for binary parity. Both detect the same signal (perfect correlation/dependence between parity product and label). MI's generality (detecting non-linear relationships) is unnecessary here because the relationship IS the linear product.
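The finite-sample bias figure can be verified empirically by averaging the plug-in MI of independent binary draws over many trials (a sketch; trial count and seed are arbitrary):

```python
import numpy as np

def mi_2x2(a, b):
    """Plug-in MI (nats) from the 2x2 contingency table of two {-1,+1} arrays."""
    mi = 0.0
    for va in (-1, 1):
        pa = np.mean(a == va)
        for vb in (-1, 1):
            pb = np.mean(b == vb)
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

# For independent variables, the plug-in MI is positively biased by
# roughly 1/(2*N) nats for a 2x2 table (Miller-Madow first-order term).
rng = np.random.default_rng(7)
N, trials = 500, 2000
vals = [mi_2x2(rng.choice([-1, 1], size=N), rng.choice([-1, 1], size=N))
        for _ in range(trials)]
print(float(np.mean(vals)), 1 / (2 * N))   # both ~ 0.001
```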
Open Questions
- Does MI outperform Fourier on a modified parity problem where the label is a nonlinear function of the parity (e.g., y = sign(prod(x_S) + noise))? MI should be more robust to such corruptions.
- Can kNN-based MI estimators (Kraskov) work on continuous-valued features without the discretization step?
- For large k, can MI be computed on partial subsets to prune the search space?
Files
- Experiment: src/sparse_parity/experiments/exp_mutual_info.py
- Results: results/exp_mutual_info/results.json