Skip to content

Experiment exp_lasso: LASSO on Interaction Features for Sparse Parity

Date: 2026-03-06 Status: SUCCESS Answers: Can L1-penalized linear regression in the interaction basis recover the secret parity subset?

Hypothesis

If we expand input x in {-1,+1}^n to all C(n,k) interaction terms (products of k input bits), the parity function becomes a single nonzero coefficient in this basis. LASSO's sparsity-inducing L1 penalty should recover exactly that one coefficient, solving sparse parity without neural nets or gradient descent.

Config

Parameter Value
n_bits 20, 50
k_sparse 3, 5
method sklearn Lasso (alpha=0.1) + LassoCV (5-fold)
n_samples 500 (k=3, n=20), 1000 (k=3, n=50), 2000 (k=5)
seeds 42, 43, 44
max_iter 10,000
fit_intercept False

Results

Metric Value
Accuracy (all configs) 100% (9/9 runs)
Nonzero coefficients Exactly 1 (all runs)
LASSO coef (alpha=0.1) 0.9000 (bias from L1 shrinkage)
LassoCV coef (alpha=0.001) 0.9990 (near-perfect)
Wall time n=20,k=3 0.28s (with CV), 0.011s (LASSO only)
Wall time n=50,k=3 11.7s (with CV), 0.33s (LASSO only)
Wall time n=20,k=5 17.3s (with CV), 0.48s (LASSO only)
Weighted ARD (n=20,k=3) 861,076

Key Table

Config Method Acc Time (LASSO only) Time (with CV) Features
n=20,k=3 LASSO 3/3 0.005s 0.28s 1,140
n=50,k=3 LASSO 3/3 0.15s 11.7s 19,600
n=20,k=5 LASSO 3/3 0.21s 17.3s 15,504
n=20,k=3 Fourier 3/3 0.009s -- 1,140
n=50,k=3 Fourier 3/3 0.16s -- 19,600
n=20,k=5 Fourier 3/3 0.14s -- 15,504
n=20,k=3 SGD 100% 0.12s -- 20 raw
n=50,k=3 SGD direct 54% -- -- FAIL
n=20,k=3 Random search 5/5 0.011s -- --

Alpha Sweep (n=20, k=3, seed=42)

Alpha Nonzero Coef Correct Time
0.001 1 0.999 Yes 0.018s
0.01 1 0.990 Yes 0.009s
0.05 1 0.950 Yes 0.012s
0.1 1 0.900 Yes 0.008s
0.2 1 0.800 Yes 0.007s
0.5 1 0.500 Yes 0.007s
1.0 0 0.000 No 0.008s

Analysis

What worked

  • Perfect accuracy on every config: 9/9 runs across all three configs and seeds. LASSO finds exactly 1 nonzero coefficient corresponding to the true secret subset.
  • LASSO is robust to alpha: any value from 0.001 to 0.5 works. Only alpha=1.0 fails (shrinks everything to zero). The problem is so well-separated that LASSO barely needs tuning.
  • LASSO only (no CV) is fast: 0.005s for n=20/k=3 and 0.15s for n=50/k=3. Competitive with Fourier when you skip cross-validation.
  • Solves n=50,k=3 trivially -- the same config where SGD fails outright (54%).
  • Mathematically elegant: the parity function y = x_{i1} * x_{i2} * x_{i3} is literally a linear function in the interaction basis. LASSO is the right tool for sparse linear recovery.

What didn't work

  • LassoCV dominates wall time: 5-fold cross-validation is expensive. For n=50/k=3, LASSO itself takes 0.15s but LassoCV adds 12.7s. For n=20/k=5, LASSO takes 0.21s but CV adds 16.9s. CV is unnecessary here given LASSO's robustness to alpha.
  • Slower than Fourier: Fourier solves n=20/k=3 in 0.009s vs LASSO's 0.005s (comparable), but Fourier handles n=50/k=3 in 0.16s vs LASSO's 0.15s + feature expansion. Once you add feature expansion overhead, they are similar for LASSO-only but Fourier wins on simplicity.
  • L1 shrinkage biases the coefficient: at alpha=0.1 the recovered coefficient is 0.90 instead of 1.0. Not a problem for identification (largest coefficient wins), but the coefficient magnitude is systematically underestimated.
  • Feature expansion is O(n_samples * C(n,k)): same combinatorial cost as Fourier. LASSO does not avoid the combinatorial blowup -- it just uses a different algorithm on the same feature matrix.
  • ARD is high (861,076): the feature expansion creates a large matrix that must be read in full. Similar cache behavior to Fourier.

Surprise

  • Exactly 1 nonzero coefficient every time: LASSO's sparsity is perfect here. In typical LASSO applications you get some spurious nonzeros, but the interaction features for different subsets are nearly orthogonal (uncorrelated when inputs are uniform {-1,+1}), so the irrepresentable condition holds and LASSO achieves exact support recovery.
  • Alpha range is enormous: anything from 0.001 to 0.5 works (500x range). The true coefficient has magnitude 1.0 and all spurious correlations are ~0, giving a massive signal-to-noise ratio. LASSO barely needs to try.
  • LassoCV always picks alpha=0.001: the smallest alpha in the default grid, because there is essentially no noise to regularize against.

Open Questions

  • Can we skip the full feature expansion and instead use a screening rule? (e.g., compute marginal correlations first, then only expand promising subsets)
  • How does LASSO scale to k=7 or k=10? C(20,7) = 77,520 and C(20,10) = 184,756 features -- the expansion itself becomes the bottleneck
  • Elastic Net (L1 + L2) comparison: does adding L2 help or hurt when the true solution is maximally sparse?
  • LASSO path (lars algorithm) could be faster than coordinate descent for this problem since there is exactly 1 nonzero

Files

  • Experiment: src/sparse_parity/experiments/exp_lasso.py
  • Results: results/exp_lasso/results.json