Experiment exp_lasso: LASSO on Interaction Features for Sparse Parity
Date: 2026-03-06
Status: SUCCESS
Answers: Can L1-penalized linear regression in the interaction basis recover the secret parity subset?
Hypothesis
If we expand input x in {-1,+1}^n to all C(n,k) interaction terms (products of k input bits), the parity function becomes a single nonzero coefficient in this basis. LASSO's sparsity-inducing L1 penalty should recover exactly that one coefficient, solving sparse parity without neural nets or gradient descent.
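The core claim is easy to verify directly. A minimal sketch (hypothetical secret subset, chosen for illustration) showing that parity is exactly one column of the interaction basis:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 3
secret = (1, 4, 7)  # hypothetical secret subset for illustration

X = rng.choice([-1, 1], size=(200, n))
y = X[:, secret].prod(axis=1)  # parity = product of the secret bits

# Expand to all C(n, k) degree-k interaction columns.
subsets = list(itertools.combinations(range(n), k))
Phi = np.stack([X[:, s].prod(axis=1) for s in subsets], axis=1)

# In this basis, y is exactly one column: coefficient 1 on the secret
# subset, 0 on everything else -- a maximally sparse linear target.
j = subsets.index(secret)
assert np.array_equal(Phi[:, j], y)
```

This is what makes L1 recovery plausible: the target is 1-sparse in a dictionary of C(n, k) nearly uncorrelated features.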
Config

| Parameter | Value |
|---|---|
| n_bits | 20, 50 |
| k_sparse | 3, 5 |
| method | sklearn Lasso (alpha=0.1) + LassoCV (5-fold) |
| n_samples | 500 (k=3, n=20), 1000 (k=3, n=50), 2000 (k=5) |
| seeds | 42, 43, 44 |
| max_iter | 10,000 |
| fit_intercept | False |
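A minimal reproduction of one cell of this config (n=20, k=3, 500 samples), assuming a straightforward itertools-based feature expansion; wall times and exact seed handling will differ from the experiment script:

```python
import itertools

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n, k, n_samples = 20, 3, 500
secret = tuple(sorted(rng.choice(n, size=k, replace=False)))

X = rng.choice([-1, 1], size=(n_samples, n))
y = X[:, secret].prod(axis=1).astype(float)

# All C(20, 3) = 1,140 degree-k interaction features.
subsets = list(itertools.combinations(range(n), k))
Phi = np.stack([X[:, s].prod(axis=1) for s in subsets], axis=1).astype(float)

model = Lasso(alpha=0.1, fit_intercept=False, max_iter=10_000).fit(Phi, y)
nonzero = np.flatnonzero(model.coef_)
# Expect: exactly one nonzero, on the secret subset, at ~0.9.
print(len(nonzero), subsets[nonzero[0]] == secret, model.coef_[nonzero[0]])
```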
Results

| Metric | Value |
|---|---|
| Accuracy (all configs) | 100% (9/9 runs) |
| Nonzero coefficients | Exactly 1 (all runs) |
| LASSO coef (alpha=0.1) | 0.9000 (bias from L1 shrinkage) |
| LassoCV coef (alpha=0.001) | 0.9990 (near-perfect) |
| Wall time n=20, k=3 | 0.28s (with CV), 0.011s (LASSO only) |
| Wall time n=50, k=3 | 11.7s (with CV), 0.33s (LASSO only) |
| Wall time n=20, k=5 | 17.3s (with CV), 0.48s (LASSO only) |
| Weighted ARD (n=20, k=3) | 861,076 |
Key Table

| Config | Method | Acc | Time (LASSO only) | Time (with CV) | Features |
|---|---|---|---|---|---|
| n=20, k=3 | LASSO | 3/3 | 0.005s | 0.28s | 1,140 |
| n=50, k=3 | LASSO | 3/3 | 0.15s | 11.7s | 19,600 |
| n=20, k=5 | LASSO | 3/3 | 0.21s | 17.3s | 15,504 |
| n=20, k=3 | Fourier | 3/3 | 0.009s | -- | 1,140 |
| n=50, k=3 | Fourier | 3/3 | 0.16s | -- | 19,600 |
| n=20, k=5 | Fourier | 3/3 | 0.14s | -- | 15,504 |
| n=20, k=3 | SGD | 100% | 0.12s | -- | 20 raw |
| n=50, k=3 | SGD direct | 54% | -- | -- | FAIL |
| n=20, k=3 | Random search | 5/5 | 0.011s | -- | -- |
Alpha Sweep (n=20, k=3, seed=42)

| Alpha | Nonzero | Coef | Correct | Time |
|---|---|---|---|---|
| 0.001 | 1 | 0.999 | Yes | 0.018s |
| 0.01 | 1 | 0.990 | Yes | 0.009s |
| 0.05 | 1 | 0.950 | Yes | 0.012s |
| 0.1 | 1 | 0.900 | Yes | 0.008s |
| 0.2 | 1 | 0.800 | Yes | 0.007s |
| 0.5 | 1 | 0.500 | Yes | 0.007s |
| 1.0 | 0 | 0.000 | No | 0.008s |
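The Coef column tracks 1 - alpha exactly, which is what soft-thresholding predicts in a near-orthogonal +/-1 design: the LASSO solution is the OLS coefficient shrunk by alpha. A self-contained sketch of the sweep (hypothetical secret subset; seed handling and timings differ from the actual experiment script):

```python
import itertools

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n, k, m = 20, 3, 500
secret = (2, 9, 17)  # hypothetical secret subset for illustration
X = rng.choice([-1, 1], size=(m, n))
y = X[:, secret].prod(axis=1).astype(float)

subsets = list(itertools.combinations(range(n), k))
Phi = np.stack([X[:, s].prod(axis=1) for s in subsets], axis=1).astype(float)

# With +/-1 feature columns (unit mean-square norm), the LASSO solution
# is a soft-threshold of the OLS coefficient: recovered coef is
# max(1 - alpha, 0), so alpha >= 1 shrinks everything to zero.
results = {}
for alpha in [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]:
    w = Lasso(alpha=alpha, fit_intercept=False, max_iter=10_000).fit(Phi, y).coef_
    results[alpha] = (int((np.abs(w) > 1e-3).sum()), float(w.max()))
    print(alpha, results[alpha])
```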
Analysis
What worked
- Perfect accuracy on every config: 9/9 runs across all three configs and seeds. LASSO finds exactly 1 nonzero coefficient corresponding to the true secret subset.
- LASSO is robust to alpha: any value from 0.001 to 0.5 works. Only alpha=1.0 fails (shrinks everything to zero). The problem is so well-separated that LASSO barely needs tuning.
- LASSO only (no CV) is fast: 0.005s for n=20/k=3 and 0.15s for n=50/k=3. Competitive with Fourier when you skip cross-validation.
- Solves n=50,k=3 trivially -- the same config where SGD fails outright (54%).
- Mathematically elegant: the parity function y = x_{i1} * x_{i2} * x_{i3} is literally a linear function in the interaction basis. LASSO is the right tool for sparse linear recovery.
What didn't work
- LassoCV dominates wall time: 5-fold cross-validation is expensive. For n=50/k=3, LASSO itself takes 0.15s but LassoCV adds 12.7s. For n=20/k=5, LASSO takes 0.21s but CV adds 16.9s. CV is unnecessary here given LASSO's robustness to alpha.
- Comparable to Fourier, not faster: LASSO-only times roughly match Fourier's (0.005s vs 0.009s at n=20/k=3; 0.15s vs 0.16s at n=50/k=3), and both pay the same feature-expansion overhead. With CV included, LASSO is far slower; even without CV, Fourier wins on simplicity.
- L1 shrinkage biases the coefficient: at alpha=0.1 the recovered coefficient is 0.90 instead of 1.0. Not a problem for identification (largest coefficient wins), but the coefficient magnitude is systematically underestimated.
- Feature expansion is O(n_samples * C(n,k)): same combinatorial cost as Fourier. LASSO does not avoid the combinatorial blowup -- it just uses a different algorithm on the same feature matrix.
- ARD is high (861,076): the feature expansion creates a large matrix that must be read in full. Similar cache behavior to Fourier.
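The combinatorial cost is easy to make concrete: the feature counts in the tables, and the k=7/k=10 extrapolations raised under Open Questions, are just binomial coefficients.

```python
from math import comb

# Interaction-basis sizes C(n, k) for the configs in this report,
# plus the k=7 / k=10 extrapolations from Open Questions.
for n, k in [(20, 3), (50, 3), (20, 5), (20, 7), (20, 10)]:
    print(f"C({n},{k}) = {comb(n, k):,}")
```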
Surprise
- Exactly 1 nonzero coefficient every time: LASSO's sparsity is perfect here. In typical LASSO applications you get some spurious nonzeros, but the interaction features for different subsets are nearly orthogonal (uncorrelated when inputs are uniform {-1,+1}), so the irrepresentable condition holds and LASSO achieves exact support recovery.
- Alpha range is enormous: anything from 0.001 to 0.5 works (500x range). The true coefficient has magnitude 1.0 and all spurious correlations are ~0, giving a massive signal-to-noise ratio. LASSO barely needs to try.
- LassoCV always picks alpha=0.001: the smallest alpha in the default grid, because there is essentially no noise to regularize against.
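The near-orthogonality claim is cheap to check empirically: for uniform +/-1 inputs, the elementwise product of two distinct interaction columns is itself a parity over the symmetric difference of the two subsets, so off-diagonal Gram entries concentrate around 0 at rate ~1/sqrt(n_samples). A small-scale sketch (n=12 chosen to keep the Gram matrix tiny):

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n, k, m = 12, 3, 2000
X = rng.choice([-1, 1], size=(m, n))

subsets = list(itertools.combinations(range(n), k))  # C(12, 3) = 220
Phi = np.stack([X[:, s].prod(axis=1) for s in subsets], axis=1)

# Normalized Gram matrix: diagonal is exactly 1 (columns are +/-1);
# off-diagonal entries are empirical correlations between distinct
# parity features and should be O(1/sqrt(m)).
G = Phi.T @ Phi / m
max_off = np.abs(G - np.eye(len(subsets))).max()
print(max_off)  # roughly a few multiples of 1/sqrt(2000) ~ 0.02
```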
Open Questions
- Can we skip the full feature expansion and instead use a screening rule? (e.g., compute marginal correlations first, then only expand promising subsets)
- How does LASSO scale to k=7 or k=10? C(20,7) = 77,520 and C(20,10) = 184,756 features -- the expansion itself becomes the bottleneck
- Elastic Net (L1 + L2) comparison: does adding L2 help or hurt when the true solution is maximally sparse?
- The LASSO path (LARS algorithm) could be faster than coordinate descent for this problem, since there is exactly 1 nonzero coefficient
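On the screening question, one hypothetical direction: since the true correlation is 1 and spurious ones are O(1/sqrt(n_samples)), the solver can be skipped entirely by streaming over subsets and keeping the argmax of the marginal correlation. This is essentially the degree-k Fourier coefficient scan from the Key Table, computed without materializing the feature matrix (a sketch; `find_secret_streaming` and the secret subset are illustrative, not from the experiment code):

```python
import itertools

import numpy as np

def find_secret_streaming(X, y, k):
    """Scan all C(n, k) subsets one at a time, never materializing the
    full interaction matrix; return the subset whose parity feature has
    the largest |correlation| with y."""
    m, n = X.shape
    best, best_val = None, -1.0
    for s in itertools.combinations(range(n), k):
        c = abs(np.dot(X[:, s].prod(axis=1), y)) / m
        if c > best_val:
            best, best_val = s, c
    return best

rng = np.random.default_rng(1)
X = rng.choice([-1, 1], size=(500, 20))
secret = (2, 9, 17)  # hypothetical secret subset
y = X[:, secret].prod(axis=1)
assert find_secret_streaming(X, y, 3) == secret
```

This keeps memory at O(n_samples * n) instead of O(n_samples * C(n, k)), but still pays the C(n, k) time cost, which is the real bottleneck the open question is asking about.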
Files
- Experiment: src/sparse_parity/experiments/exp_lasso.py
- Results: results/exp_lasso/results.json