Experiment exp_rl: Reinforcement Learning Bit Querying for Sparse Parity¶
Date: 2026-03-06 Status: SUCCESS Answers: Can an RL agent learn which k bits to query, reframing sparse parity as "learning what to look at"?
Hypothesis¶
An RL agent can learn to query exactly the k secret bits by observing rewards for correct/incorrect parity predictions. Two formulations: 1. Bandit over k-subsets (UCB1): each arm is a k-subset, pull = evaluate accuracy on a batch, UCB1 for exploration. Should find the correct arm after ~C(n,k) pulls (one per arm). 2. Sequential bit-querying (Q-learning): at each step pick a bit to query, after k queries predict label = product of values. The agent must learn the optimal query policy through trial and error.
Config¶
| Parameter | Value |
|---|---|
| n_bits | 10, 20 |
| k_sparse | 3 |
| n_train | 500 |
| seeds | 42, 43, 44 |
| Bandit episodes | 1000 (n=10), 3000 (n=20) |
| Seq QL episodes | 20000 (n=10), 50000 (n=20) |
| Seq QL alpha | 0.2 |
| Seq QL epsilon | 1.0 -> 0.02 (decay 0.9995) |
Results¶
| Config | C(n,k) | Bandit: found at ep | Bandit: time | SeqQL: converged at ep | SeqQL: time | Both correct |
|---|---|---|---|---|---|---|
| n=10, k=3 | 120 | 120 (all seeds) | 0.034s avg | 3656, 4169, 4377 | 0.61s avg | 6/6 |
| n=20, k=3 | 1,140 | 1140 (all seeds) | 0.12s avg | 9526, 11907, 5122 | 2.40s avg | 6/6 |
Memory Efficiency¶
| Config | Method | ARD (floats) | Reads | Writes | Reads per prediction |
|---|---|---|---|---|---|
| n=10, k=3 | Bandit UCB | 16,022 | 1,000 | 1,000 | k*batch = 150 |
| n=10, k=3 | Seq QL | 1 | 60,000 | 60,000 | k = 3 |
| n=20, k=3 | Bandit UCB | 83,396 | 3,000 | 3,000 | k*batch = 150 |
| n=20, k=3 | Seq QL | 1 | 150,000 | 150,000 | k = 3 |
Analysis¶
What worked¶
- Both approaches recover the exact secret bits in all 6 runs (3 seeds x 2 configs), with 100% test accuracy.
- Bandit UCB is reliable and fast: finds the correct arm after exactly C(n,k) episodes (one per arm in the initialization sweep). The UCB exploration bonus then confirms the best arm. Wall time is 0.03-0.12s.
- Sequential Q-learning succeeds with value-blind state: the critical design decision is using only the set of queried bit indices as state (not their values). This reduces the state space from exponential to sum_{j=0}^{k-1} C(n,j), which is 56 for n=10/k=3 and 211 for n=20/k=3.
- Sequential agent achieves minimal ARD: each prediction reads exactly k=3 bits. The ARD of 1 float means each bit read is immediately used (no cache miss). This is the theoretical minimum for energy-efficient parity computation.
What didn't work¶
- Value-aware Q-learning fails completely: the initial implementation tracked (queried_bits, queried_values) as state, giving ~50% accuracy (random chance). The state space explodes to O(n^k * 2^k) and the Q-table never converges.
- Sequential Q-learning is slower than bandit: 4000-12000 episodes to converge vs. ~C(n,k) for bandit. The sequential agent must learn a policy over k steps, whereas the bandit directly evaluates complete subsets.
- Bandit UCB has high ARD: each arm pull evaluates a batch of 50 samples, reading k*50=150 floats per pull. This is necessary for accurate reward estimation but is much less memory-efficient than the sequential agent's 3 reads per prediction.
Key insight: the information-theoretic view¶
The bandit finds the correct arm after exactly C(n,k) pulls because it must try each arm at least once. This matches the information-theoretic lower bound: distinguishing 1 correct subset from C(n,k)-1 wrong ones requires at least C(n,k) observations. The sequential agent is slower because it explores incrementally, but once converged, it achieves the optimal k reads per prediction.
Comparison with other approaches¶
| Method | n=10/k=3 time | n=20/k=3 time | Reads per prediction |
|---|---|---|---|
| Bandit UCB | 0.034s | 0.12s | k*batch (150) |
| Sequential QL | 0.61s | 2.40s | k (3) -- optimal |
| Random search (exp_evolutionary) | ~0.01s | ~0.01s | n_train*k (1500) |
| SGD neural net | 0.12s | 0.12s | n (all bits) |
The sequential QL agent is the only approach that achieves the theoretical minimum of k reads per prediction at inference time. All other methods read more data per prediction.
Open Questions¶
- Can the sequential agent scale to k=5 or larger? The value-blind state space grows as sum C(n,j) for j < k, which is still polynomial for small k.
- A neural network policy (DQN) could replace the Q-table and potentially handle larger n by generalizing across states.
- Hybrid: use bandit to identify candidate subsets, then train a sequential agent restricted to top candidates.
Files¶
- Experiment:
src/sparse_parity/experiments/exp_rl.py - Results:
results/exp_rl/results.json