Skip to content

Experiment exp_rl: Reinforcement Learning Bit Querying for Sparse Parity

Date: 2026-03-06 Status: SUCCESS Answers: Can an RL agent learn which k bits to query, reframing sparse parity as "learning what to look at"?

Hypothesis

An RL agent can learn to query exactly the k secret bits by observing rewards for correct/incorrect parity predictions. Two formulations: 1. Bandit over k-subsets (UCB1): each arm is a k-subset, pull = evaluate accuracy on a batch, UCB1 for exploration. Should find the correct arm after ~C(n,k) pulls (one per arm). 2. Sequential bit-querying (Q-learning): at each step pick a bit to query, after k queries predict label = product of values. The agent must learn the optimal query policy through trial and error.

Config

Parameter Value
n_bits 10, 20
k_sparse 3
n_train 500
seeds 42, 43, 44
Bandit episodes 1000 (n=10), 3000 (n=20)
Seq QL episodes 20000 (n=10), 50000 (n=20)
Seq QL alpha 0.2
Seq QL epsilon 1.0 -> 0.02 (decay 0.9995)

Results

Config C(n,k) Bandit: found at ep Bandit: time SeqQL: converged at ep SeqQL: time Both correct
n=10, k=3 120 120 (all seeds) 0.034s avg 3656, 4169, 4377 0.61s avg 6/6
n=20, k=3 1,140 1140 (all seeds) 0.12s avg 9526, 11907, 5122 2.40s avg 6/6

Memory Efficiency

Config Method ARD (floats) Reads Writes Reads per prediction
n=10, k=3 Bandit UCB 16,022 1,000 1,000 k*batch = 150
n=10, k=3 Seq QL 1 60,000 60,000 k = 3
n=20, k=3 Bandit UCB 83,396 3,000 3,000 k*batch = 150
n=20, k=3 Seq QL 1 150,000 150,000 k = 3

Analysis

What worked

  • Both approaches recover the exact secret bits in all 6 runs (3 seeds x 2 configs), with 100% test accuracy.
  • Bandit UCB is reliable and fast: finds the correct arm after exactly C(n,k) episodes (one per arm in the initialization sweep). The UCB exploration bonus then confirms the best arm. Wall time is 0.03-0.12s.
  • Sequential Q-learning succeeds with value-blind state: the critical design decision is using only the set of queried bit indices as state (not their values). This reduces the state space from exponential to sum_{j=0}^{k-1} C(n,j), which is 56 for n=10/k=3 and 211 for n=20/k=3.
  • Sequential agent achieves minimal ARD: each prediction reads exactly k=3 bits. The ARD of 1 float means each bit read is immediately used (no cache miss). This is the theoretical minimum for energy-efficient parity computation.

What didn't work

  • Value-aware Q-learning fails completely: the initial implementation tracked (queried_bits, queried_values) as state, giving ~50% accuracy (random chance). The state space explodes to O(n^k * 2^k) and the Q-table never converges.
  • Sequential Q-learning is slower than bandit: 4000-12000 episodes to converge vs. ~C(n,k) for bandit. The sequential agent must learn a policy over k steps, whereas the bandit directly evaluates complete subsets.
  • Bandit UCB has high ARD: each arm pull evaluates a batch of 50 samples, reading k*50=150 floats per pull. This is necessary for accurate reward estimation but is much less memory-efficient than the sequential agent's 3 reads per prediction.

Key insight: the information-theoretic view

The bandit finds the correct arm after exactly C(n,k) pulls because it must try each arm at least once. This matches the information-theoretic lower bound: distinguishing 1 correct subset from C(n,k)-1 wrong ones requires at least C(n,k) observations. The sequential agent is slower because it explores incrementally, but once converged, it achieves the optimal k reads per prediction.

Comparison with other approaches

Method n=10/k=3 time n=20/k=3 time Reads per prediction
Bandit UCB 0.034s 0.12s k*batch (150)
Sequential QL 0.61s 2.40s k (3) -- optimal
Random search (exp_evolutionary) ~0.01s ~0.01s n_train*k (1500)
SGD neural net 0.12s 0.12s n (all bits)

The sequential QL agent is the only approach that achieves the theoretical minimum of k reads per prediction at inference time. All other methods read more data per prediction.

Open Questions

  • Can the sequential agent scale to k=5 or larger? The value-blind state space grows as sum C(n,j) for j < k, which is still polynomial for small k.
  • A neural network policy (DQN) could replace the Q-table and potentially handle larger n by generalizing across states.
  • Hybrid: use bandit to identify candidate subsets, then train a sequential agent restricted to top candidates.

Files

  • Experiment: src/sparse_parity/experiments/exp_rl.py
  • Results: results/exp_rl/results.json