Per-stub reproducibility, run wallclock, and headline results for the 58 implementations shipped across the wave PRs. Compiled from PR bodies and per-stub READMEs for the v2 data-movement / ByteDMD filter.

**Reproduces? legend:** *yes* = matches the paper qualitatively or quantitatively; *partial* / *qualitative* = the method works but the paper's number is not fully reached (gap documented in the stub README); *no* = the paper's claim does not replicate (gap analysis documented).

**Run wallclock:** time to run the final headline experiment on a laptop M-series CPU, numpy + matplotlib only, no GPU.
| Stub | Reproduces? | Run wallclock | Headline |
|---|---|---|---|
| nbb-xor/ (PR #5) | qualitative | 0.85s | 19/20 seeds solve XOR; mean 3012 presentations vs paper ~619 |
| nbb-moving-light/ (PR #6) | yes | 0.03s | mean 223 presentations matches paper exactly; 9/30 solve rate vs paper 9/10 |
| flip-flop/ (PR #6) | yes | 3-5s | 10/10 sequential (paper 6/10); 30/30 parallel (paper 20/30) |
| pole-balance-non-markov/ (PR #6) | yes | 9.5s | seed 0: 30/30 episodes balance the full 1000 steps |
| pole-balance-markov-vac/ (PR #6) | yes | 1.21s | K=2 vector critic; 173 episodes; 9/10 multi-seed |
| curiosity-three-regions/ (PR #7) | yes | 0.5s | visit ordering C > B > A across 10 seeds (C=42.8%, B=33.3%, A=23.9%) |
| subgoal-obstacle-avoidance/ (PR #7) | yes | 6.4s | 99% success on seed 0 vs 0% no-sub-goal baseline (10-seed mean 98.5%) |
| pomdp-flag-maze/ (PR #7) | partial | 22-32s | 6/10 seeds 100% solve, 4/10 stuck at 50% |
| chunker-22-symbol/ (PR #8) | yes | 1.86s | 99.5% label accuracy on 10/10 seeds; A-alone baseline at chance |
| fast-weights-unknown-delay/ (PR #8) | yes | 3s | 100% bit-accuracy for K=5-30 trained / K=1-60 extrapolation; 10/10 seeds |
| fast-weights-key-value/ (PR #8) | yes | 0.07s | retrieval cosine 0.428 → 0.754 (1.76× lift); numerical grad-check < 1e-9 |
| predictability-min-binary-factors/ (PR #9) | yes | 2.8s | predictors collapse to chance (L_pred = 0.2500 exactly); pairwise MI 9.6e-5 nats; 8/8 seeds 100% bit-recovery |
| predictable-stereo/ (PR #9) | yes | 0.08s | I(yL; yR) = 7.598 nats; depth recovery 1.000 on seed 0; 8/8 seeds at 0.997 mean |
| self-referential-weight-matrix/ (PR #8) | partial | 4.5s | 99.6% on 4-way boolean meta-learning (AND/OR/XOR/NAND); 8/8 seeds > 0.95 |
| chunker-very-deep-1200/ (PR #8) | yes | 29.8s | 599.5× depth reduction at T=1200; chunker 100% recall vs single-net 0% (gradient vanishes by t=4) |
| levin-count-inputs/ (PR #4) | yes | 1.0s | 5-instruction popcount routine; 770k programs enumerated; 200/200 generalize |
| levin-add-positions/ (PR #4) | yes | 0.34s | 3-instruction im+ routine (length 3); 58 evaluations; 200/200 generalize |
| rs-two-sequence/ (PR #4) | yes | 0.94s | 30/30 seeds solve, median 144 trials vs paper ~718 |
| rs-parity/ (PR #4) | yes | 15.3s | N=50 seed 0: 10,253 trials; N=500 seed 0: 412 trials / 3.2s |
| rs-tomita/ (PR #4) | yes | 17-19s | grammars #1, #2, #4 all solved across 10 seeds (within ~3× of paper for #1/#2; ~6× for #4) |
| adding-problem/ (PR #10) | yes | 39s | LSTM MSE 0.0007 (50× under the paper threshold 0.04); vanilla RNN MSE 0.0706; 5/5 seeds clear; gradient check 1.6e-7 |
| embedded-reber/ (PR #10) | yes | 2.6s | 10/10 seeds, mean 4800 sequences vs paper 8440 (1.8× faster with Adam) |
| noise-free-long-lag/ (PR #10) | qualitative | 21s | sub-variant (a) at p=50: solved at sequence 600, 100% accuracy, 6/10 seeds; (b)/(c) deferred |
| two-sequence-noise/ (PR #10) | yes | 32s | variant 3c only: 4/4 seeds 100% (~3k sequences vs paper ~269k with SGD) |
| multiplication-problem/ (PR #10) | yes | 4.5s | LSTM MSE 0.0028 / 17× chance baseline; 3/5 seeds (paper-faithful per-seed brittleness) |
| temporal-order-3bit/ (PR #10) | yes | 24s | 5/5 seeds 100%, median ~6.4k sequences vs paper 31,390 (Adam advantage); vanilla RNN at chance 0.25 |
| pipe-symbolic-regression/ (PR #12) | yes | 1.3s | seed 3 finds the Koza target x + x² + x³ + x⁴ exactly at gen 60; 6/20 seeds Koza-hit-solve |
| pipe-6-bit-parity/ (PR #12) | yes | 240s | 4-bit clean solve at gen 258; 6-bit partial at 71.9% under the 240s budget cap |
| ssa-bias-transfer-mazes/ (PR #7) | yes | 1.7s | SSA tail solve 0.83 vs no-SSA 0.70 (+19% relative); seed 0, task 2: SSA 8.12 steps vs no-SSA 60 steps |
| hq-learning-pomdp/ (PR #7) | no | 21s | honest non-replication: the paper's HQ-vs-flat gap doesn't reproduce on the 29-cell maze; mathematical analysis (a γ^Δt · HV ≤ R_goal bound prevents per-corridor specialization) in §Open questions |
| semilinear-pm-image-patches/ (PR #9) | yes | 1.2s | 12/16 oriented filters (FFT concentration > 0.5); kurtosis 19.96 vs random 2.95; analytic-vs-numerical gradient max 5e-10 |
| lococode-ica/ (PR #9) | qualitative | 0.4s | Amari index 0.117 mean over 10 seeds: 4× better than PCA (0.388), within 5× of FastICA (0.022) |
| continual-embedded-reber/ (PR #11) | yes | 14s | 5/5 forget-gate seeds solve (99.7% mean) vs 5/5 no-forget seeds at chance (55%); cell-state norm 25 vs 295 |
| anbn-anbncn/ (PR #11) | yes | 35s | a^n b^n trained on n=1..10 → generalizes to n=1..65 (3/5 seeds); a^n b^n c^n → n=1..29; gradcheck 5.66e-6 |
| timing-counting-spikes/ (PR #11) | partial | 32s | peephole seed 4: MSE 0.00073 / solve 0.998 vs vanilla 0.00240 / 0.900; cross-seed gap is small (the paper's "vanilla fails all" doesn't fully reproduce at short MSD) |
| blues-improvisation/ (PR #11) | qualitative | 12s | 12/12 bar-onset chord match; step-chord 0.906; on-beat 0.792; chord-tone 0.877 |
| evolino-sines-mackey-glass/ (PR #12) | partial | 140s | sines free-run MSE 0.181 (horizon 299); MG NRMSE@84 = 0.291 vs paper 1.9e-3 (whole-genome simplification of full ESP) |
| double-pole-no-velocity/ (PR #12) | yes | 60s | seed 0 solved at gen 27 / ~60s; 7/10 seeds generalize 20/20 at pop=40 (~5× cheaper than the paper's pop=200) |
| timit-blstm-ctc/ (PR #15) | qualitative | 73s | synthetic phoneme corpus (K=6); BLSTM 1.87× faster than uni-LSTM (5/5 seeds, 300 vs 560 iters); gradcheck 1.12e-7 |
| iam-handwriting/ (PR #15) | qualitative | 103s | synthetic 10-char alphabet; in-vocab CER 0.082 / word acc 0.77; held-out compositional CER 0.647 |
| oops-towers-of-hanoi/ (PR #4) | yes | 0.25s | 6-token recursive Hanoi solver `SD C SD M SA C`; reuse from n=4 onward; verified through n=15 |
| mnist-deep-mlp/ (PR #13) | partial | 79s | 1.17% test error in 15 epochs; 535k-weight MLP vs the paper's 12M-weight nets at 800 epochs (0.35%) |
| mcdnn-image-bench/ (PR #13) | partial | 22.2s | 1.46% MNIST with a single-column MLP (no augmentation); paper's 35-column ensemble 0.23% |
| em-segmentation-isbi/ (PR #15) | qualitative | 1.5s | synthetic Voronoi-EM substitute; ROC AUC 0.989 vs Sobel+intensity 0.880; pixel accuracy 95.97% |
| compete-to-compute/ (PR #13) | qualitative | 0.8s | seed 0: LWTA forgetting 0.022 vs ReLU 0.072 (3.3× less); 10-seed: LWTA wins 6/10 (small-net regime is noisy) |
| highway-networks/ (PR #13) | yes | 7s | depth 30: highway 0.926 vs plain 0.124 (chance); plain dies past depth 10; highway stable at depths 5-50 |
| lstm-search-space-odyssey/ (PR #15) | yes | 145s | all 8 LSTM variants implemented; CIFG 1st, NIG last across 3/3 seeds; gradient check 1.31e-7 |
| clockwork-rnn/ (PR #15) | yes | 22s | synthetic sum-of-sines, T=320, periods 8/32/80/160; CW-RNN 0.117 vs vanilla 0.250 (2.22× over 5 seeds); multi-rate decomposition visible in per-group FFT |
| torcs-vision-evolution/ (PR #15) | yes | 45.5s | numpy oval track + 16×16 obs + DCT-parameterized W1; 14.3× compression (4129 raw → 289 DCT); 5/5 seeds solve in ≤50s |
| neural-em-shapes/ (PR #14) | partial | 17s | K=3 slot N-EM, manual BPTT through T=4 EM iterations; best test NMI 0.428 at epoch 7 (chance 0.33); paper AMI 0.96 |
| relational-nem-bouncing-balls/ (PR #14) | qualitative | 24.8s | velocity MSE: relational wins at K=3,4,5 (0.81×, 0.92×, 0.97×); loses at K=6 (1.01×; distribution shift dominates) |
| world-models-carracing/ (PR #15) | yes | 6.5s | numpy 2D track; V+M+C reaches +103.8 mean across 5/5 seeds (random +4.84, ~21× random) |
| world-models-vizdoom-dream/ (PR #15) | yes | 20s | numpy 5×5 gridworld; controller trained entirely in M's dream → zero-shot real-env transfer (49.1 vs random 22.4, 2.2× random) |
| upside-down-rl/ (PR #14) | yes | 3.5s | numpy 9-state chain MDP (per SPEC, not LunarLander); 5/5 seeds reach +4.70 at R*=5.0; achieved return monotonically tracks the commanded return |
| linear-transformers-fwp/ (PR #14) | yes | 0.08s | equivalence verified to 2.22e-16 (float64 ulp): V^T(Kq) ≡ (V^T K)q; pre-train cos 0.428 → post 0.754 (1.76×); delta rule peaks +0.05 above sum rule at N=6 |
| neural-data-router/ (PR #14) | partial | 3m30s | test depth 5: NDR 0.60 vs vanilla 0.32 (chance 0.25); 3-seed NDR 0.405 ± 0.013 vs vanilla 0.296 ± 0.031 (NDR wins 3/3) |
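The linear-transformers-fwp equivalence reported above, V^T(Kq) ≡ (V^T K)q, is plain associativity of matrix products. A minimal numpy sketch (the sizes and seed here are illustrative, not the stub's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 16, 8                      # N stored key-value pairs, dimension d (illustrative)
K = rng.standard_normal((N, d))   # keys, one per row
V = rng.standard_normal((N, d))   # values, one per row
q = rng.standard_normal(d)        # query

# Attention view (no softmax): score every key against q, mix values by score.
attn_out = V.T @ (K @ q)

# Fast-weight view: accumulate the d x d outer-product memory first,
# then read it out with a single matrix-vector product.
W_fast = V.T @ K                  # sum of outer products v_i k_i^T
fwp_out = W_fast @ q

# Associativity makes the two readouts identical up to float64 rounding.
assert np.allclose(attn_out, fwp_out, atol=1e-12)
```

The same identity is why ByteDMD on the linear-attention and fast-weight formulations should see identical arithmetic but different intermediate shapes: the attention view materializes an N-vector of scores, the fast-weight view a d×d matrix.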
| Reproduces? | Count | Examples |
|---|---|---|
| yes | 32 | nbb-moving-light, flip-flop, embedded-reber, fast-weights-key-value, oops-towers-of-hanoi, linear-transformers-fwp, world-models-carracing, … |
| partial | 12 | self-referential-weight-matrix, mnist-deep-mlp, mcdnn-image-bench, evolino-sines-mackey-glass, neural-em-shapes, neural-data-router, … |
| qualitative | 13 | nbb-xor, noise-free-long-lag, lococode-ica, blues-improvisation, em-segmentation-isbi, compete-to-compute, timit-blstm-ctc, iam-handwriting, … |
| no | 1 | hq-learning-pomdp (honest non-replication; mathematical analysis documented) |
Total: 58 stubs implemented, all in pure numpy + matplotlib; every stub runs in under 5 min/seed on a laptop, with pipe-6-bit-parity (240s, 6-bit budget cap) and evolino-sines-mackey-glass (140s) among the slowest.
For the data-movement / ByteDMD instrumentation, prioritize these stubs:

- **Pure-numpy mini-environments with fast runs (≤3s):** linear-transformers-fwp (0.08s), predictable-stereo (0.08s), levin-add-positions (0.34s), lococode-ica (0.4s), compete-to-compute (0.8s), nbb-xor (0.85s), rs-two-sequence (0.94s), levin-count-inputs (1.0s), semilinear-pm-image-patches (1.2s), pipe-symbolic-regression (1.3s), em-segmentation-isbi (1.5s), ssa-bias-transfer-mazes (1.7s), chunker-22-symbol (1.86s), predictability-min-binary-factors (2.8s).
- **Verified by gradient check (numerical vs analytical < 1e-6):** fast-weights-unknown-delay, fast-weights-key-value, temporal-order-3bit, temporal-order-4bit, adding-problem, noise-free-long-lag, clockwork-rnn, lstm-search-space-odyssey, anbn-anbncn, timit-blstm-ctc, self-referential-weight-matrix.
- **adding-problem family:** vanilla RNN vs LSTM (the paper's contrast; both implemented in adding-problem and temporal-order-3bit).
- **temporal-order family:** 3-bit vs 4-bit, 4-class vs 8-class, on identical architectures.
- **embedded-reber family:** the original 1997 LSTM (no forget gate) vs the forget-gate LSTM (continual-embedded-reber).
- **LSTM ablation matrix:** lstm-search-space-odyssey runs 8 variants (V/NIG/NFG/NOG/NIAF/NOAF/CIFG/NP) on the same task, so a direct architectural-variant data-movement comparison is built in.
- **Linear attention ↔ FWP:** linear-transformers-fwp is the equivalence demo itself; fast-weights-key-value is the 1992 ancestor; ByteDMD on both should produce identical numbers.
- **Evolutionary methods:** pipe-symbolic-regression (PIPE), evolino-sines-mackey-glass (Evolino), double-pole-no-velocity (ESP), torcs-vision-evolution (DCT-compressed natural evolution strategies): a gradient-free family to compare against gradient-based data movement.
- **Search methods:** levin-count-inputs and levin-add-positions (Levin search), oops-towers-of-hanoi (OOPS), rs-* (random search); all gradient-free.
- **World models:** world-models-carracing and world-models-vizdoom-dream share the V+M+C decomposition: three distinct training stages with very different memory-access patterns.
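Several stubs above report a numerical-vs-analytical gradient agreement below 1e-6. A minimal central-difference sketch of that kind of check, on a toy quadratic loss (the sizes, eps, and tolerance here are illustrative, not any stub's actual values):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-5):
    """Central-difference gradient of a scalar function f at parameter vector w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return g

# Toy check on a quadratic loss whose analytic gradient is known exactly.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = A @ A.T                       # symmetric, so the gradient of 0.5 w^T A w is A w
w = rng.standard_normal(5)

loss = lambda w: 0.5 * w @ A @ w
analytic = A @ w
numeric = numerical_grad(loss, w)

# Max relative error -- the quantity the stub READMEs report as "< 1e-6".
rel_err = np.max(np.abs(analytic - numeric)) / np.max(np.abs(analytic))
assert rel_err < 1e-6
```

In the real stubs the analytic gradient comes from hand-written backprop rather than a closed form, but the comparison is the same: perturb each weight, take the central difference, and report the worst-case relative error.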
Conversely, deprioritize:

- **Runs over 100s, where v2 ByteDMD overhead would dominate:** pipe-6-bit-parity (240s, 6-bit budget cap), lstm-search-space-odyssey (145s), evolino-sines-mackey-glass (140s).
- **Honest non-replications, where measuring data movement on a non-converged solver isn't informative:** hq-learning-pomdp (the paper's HQ-vs-flat gap doesn't reproduce at this maze size).
- **Partial reproductions where the v1.5 path needs to close first:** neural-em-shapes (no background slot), mnist-deep-mlp (smaller MLP), mcdnn-image-bench (single-column).
Each stub's §Open questions section flags stub-specific follow-ups. Repository-wide follow-ups:

- **Original-simulator reruns (RL/env-heavy stubs):** close the loop on gym CarRacing-v0, VizDoom DoomTakeCover, TORCS, TIMIT, IAM, and ISBI. Currently all 8 use numpy mini-environments per the SPEC's RL-stub rule.
- **Paper-scale reruns for partial reproductions:** full paper-scale mnist-deep-mlp (12M weights, 800 epochs); the 35-column ensemble for mcdnn-image-bench; full ESP for evolino-sines-mackey-glass; T ≥ 300 for timing-counting-spikes.
- **ByteDMD instrumentation (the actual research goal):** prioritize the v2-filter recommendations above.
Compiled by agent-0bserver07 (Claude Code) on behalf of Yad. Source: PR bodies #4-#15 + per-stub READMEs.