Visual tour
A picture-first walk through all 58 v1+v1.5 implementations. The README has a 4-GIF teaser and the result tables; this page is the long form — every stub, in catalog order, with its training animation and a short note on what the visualization is meant to show.
For per-stub metrics (wall-clock runtime, headline numbers) see RESULTS.md. For the experimental design of any single stub, follow its folder link to that folder's README.md.
How to read this page
GIFs vs static figures. Each stub commits an animated GIF
(<slug>.gif) of training and a viz/ folder of static PNGs. The GIF
exists to show learning dynamics — order-of-emergence, plateaus,
phase-transitions, controller rollouts. The static PNGs in viz/ exist
to show the final state in higher resolution: training curves, weight
matrices, attention maps, attractor portraits.
Algorithmic faithfulness. Every stub uses the actual algorithm the paper introduces — NBB local rule, BPTT through LSTM cells, peephole LSTM, PIPE on a probabilistic prototype tree, ESP co-evolution, FWP outer-product writes, Levin universal search, etc. The §Deviations section in each stub’s README enumerates every place the implementation deviates from the paper’s specifics (architecture sizes, optimizer choice, dataset substitution).
RL-stub rule. Per the SPEC, RL/env-heavy stubs use numpy
mini-environments that capture the algorithmic claim of the original
paper, not the original simulator. Affects pole-balance-*,
pomdp-flag-maze, world-models-*, torcs-vision-evolution,
upside-down-rl, double-pole-no-velocity. Always documented in
§Deviations.
Table of contents
- 1980s — Local rules and the Neural Bucket Brigade
- 1990 — Controller + world-model + flip-flop
- 1991 — Curiosity, subgoals, the chunker
- 1992 — Neural Computation triple
- 1993 — Predictable classifications, self-reference, very deep chunking
- 1995–1997 — Levin search and the LSTM benchmark suite
- Mid-90s — Evolutionary, RL, and feature detection
- 2000–2002 — LSTM follow-ups
- 2002–2010 — Evolutionary RL, OOPS, BLSTM+CTC
- 2010–2017 — Deep learning at scale
- 2018–2025 — World models, fast-weight Transformers, systematic generalization
1980s — Local rules and the Neural Bucket Brigade
Schmidhuber (1989) — A local learning algorithm for dynamic feedforward and recurrent networks
nbb-xor

XOR via the Neural Bucket Brigade — a strictly local-in-space-and-time, winner-take-all, dissipative learning rule. There is no backprop, no RTRL, no gradient. This is the wave-0 sanity validator: it demonstrates that WTA plus bucket-brigade dissipation can solve XOR before the same local credit-assignment rule is applied to recurrent tasks.
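A toy single-layer sketch of the rule's flavour, not the stub's exact implementation: the WTA winner's incoming weights share the external payoff in proportion to their contribution, and a small decay stands in for the dissipative term (the real NBB also moves weight substance between layers and across the recurrent loop, as the next stub shows).

```python
import numpy as np

def nbb_toy_step(W, x, payoff, decay=0.01):
    """One toy update: WTA, then contribution-proportional payoff sharing plus decay."""
    s = W @ x
    j = int(np.argmax(s))                                 # winner-take-all output
    contrib = W[j] * x                                    # each weight's share of the win
    W[j] += payoff * contrib / (contrib.sum() + 1e-12)    # redistribution denominator
    W *= (1.0 - decay)                                    # dissipative "tax" on all substance
    return j

rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (2, 3))                         # non-negative "weight substance"
winner = nbb_toy_step(W, np.array([1.0, 0.0, 1.0]), payoff=1.0)
```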
nbb-moving-light

1-D moving-light direction discrimination via the same NBB rule extended to a small fully-recurrent net (5 retina cells + bias → 2 output units forming a WTA subset). The redistribution denominator sums over both feedforward AND recurrent predecessors of each output (substance conservation across the recurrent loop).
1990 — Controller + world-model + flip-flop
Schmidhuber (1990) — Making the world differentiable
flip-flop

The 1990 paper sets up a tiny non-stationary control task that has all the ingredients of the long-time-lag problem Hochreiter would later formalise as the vanishing-gradient barrier. Two-network setup: world-model M predicts pain from (obs, action); controller C trained by BP through frozen M to reduce future pain. Pain is the only feedback signal — no labeled targets to C.
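A minimal sketch of the gradient path, assuming linear stand-ins for both networks (the stub's C and M are small nonlinear nets trained over many steps): the controller's weights are updated by the chain rule through the frozen model's action-to-pain sensitivity.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim = 3, 1
W_c = rng.normal(0, 0.1, (act_dim, obs_dim))        # controller C (the only thing updated here)
W_m = rng.normal(0, 0.1, (1, obs_dim + act_dim))    # world model M, frozen for this step

def bp_through_model(obs, lr=0.1):
    a = W_c @ obs                                   # C picks an action
    pain_hat = W_m @ np.concatenate([obs, a])       # M predicts pain from (obs, action)
    dpain_da = W_m[:, obs_dim:]                     # M's sensitivity of pain to the action
    grad_Wc = dpain_da.T @ obs[None, :]             # chain rule: d pain_hat / d W_c
    W_c[:] -= lr * grad_Wc                          # descend predicted pain; M untouched
    return pain_hat.item()
```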
pole-balance-non-markov

Cart-pole balancing where the controller observes only positions, not velocities. The 4-D real state is (x, x_dot, θ, θ_dot), but C only sees (x, θ). M predicts next observed positions from action + history; C trained by BP through M’s gradient. Iterative model-learning cycles (3×) — without them, balance caps at ~150 steps; with them, full 1000-step balance.
Schmidhuber (1990) — Recurrent networks adjusted by adaptive critics
pole-balance-markov-vac

Standard cart-pole, Markov regime: the controller observes the full state at every step. K=2 vector-valued critic with two qualitatively distinct components (V_pole saturates near 1/(1−γ)=100; V_cart tracks the live margin 1−|x|/2.4). The vector critic is the paper's central claim — a generalisation of the scalar adaptive heuristic critic.
Schmidhuber & Huber (1990) — Learning to generate focus trajectories
saccadic-target-detection

Active visual attention. The controller must move a small fovea over a 2-D scene to find a target halo, given only the local pixels under the fovea. C is feedforward; M predicts the change in halo at the next fovea position. The key fix was a bilinear centroid ⊗ action feature in M's input plus a Δhalo regression target (a binary hit indicator yields a ~2% positive rate and no useful gradient).
1991 — Curiosity, subgoals, the chunker
Schmidhuber (1991) — Adaptive confidence and adaptive curiosity
curiosity-three-regions

A 1-D environment partitioned into three regions: deterministic / random / learnable-but-unlearned. Curiosity reward = windowed reduction in M’s prediction error. Visit ordering C > B > A holds 100% across 10 seeds — the agent gravitates to the learnable-but-unlearned region.
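A minimal sketch of the reward signal, assuming a fixed window length (the stub's actual window and normalisation are in its README): curiosity pays for a drop in M's recent prediction error, so purely random regions (error never drops) and already-mastered regions (error already near zero) both pay nothing.

```python
import numpy as np
from collections import deque

errors = deque(maxlen=20)                      # recent prediction errors of world model M

def curiosity_reward(new_error):
    old_mean = np.mean(errors) if errors else new_error
    errors.append(new_error)
    return max(0.0, old_mean - new_error)      # reward only an actual error reduction
```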
Schmidhuber (1991) — Learning to generate sub-goals for action sequences
subgoal-obstacle-avoidance

Hierarchical RL: a sub-goal generator C_high proposes K=2 waypoints; a low-level controller C_low (intentionally obstacle-blind, input = rel_target only) steers toward each. The cost gradient flows through a closed-form differentiable cost-model M back into C_high. 99% success vs 0% for the direct no-sub-goal baseline.
Schmidhuber (1991) — Reinforcement learning in Markovian and non-Markovian environments
pomdp-flag-maze

A 2-D T-maze with a hidden flag. The agent observes only its local 4-wall context plus a 1-bit indicator that is non-zero ONLY at the start cell. Recurrent M+C architecture must latch the indicator across the full episode. 6/10 seeds 100% solve, 4/10 stuck at 50% — likely a recurrent-init sensitivity flagged in §Open questions.
Schmidhuber (1991/1992) — Neural sequence chunkers
chunker-22-symbol

22-symbol alphabet streamed without episode boundaries. Two-network history compression: automatizer A predicts the next symbol; chunker C receives only A's prediction failures (surprises). This bridges a 20-step lag that vanilla BPTT/RTRL fails on.
1992 — Neural Computation triple
Schmidhuber (1992) — Learning to control fast-weight memories
fast-weights-unknown-delay

Two arbitrary input signals must be associated across a time gap of unknown length. Slow programmer net S (917 params, 4 heads: key/value/query/gate); W_fast updated as W_fast += eta · g_t · outer(v_t, k_t). A sigmoid gate makes "load and hold" readable; 100% bit-accuracy for gaps K=5–30 (trained) and K=1–60 (extrapolation).
fast-weights-key-value

A sequence of (key, value) pairs is presented one step at a time. Each step writes an outer-product update into a fast weight matrix. Retrieval = W_fast · k_query. The linear-Transformer ancestor — Schlag/Irie/Schmidhuber 2021 (see linear-transformers-fwp in 2018–2025) showed this is identical to linear self-attention.
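A minimal sketch covering both fast-weight stubs, with illustrative dimensions: a gated outer-product write into W_fast, and retrieval as a plain matrix-vector read.

```python
import numpy as np

d, eta = 8, 0.5
W_fast = np.zeros((d, d))

def fw_write(k, v, g=1.0):
    W_fast[:] += eta * g * np.outer(v, k)       # slow net emits key k, value v, gate g

def fw_read(q):
    return W_fast @ q                           # retrieval is just a read through W_fast

rng = np.random.default_rng(0)
k = rng.normal(size=d); k /= np.linalg.norm(k)  # unit-norm key
v = rng.normal(size=d)
fw_write(k, v)
assert np.allclose(fw_read(k), eta * v)         # the stored value comes back, scaled by eta
```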
Schmidhuber (1992) — Learning factorial codes by predictability minimization
predictability-min-binary-factors

Given an observable x produced by a fixed random linear mixing of K independent binary factors, learn an encoder E: x → y that produces a factorial code. Adversarial setup: the encoder maximizes per-component predictor MSE; the predictors minimize it. Proto-GAN math, 22 years before Goodfellow 2014. Predictors collapse to chance (L_pred = 0.2500, the exact chance-level MSE for a binary target).
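The zero-sum structure in one pair of objectives, a sketch only (the stub's encoder and predictors are small trained nets): predictors minimize per-component MSE, the encoder maximizes exactly the same quantity.

```python
import numpy as np

def pm_losses(y, y_pred):
    """y: (batch, K) code from the encoder; y_pred[:, i] is predictor i's guess of
    y[:, i] computed from the other K-1 code units."""
    mse = np.mean((y_pred - y) ** 2)
    return mse, -mse          # (predictor objective, encoder objective)
```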
1993 — Predictable classifications, self-reference, very deep chunking
Schmidhuber & Prelinger (1993) — Discovering predictable classifications
predictable-stereo

Predictability maximization — the dual of PM. Two networks each see one view of the same synthetic stereo scene; their job is to produce scalar codes that maximally agree. The only thing the two views share is a hidden binary depth bit, so maximizing agreement forces them to recover it. Becker-Hinton-style IMAX.
Schmidhuber (1993) — A self-referential weight matrix
self-referential-weight-matrix

A recurrent network whose weight matrix is itself part of the state. W_eff = W_slow + W_fast. Slow params trained by BPTT across episodes; fast plastic matrix is reset each episode and rewritten by the network’s own outputs every step. 4-way boolean meta-learning (AND/OR/XOR/NAND): 99.6% query accuracy, manual BPTT gradient check at 8e-7.
Schmidhuber (1993) — Habilitationsschrift
chunker-very-deep-1200

The Habilitationsschrift’s “very deep learning” demonstration: the two-network neural sequence chunker doing credit assignment over roughly 1200 unrolled time-steps. Effective BPTT depth T - 1 = 1199 (raw) compresses to 2 (chunker on surprises). 599.5× depth-reduction at T=1200.
1995–1997 — Levin search and the LSTM benchmark suite
Schmidhuber (1995/1997) — Discovering solutions with low Kolmogorov complexity
levin-count-inputs

Find a program that maps a 100-bit input to its popcount from only 3 training examples — without gradient descent. Levin search enumerates programs ordered by len(p) + log(t). Found program: 5-instr PUSH0 HERE BIT ADD LOOP. 770k programs enumerated in 1.0s; 200/200 generalize.
levin-add-positions

Same Levin enumeration, different target: index-sum of the bit positions where the input is 1 (induces the linear weight vector w_i = i). Found program: length-3 im+. 58 evaluations to find; 200/200 generalize on held-out.
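A generic sketch of the search schedule used by both Levin stubs, assuming a hypothetical run(program, budget) interpreter that returns True when all training examples are reproduced within the step budget: each phase doubles the total effort, and within a phase a program of length L gets about 2^(phase−L) steps, which realises the len(p) + log(t) ordering.

```python
from itertools import product

def levin_search(alphabet, run, max_phase=20):
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)           # shorter programs get more time
            for prog in product(alphabet, repeat=length):
                if run(prog, budget):                # solves all training examples?
                    return prog, phase
    return None, max_phase
```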
Hochreiter & Schmidhuber (1996) — LSTM can solve hard long time lag problems
rs-two-sequence

Bengio-94 latch task. Random-weight-guessing on a small fully-recurrent net solves what BPTT/RTRL fails on. The point is the algorithm: just sample weights uniformly, run forward, score. No mutation, no crossover, no gradient. 30/30 seeds solve, median 144 trials.
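The whole algorithm fits in a few lines; a sketch assuming a hypothetical solves(weights) callback that builds the small recurrent net, runs it forward, and checks the task criterion (the weight bound and trial cap here are illustrative).

```python
import numpy as np

def random_weight_guessing(solves, n_weights, bound=10.0, max_trials=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    for trial in range(1, max_trials + 1):
        w = rng.uniform(-bound, bound, n_weights)    # sample; no mutation, crossover, or gradient
        if solves(w):
            return w, trial
    return None, max_trials
```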
rs-parity

N-bit sequence parity (XOR of all input bits) by random weight guessing on a small recurrent net. The parity solution lives in a narrow weight-space basin RS happens to hit by chance. N=50 seed 0: 10,253 trials / 15.3s; N=500 seed 0: 412 trials / 3.2s.
rs-tomita

Random-weight guessing on Tomita grammars #1 (a*), #2 ((ab)*), and #4 (no aaa substring). Three regular languages of increasing difficulty. All 3 grammars solved across 10 seeds; trial counts within ~3× of paper for #1/#2, ~6× for #4.
Hochreiter & Schmidhuber (1997) — Long Short-Term Memory canonical battery
adding-problem

T=100 sequences with 2-D inputs: random reals + sparse markers. Target = sum of the 2 marked values. The first non-trivial LSTM benchmark. LSTM MSE 0.0007 (50× under paper’s 0.04 threshold); vanilla RNN MSE 0.0706 (gradient vanishes); 5/5 seeds clear; gradient check 1.6e-7.
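A simplified generator sketch (the stub follows the paper's exact value range and marker-placement rules): channel 0 carries random values, channel 1 carries the two markers, and the target is the sum of the two marked values.

```python
import numpy as np

def adding_batch(n, T=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros((n, T, 2))
    x[:, :, 0] = rng.uniform(0.0, 1.0, (n, T))       # value channel
    y = np.zeros(n)
    for b in range(n):
        i, j = rng.choice(T, size=2, replace=False)  # two marked positions
        x[b, [i, j], 1] = 1.0                        # marker channel
        y[b] = x[b, i, 0] + x[b, j, 0]               # target = sum of the marked values
    return x, y
```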
embedded-reber

Reber grammar wrapped with outer T/P matching pair (long-range dependency). Original 1997 LSTM (input + output gate, no forget gate). 10/10 seeds, mean 4800 sequences vs paper 8440 — 1.8× faster with Adam + negative gate-bias init.
noise-free-long-lag

Two locally-encoded sequences (y, a₁,…,a_{p−1}, y) and (x, a₁,…,a_{p−1}, x). Sub-variant (a) at p=50: solved at sequence 600. Last-step gradient weighting trick (×100) keeps Adam’s per-step normalisation from drowning out the rare long-lag signal.
two-sequence-noise

Variant 3c (target noise σ=0.32). Canonical 1997 LSTM, 3 blocks × 2 cells = 6 cells, 103 params. Output-gate biases per block = -2, -4, -6 (paper’s recipe). 4/4 seeds 100% accuracy on noiseless test sequences.
multiplication-problem

Same as adding-problem but target = product of the 2 marked values. LSTM with forget gate (Gers 2000). MSE 0.0028 at T=30 (17× chance); 3/5 seeds converge — paper-faithful per-seed brittleness.
temporal-order-3bit

Two information-carrying symbols X, Y at unknown positions; classify the temporal order (XX, XY, YX, YY). Original 1997 LSTM (no forget gate). 5/5 seeds 100%, median ~6.4k seqs vs paper 31,390 (Adam advantage). Vanilla RNN at chance 0.25.
Mid-90s — Evolutionary, RL, and feature detection
Salustowicz & Schmidhuber (1997) — Probabilistic Incremental Program Evolution
pipe-symbolic-regression

Symbolic regression on Koza’s classic benchmark f(x) = x⁴ + x³ + x² + x. Probabilistic Prototype Tree (PPT) over {+, −, *, /, x, R}. PBIL update toward elite at every visited node; per-component mutation along elite path. No gradient, no crossover. Seed 3 finds the exact polynomial at gen 60.
pipe-6-bit-parity

Same PIPE machinery on Boolean function set {AND, OR, NOT, IF, x_0..x_5}. Bitmask program evaluator runs all 64 inputs in O(tree_size) bitwise ops. 4-bit even parity solves cleanly at gen 258 (16/16); 6-bit reaches 71.9% at the 240s budget cap.
Schmidhuber, Zhao, Wiering (1997) — Shifting inductive bias with SSA
ssa-bias-transfer-mazes

Success-story algorithm: keep a stack of policy modifications; only retain modifications that produce statistically significant lifetime-reward improvements (history-conditioned, not per-task). Bias from one task transfers to the next. 4 sequential POM mazes; SSA tail solve 0.83 vs no-SSA 0.70 (+19%).
Wiering & Schmidhuber (1997) — HQ-learning
hq-learning-pomdp

Hierarchical Q(λ) for POMDP. M sub-agents with their own Q-tables; control transfers between sub-agents at sub-goal observations. Honest non-replication: paper’s HQ-vs-flat gap doesn’t reproduce on the 29-cell maze. Mathematical analysis: γ^Δt · HV ≤ R_goal bound prevents per-corridor specialization on small mazes. v1.5 follow-up flagged at paper’s 62-cell maze.
Schmidhuber, Eldracher, Foltin (1996) — Semilinear PM
semilinear-pm-image-patches

Linear encoder y = Wx on the Stiefel manifold (polar projection after every step). Predictor input is the standardised squared code z = (y² - μ) / σ (the squaring is the one nonlinearity — “semilinear”). Synthetic 1/f² pink-noise + oriented bars input. Result: V1-style oriented edge detectors emerge, like ICA.
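The polar projection after every gradient step is one SVD; a minimal sketch:

```python
import numpy as np

def polar_project(W):
    """Nearest matrix with orthonormal rows: drop the singular values of W = U S Vt."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt
```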
Hochreiter & Schmidhuber (1999) — LOCOCODE
lococode-ica

Tied autoencoder + L1 sparsity on whitened input (surrogate for the paper’s flat-minimum-search Hessian penalty). On synthetic Laplacian sources: Amari distance 0.093 — 4× better than PCA (0.388), within 5× of FastICA (0.022). Demonstrates that low-complexity coding produces ICA-like sparse independent components.
2000–2002 — LSTM follow-ups
Gers, Schmidhuber, Cummins (2000) — Learning to forget
continual-embedded-reber

Embedded Reber strings concatenated without any episode reset. Mechanism contrast made visible: forget-gate LSTM cell-state norm stabilizes at ~25; no-forget-gate norm grows to ~295 across the stream. Forget gates drop at end-of-string offsets. 5/5 forget seeds solve (99.7%) vs 5/5 no-forget at chance (55%).
Gers & Schmidhuber (2001) — Context-free and context-sensitive languages
anbn-anbncn

Two formal languages: a^n b^n (context-free) and a^n b^n c^n (context-sensitive). Peephole LSTM (Gers 2002 cell). Cell 0 emerges as a clean linear counter — charges during a’s, discharges during b’s. Trained n=1..10 → generalizes a^n b^n to n=1..65; a^n b^n c^n to n=1..29.
Gers, Schraudolph, Schmidhuber (2002) — Learning precise timing
timing-counting-spikes

Measure-Spike-Distance (MSD): two input spikes at t1 < t2; network must fire at t1 + 2·(t2 - t1). Peephole LSTM (cell state feeds gates). One cell develops an analog interval timer across the inter-spike gap. Honest partial: paper’s “vanilla fails entirely” doesn’t fully reproduce at short-MSD scale; v1.5 path: T ≥ 300, longer training.
Eck & Schmidhuber (2002) — Blues improvisation
blues-improvisation

12-bar bebop blues. Fixed chord progression: C7 C7 C7 C7 / F7 F7 C7 C7 / G7 F7 C7 C7. 2-layer stacked LSTM (chord layer H1=20 → melody layer H2=24). 8 hand-synthesized 12-bar choruses (no external MIDI). 12/12 bar-onset chord match; on-beat note rate 0.792.
2002–2010 — Evolutionary RL, OOPS, BLSTM+CTC
Schmidhuber, Wierstra, Gomez (2005/2007) — Evolino
evolino-sines-mackey-glass

Hybrid neuroevolution + linear regression for sequence learning. LSTM hidden weights evolved by population selection + gaussian mutation + crossover; output layer trained per-individual via Moore-Penrose pseudo-inverse on the recurrent state’s time-series. Hidden weights NOT trained by gradient. Two tasks: superimposed sines, Mackey-Glass.
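A sketch of the per-individual readout fit, assuming H is the (T, hidden) recurrent-state time series collected from one evolved LSTM and Y the (T, out) targets:

```python
import numpy as np

def fit_readout(H, Y):
    W_out = np.linalg.pinv(H) @ Y     # Moore-Penrose least squares, no gradient descent
    return W_out, H @ W_out           # readout weights and fitted outputs (used for fitness)
```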
Gomez & Schmidhuber (2005) — Co-evolving recurrent neurons
double-pole-no-velocity

Cart with two stacked poles of different lengths (canonical hard non-Markov RL benchmark). Hidden velocities — only positions observed. Wieland 1991 double cart-pole sim in numpy, RK4 integration. Enforced Sub-Populations (ESP, Gomez 2003): H=5 subpopulations, network assembled by stacking one neuron per subpop; fitness propagates back. 7/10 seeds 20/20 generalize at pop=40 (paper’s pop=200, ~5× cheaper).
Graves et al. (2005/2006) — BLSTM and Connectionist Temporal Classification
timit-blstm-ctc

Synthetic phoneme corpus (K=6 phonemes, 8 mel-like bands, co-articulated shared-onset clusters so future context disambiguates). Bidirectional LSTM + log-space CTC forward-backward. BLSTM 1.87× faster than uni-LSTM (5/5 seeds 300 vs 560 iters); mid-training PER gap 0.27 vs 1.00.
Graves, Liwicki, Fernández, Bertolami, Bunke, Schmidhuber (2009) — Unconstrained handwriting
iam-handwriting

10-character hand-crafted alphabet, each glyph from ellipse arcs + line segments; 47-word vocab; per-word affine slant + per-point Gaussian jitter. BLSTM + CTC reads pen-trajectory data. In-vocab CER 0.082 / word acc 0.77; held-out compositional CER 0.647 honestly flagged.
Schmidhuber (2002–2004) — Optimal Ordered Problem Solver
oops-towers-of-hanoi

Towers of Hanoi: move n disks from peg 0 to peg 2; optimal solution length 2^n - 1. OOPS = Levin search with reusable subroutines. Discovers 6-token recursive solver SD C SD M SA C at n=3; reuses with zero search from n=4 onward. Verified through n=15 (32767 moves).
2010–2017 — Deep learning at scale
Cireşan, Meier, Gambardella, Schmidhuber (2010) — Deep, big, simple nets
mnist-deep-mlp

MNIST classification with a plain feedforward MLP — no convolution, no pretraining, no model averaging — on heavily deformed training data. Per-batch affine + Simard elastic deformation in pure numpy (separable Gaussian + bilinear sampling). 1.17% test err / 15 epochs / 79s.
Cireşan, Meier, Schmidhuber (2012) — Multi-column DNN
mcdnn-image-bench

Single-column 4-layer ReLU MLP on MNIST (paper’s multi-column ensemble + GTSRB/CASIA deferred to v1.5). 1.46% test err; multi-seed mean 1.47% ± 0.03%. Honest gap: paper 35-column ensemble 0.23%, single CNN ~0.4%.
Cireşan, Giusti, Gambardella, Schmidhuber (2012) — EM segmentation
em-segmentation-isbi

Synthetic Voronoi-EM substitute for ISBI 2012 stack: random Voronoi tessellation + dark 1-px boundaries + per-cell intensity + Gaussian noise + sparse organelles + 3×3 PSF blur. MLP pixel classifier on 32×32 patches. ROC AUC 0.989 vs Sobel+intensity 0.880; pixel acc 95.97%.
Srivastava, Masci, Kazerounian, Gomez, Schmidhuber (2013) — Compete to compute
compete-to-compute

LWTA (Local Winner-Take-All): groups of k=2 units per layer; only the per-group winner forwards activations, others zero out; gradient flows only through the winner. Sequential 2-task MNIST split (digits 0-4 → 5-9). LWTA forgetting 0.022 vs ReLU 0.072 seed 0 (3.3× less forgetting); 10-seed: LWTA wins 6/10.
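A minimal sketch of the activation rule for groups of k units; the same mask that zeroes the losers on the forward pass is what restricts the backward pass to winners.

```python
import numpy as np

def lwta(h, k=2):
    """h: (..., n_units) pre-activations with n_units divisible by k."""
    g = h.reshape(*h.shape[:-1], -1, k)
    mask = g == g.max(axis=-1, keepdims=True)   # per-group winner
    return (g * mask).reshape(h.shape)          # losers output exactly zero
```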
Srivastava, Greff, Schmidhuber (2015) — Highway Networks
highway-networks

Gated deep MLP: y = H(x)·T(x) + x·(1−T(x)) with learned sigmoid gate T. Depth comparison 5/10/20/30/50: highway stable at all depths (0.926 at depth 30); plain MLP dies past depth 10 (stuck at chance 0.124). Plain’s loss pinned at log(10) — gradients vanish through 30 saturating tanh layers.
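One layer in a few lines; with the gate bias initialized negative, T starts near 0 and the layer starts as an identity, which is what keeps 30+ layer stacks trainable.

```python
import numpy as np

def highway_layer(x, Wh, bh, Wt, bt):
    H = np.tanh(x @ Wh + bh)                      # transform branch
    T = 1.0 / (1.0 + np.exp(-(x @ Wt + bt)))      # sigmoid transform gate in (0, 1)
    return H * T + x * (1.0 - T)                  # carry the input through where T is small
```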
Greff, Srivastava, Koutník, Steunebrink, Schmidhuber (2017) — LSTM Search Space Odyssey
lstm-search-space-odyssey

8 LSTM variants in one ablation matrix: V (vanilla), NIG (no input gate), NFG (no forget gate), NOG (no output gate), NIAF (no input activation), NOAF (no output activation), CIFG (coupled input-forget), NP (no peepholes). All implemented behind one VariantFlags flag set. CIFG ranks 1st, NIG last across 3/3 seeds — matches paper’s “CIFG almost free” claim. Gradient check 1.31e-7.
Koutník, Greff, Gomez, Schmidhuber (2014) — Clockwork RNN
clockwork-rnn

Standard Elman RNN with hidden layer partitioned into G modules. Each module g has a clock period T_g; at timestep t a module updates only when t mod T_g == 0. Forward connections only flow from slower clocks to faster clocks. Synthetic sum-of-sines T=320, periods 8/32/80/160. CW-RNN MSE 0.117 vs matched-param vanilla 0.250 — 2.22× mean over 5 seeds.
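The clock rule is essentially a one-line mask (the periods here are illustrative): modules whose clock does not fire simply copy their previous state forward unchanged.

```python
import numpy as np

def cw_step(h_prev, h_candidate, t, periods=(1, 2, 4, 8, 16)):
    """h_prev, h_candidate: hidden vectors whose length is a multiple of len(periods)."""
    fire = np.array([t % Tg == 0 for Tg in periods])          # which modules update at step t
    mask = np.repeat(fire, len(h_prev) // len(periods))       # expand to a per-unit mask
    return np.where(mask, h_candidate, h_prev)                # inactive modules keep their state
```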
Koutník, Cuccu, Schmidhuber, Gomez (2013) — Vision-based RL via evolution
torcs-vision-evolution

Numpy oval racing track + 16×16 pixel observation. MLP 256→16→1 with W1 parameterized by a 4×4=16 low-frequency 2-D DCT block per hidden unit (decoded via precomputed orthonormal IDCT-II matrix). Natural ES (antithetic sampling, rank-shaped fitness) on 289 numbers; equivalent raw-W1 search would be 4129 numbers. 14.3× compression.
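A sketch of the decoding step, assuming an orthonormal DCT-II basis: 16 coefficients per hidden unit expand into a 16×16 weight patch, which is where the quoted compression factor comes from.

```python
import numpy as np

def dct_matrix(n):
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.cos(np.pi * (i + 0.5) * k / n) * np.sqrt(2.0 / n)
    D[0] *= np.sqrt(0.5)
    return D                                  # orthonormal DCT-II, rows = basis vectors

def decode_hidden_unit(coeffs_4x4, side=16):
    C = np.zeros((side, side))
    C[:4, :4] = coeffs_4x4                    # low-frequency corner only
    D = dct_matrix(side)
    return (D.T @ C @ D).ravel()              # inverse 2-D DCT -> 256 input weights
```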
Greff, van Steenkiste, Schmidhuber (2017) — Neural EM
neural-em-shapes

Unsupervised perceptual grouping. K=3 slot Neural EM with manual BPTT through T=4 unrolled EM iterations. E-step softmax over pixel likelihoods, M-step tanh recurrence on bottlenecked H=24 (forces specialisation). Best test NMI 0.428 at epoch 7 (chance 0.33); slot-collapse drift after epoch 7 documented as v1.5 fix.
van Steenkiste, Chang, Greff, Schmidhuber (2018) — Relational Neural EM
relational-nem-bouncing-balls

Bouncing balls with elastic equal-mass collisions. Oracle 4-D slot state (x, y, vx, vy). Non-relational baseline: MLP_dyn(s_k); relational: MLP_msg(s_k, s_j) → mean aggregation → MLP_dyn(s_k, agg_k). Relational wins K=3,4,5; loses K=6 (distribution shift dominates).
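A sketch of the relational update, with hypothetical mlp_msg and mlp_dyn callables standing in for the stub's small trained MLPs:

```python
import numpy as np

def relational_step(S, mlp_msg, mlp_dyn):
    """S: (K, 4) oracle slot states (x, y, vx, vy). Each slot averages pairwise
    messages from the other slots, then predicts its own next state."""
    K = len(S)
    agg = [np.mean([mlp_msg(S[k], S[j]) for j in range(K) if j != k], axis=0)
           for k in range(K)]
    return np.stack([mlp_dyn(S[k], agg[k]) for k in range(K)])
```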
2018–2025 — World models, fast-weight Transformers, systematic generalization
Ha & Schmidhuber (2018) — Recurrent World Models
world-models-carracing

Numpy 2-D top-down racing track substitute for CarRacing-v0. Centerline = closed loop generated from low-frequency sinusoids; agent observes a 16×16 patch of mask, rotated to car frame. V (encoder) + M (LSTM world-model) + C (linear policy) — all the paper’s three modules, evolved by simplified rank-μ ES. V+M+C +103.8 mean across 5/5 seeds (random +4.84) — ~21× random.
world-models-vizdoom-dream

Numpy 5×5 gridworld dodging-fireballs analog of DoomTakeCover. The paper’s “DoomRNN dream” experiment: controller C is trained ENTIRELY inside M’s rollouts (no real-env interaction during training), then transferred zero-shot to the real env. Dream-trained C: 49.1 ± 14.8 vs random 22.4 ± 18.3 — 2.2× random; matches/exceeds real-baseline on 2/5 seeds.
Schmidhuber et al. (2019) — Reinforcement Learning Upside Down
upside-down-rl

Standard RL fits a value function or a policy gradient. UDRL inverts this: the policy is a supervised mapping from (state, desired_return, time_horizon) → action. Numpy 9-state chain MDP per the SPEC's RL-stub rule (the paper used LunarLanderSparse). 5/5 seeds reach +4.70 at R*=5.0; the achieved return monotonically tracks the commanded R*.
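A sketch of the behaviour-function input, with illustrative scaling constants: the command is simply concatenated to the state, and the policy is trained by plain supervised learning on (input, action) pairs relabelled from past episodes.

```python
import numpy as np

def udrl_input(state_onehot, desired_return, horizon, r_scale=5.0, h_scale=10.0):
    cmd = np.array([desired_return / r_scale, horizon / h_scale])   # the "upside-down" part
    return np.concatenate([state_onehot, cmd])                      # ordinary supervised input
```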
Schlag, Irie, Schmidhuber (2021) — Linear Transformers ARE Fast Weight Programmers
linear-transformers-fwp

The cleanest result of the catalog: linear self-attention V^T(Kq) and the 1992 fast-weight programmer (V^T K)q compute the same numpy expression. Equivalence verified to 2.22e-16 (1 ulp at float64) on every input tested. Side-by-side visualization shows linear-attention scores + FWP scratchpad + retrieval bars match to round-off. Cross-references the wave-4 sibling fast-weights-key-value (1992 ancestor).
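The equivalence is pure associativity and fits in a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
K, V, q = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=d)

attn = V.T @ (K @ q)                # linear self-attention: score the query against keys first
fwp  = (V.T @ K) @ q                # 1992 FWP: accumulate W_fast = V^T K, then read with q
print(np.max(np.abs(attn - fwp)))   # agreement to round-off
```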
Csordás, Irie, Schmidhuber (2022) — The Neural Data Router
neural-data-router

Compositional table lookup: 4 values × 4 functions × depth-d expressions. NDR adds two switches to a Transformer: geometric attention (per-query distance-ordered scan, “stop at first match”) + per-position copy gate. Test depth 5 (+1 above training): NDR 0.60 vs vanilla 0.32 (chance 0.25); 3-seed NDR 0.405 ± 0.013 vs vanilla 0.296 ± 0.031 (NDR wins 3/3). Honest +1-depth gain vs paper’s “100% length generalization” claim.
How the GIFs and viz folders are generated
```
problem-folder/
├── README.md             # source paper, problem, results, deviations
├── <slug>.py             # dataset + model + train + eval
├── visualize_<slug>.py   # training curves + weight viz (writes to viz/)
├── make_<slug>_gif.py    # animated GIF (writes <slug>.gif)
├── <slug>.gif            # committed animation
└── viz/                  # committed PNGs
```
To regenerate any GIF or PNG locally:
```
cd <problem-folder>
python3 visualize_<slug>.py   # static figures
python3 make_<slug>_gif.py    # animated GIF
```
Seeds and hyperparameters are documented in each folder’s README. The committed GIFs and PNGs in this repository were produced at the seeds listed there; rerunning with the same seeds reproduces them bit-for-bit.
Where to go next
- For comparison numbers: RESULTS.md — every stub's paper-vs-implemented headline metric in one table, with a v2-filter recommendation section.
- For the research goal these baselines exist for: v2 ByteDMD instrumentation — these 58 implementations are the substrate the data-movement cost tracer will run against.
- For original-simulator reruns: per-stub §Open questions sections track v1.5 / v2 paths back to gym CarRacing-v0, VizDoom DoomTakeCover, TORCS, TIMIT, IAM, ISBI.
- For the build process: BUILD_NOTES.md — session report, agent-team orchestration, wave-by-wave timeline.