timing-counting-spikes

Gers, Schraudolph, Schmidhuber, Learning Precise Timing with LSTM Recurrent Networks, JMLR 3:115-143, 2002. The paper introduced peephole connections (cell state feeds the gates directly) to let LSTM solve precise-timing tasks the vanilla 1997 cell could not.

timing-counting-spikes training animation

Problem

The paper poses three timing tasks; we implement MSD (Measure-Spike-Distance) as the headline:

Each sequence has length T = 150 and a single binary input channel. Two input spikes appear at times t1 < t2 < T with separation D = t2 - t1, drawn uniform in [D_min, D_max] = [30, 60]. The network must produce an output spike at exactly t_target = t1 + 2D (the same gap D after the second input spike). Background channel is zero everywhere except on the two spike steps.

channel	value	when
input	1.0	at `t1`, `t2`. 0.0 elsewhere
target	1.0	at `t_target = t1 + 2D`. 0.0 elsewhere

Loss: per-timestep MSE between scalar output and the delta target. A sample is “solved” if argmax(pred[t2+1 : T]) is exactly t_target (tol = 0).

GTS (Generate Timed Spikes) and PFG (Periodic Frequency Generation), the other two task families in the paper, are not implemented in v1 (see §Open questions).

What it demonstrates

Peephole LSTM emits a spike at exactly the right step, with test MSE 0.00073 and exact-timing solve rate 0.998 on seed 4.
Vanilla LSTM (same architecture minus the three peephole vectors) trained under the identical recipe reaches solve_rate = 0.900, MSE 0.00240 - it learns the task but at lower precision, with ~10% of held-out spikes off by at least one step.
The cell-state heatmap (viz/cell_state.png) shows one cell building up an analog “interval timer” between the two input spikes and crossing a threshold exactly at t_target - the canonical peephole story.

Files

File	Purpose
`timing_counting_spikes.py`	LSTM cell with optional peephole connections, manual BPTT, Adam optimizer, MSD dataset generator, gradcheck, CLI. Single file, pure numpy.
`visualize_timing_counting_spikes.py`	Trains both peep and no-peep variants and writes static plots to `viz/`: training curves, sample predictions side-by-side, peephole-LSTM cell-state heatmap, peephole weights, gate weight matrices.
`make_timing_counting_spikes_gif.py`	Trains the peephole LSTM with snapshots and renders `timing_counting_spikes.gif`: a held-out test sequence + the test-MSE / solve-rate curve, frame per snapshot.
`viz/`	PNGs from the run below.
`timing_counting_spikes.gif`	Animation at the top of this README.

Running

Headline run (peephole LSTM, seed 4):

python3 timing_counting_spikes.py --seed 4 --peep \
    --T 150 --D-min 30 --D-max 60 --hidden 8 \
    --iters 3000 --batch 32 --lr 5e-3

Vanilla-LSTM baseline (same recipe, no peephole connections):

python3 timing_counting_spikes.py --seed 4 --no-peep \
    --T 150 --D-min 30 --D-max 60 --hidden 8 \
    --iters 3000 --batch 32 --lr 5e-3

Numerical gradient check on both variants:

python3 timing_counting_spikes.py --gradcheck

Static visualizations + GIF (regenerates everything in viz/ and the GIF):

python3 visualize_timing_counting_spikes.py --seed 4 --outdir viz
python3 make_timing_counting_spikes_gif.py --seed 4 \
    --snapshot-every 200 --fps 5

Wallclock on an Apple-silicon laptop (M-series, single CPU core):

step	wallclock
`timing_counting_spikes.py` peephole headline	~32 s
`timing_counting_spikes.py` vanilla baseline	~24 s
`--gradcheck`	~1 s
`visualize_timing_counting_spikes.py`	~58 s
`make_timing_counting_spikes_gif.py`	~35 s

End-to-end reproduction of every artifact in this folder is well under 3 minutes, comfortably inside the SPEC’s 5-minute budget.

Results

T = 150, D in [30, 60], hidden H = 8, batch 32, lr = 5e-3 halving every 1500 iters, 3000 training iters (96 000 sequences). Adam, global L2 gradient clip at 1.0. Forget-gate bias initialized to 1.0. Output is a scalar linear readout (no sigmoid).

Headline (seed 4)

variant	final test MSE	solve rate (exact)	sequences seen	wallclock
peephole LSTM	0.00073	0.998	96 000	32 s
vanilla LSTM (no peep)	0.00240	0.900	96 000	24 s

Eval is on 512 held-out sequences sampled from a separate test RNG; “solve rate” requires the predicted-spike step to match the target step exactly.

7-seed sweep (same recipe)

seed	peep MSE	nope MSE	peep solve	nope solve
0	0.00347	0.00400	0.668	0.600
1	0.00046	0.00100	1.000	1.000
2	0.00137	0.00107	0.900	1.000
3	0.00209	0.00293	0.865	0.645
4	0.00073	0.00239	1.000	0.904
5	0.00204	0.00059	0.965	1.000
6	0.00257	0.00156	0.766	0.959
mean	0.00182	0.00193	0.881	0.873

Both variants clear solve_rate >= 0.6 on every seed within the 3000-iter budget; both reach 1.000 on at least one seed; the peephole variant is ~5% lower MSE on average. The cleanest peephole-vs-vanilla contrast within budget is at seed 4 (used as the headline above), where the peephole solve rate is 1.000 and vanilla stalls at 0.900. Three seeds (2, 5, 6) actually favor the vanilla variant. The paper claims the vanilla LSTM “fails on all three tasks”, which we do not reproduce at this short-MSD scale on a 5-minute laptop budget; see §Open questions and §Deviations.

Gradient check

gradcheck (peep=True):  max rel err = 1.65e-07 over 25 samples (tol 1e-04)
gradcheck (peep=False): max rel err = 1.88e-07 over 25 samples (tol 1e-04)

Numerical and analytical gradients agree to within ~1e-7 for every weight (including all three peephole vectors p_i, p_f, p_o), confirming the manual BPTT in timing_counting_spikes.py.

Visualizations

Training curves (peephole vs vanilla LSTM)

training curves

Test MSE (log scale) and exact-timing solve rate over the 3000-iter training run, seed 4. The peephole LSTM falls another half-decade in MSE after iteration ~2200 once it has bound the cell-state counter to the output gate via p_o; the vanilla LSTM plateaus near 2e-3 MSE and 0.9 solve rate.

Sample predictions (held-out test set)

sample predictions

Four held-out test sequences with D in [33, 59]. Gray spikes are the inputs (at t1, t2). The green vertical bar is the target (at t_target = t1 + 2D). The peephole LSTM (blue, solid) puts a sharp peak right on the green bar; the vanilla LSTM (red, dashed) fires near the right place but is sometimes off by a step or attenuated.

Peephole LSTM cell state on a long-D sample

cell state

Top: the input spike train (the two spikes at t1=3, t2=59, target 115). Middle: cell states c_t for each of the 8 hidden units across the 150 time steps. Bottom: the network’s scalar output. Cell 0 starts to ramp up after the second input spike (dotted vertical line at t2), monotonically grows across the distractor stretch, and crosses a positive threshold right at the target step - exactly the “analog interval timer” behavior the peephole connection is designed to allow. The output gate, fed directly by c_t via p_o, opens at the right step.

Peephole weights

peephole weights

The three peephole vectors after training, one weight per cell. p_i (c_{t-1} -> i) and p_f (c_{t-1} -> f) gate the recurrence of each cell’s own counter; p_o (c_t -> o) is the “trigger” - the output gate’s coupling to the cell that holds the timer. Cells 1, 4, 5, 7 have the largest |p_o| and are the ones the trained LSTM uses to drive the output spike (consistent with the cell-state heatmap above showing cell 0 + a few neighbours carrying the count).

Gate weight matrices (peephole LSTM)

weights

Standard LSTM gate weights after training. Top: input -> gate (one row per input dim, here just the spike channel). Bottom: hidden -> gate. The recurrent Wh -> i and Wh -> f matrices encode the count-and-hold mechanism; the readout Wy (not plotted) projects the activated cell to the scalar output.

Deviations from the original

Task scale. Paper used much longer sequences (T up to ~500-1000 for GTS, even longer for the periodic-function-generation variants) and much longer intervals. We use T = 150, D in [30, 60] to stay inside the 5-minute laptop budget. At this scale the vanilla 1997 cell does not completely fail (paper claim) - it learns the task at slightly lower precision. The dramatic peephole-only demos require T >> 200; see §Open questions.
Optimizer. Paper used a custom RTRL-flavored gradient update with separate learning rates per gate. We use Adam (lr = 5e-3, global L2 gradient clip at 1.0, LR halved every 1500 iters). Adam is a strict superset of paper-style adaptive rates and is what every modern LSTM reproduction uses.
Mini-batches. Paper trained one sequence at a time. We batch 32 for numpy throughput. Gradient is averaged over the batch.
Forget gate. Paper’s vanilla LSTM had no forget gate (c_t = c_{t-1} + i_t * g_t). We use the modern variant from Gers/Schmidhuber/Cummins 2000 (c_t = f_t * c_{t-1} + i_t * g_t) with forget bias 1.0 - the same recipe as adding-problem and the rest of wave 6, and the standard since 2000. Our --no-peep baseline is therefore “Gers/Schmidhuber/Cummins 2000 LSTM”, strictly stronger than the literal 1997 cell. The paper’s contrast (peephole vs 1997 cell) would show a larger gap.
Output non-linearity. Paper’s MSD readout used sigmoid. We use a raw linear scalar output - cleaner gradient story, identical downstream task because the spike target is 0/1 and the loss is MSE.
Peephole init. Paper used “small random” init for p_i, p_f, p_o. We use randn(H) * 0.1. We tried zero-init, which is slightly worse on average (peep stops being initialised away from the no-peep solution and the optimizer has to break the tie with cell-specific peep updates).
MSD only. Paper has three timing tasks; we implement only MSD in v1. GTS (Generate Timed Spikes - same architecture, no input spikes, network must spike at fixed period) and PFG (Periodic Function Generation) are open follow-ups.
No memorized train/test split. Paper drew a finite training set and a separate test set. We sample on the fly from independent train/test RNGs - long-standing modern convention for synthetic benchmarks.

Open questions / next experiments

Reproduce the dramatic peep-only regime. The paper’s headline claim is that vanilla LSTM fails entirely on MSD/GTS/PFG. At our T = 150, D in [30, 60] scale, vanilla still solves ~90% of held-out samples within budget. Plausibly the paper’s failure is at T >= 300, D >= 100, where the vanilla LSTM’s count-via-tanh-bottleneck saturates. Sweep T in {300, 600, 1000} (with longer iter budget; out of v1 scope) and document where vanilla cleanly breaks.
GTS and PFG. The other two paper tasks should also fall out of the same code with small dataset changes: GTS = drop the input spikes entirely, target is a periodic spike train at fixed period sampled per trial (period encoded in a one-hot start signal); PFG = continuous sinusoidal target. Add --task {msd, gts, pfg} and a second visualisation script.
Cell-state-as-counter inspection. The cell-state heatmap shows cell 0 carrying an analog timer. Quantify: what fraction of cells in the trained peephole LSTM carry monotonic interval timers? The paper called this “analogue counter” but never measured it explicitly.
Effect of zero-init peephole weights. A 7-seed sweep with p_* init to zero gives slightly worse mean solve rate (0.79 vs 0.88). Why? The hypothesis is that random peep init breaks symmetry between cells; with zero init, optimizer has to drive peep weights from zero through the cell-update equation, which is gradient-bottlenecked early in training. Verify with a longer-iter run.
Energy / data-movement. Peephole LSTM’s appeal in 2002 was expressivity, but the cell adds three diagonal vectors per layer at near-zero compute cost. ByteDMD instrumentation (v2) should show peephole’s gradient stack-distance is essentially identical to vanilla LSTM, while accuracy is higher - a free lunch on the data-movement metric.
Failure mode of seed 0. Both variants converge to ~0.6 solve rate on seed 0 within budget (peep 0.668, vanilla 0.600). Diagnose whether this is a learning-rate-decay-too-fast issue or a bad init basin (likely the latter; the cell-state ramp doesn’t form for the right D-magnitude).

Keyboard shortcuts

Schmidhuber Problems