Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

temporal-order-3bit

Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8): 1735–1780. Experiment 6a (Temporal Order, 3-bit).

training animation

Problem

Each input sequence runs T = 50 symbols, drawn from an 8-symbol alphabet:

{a, b, c, d}  random distractors
{X, Y}        the two information-carrying symbols
{B, E}        sequence-start and sequence-end markers

Position 0 is always B, position T-1 is always E. Two slots t1 ∈ [3, 12] and t2 ∈ [25, 40] carry independently drawn symbols from {X, Y}. Every other interior slot is a uniform random distractor. The class label encodes the order of the two important symbols:

(first, second)class idname
(X, X)0XX
(X, Y)1XY
(Y, X)2YX
(Y, Y)3YY

Inputs are one-hot vectors of dimension 8. The network reads the whole sequence, then emits a 4-way softmax at the final time step. The minimum lag between the two informative symbols is 25 − 12 = 13, the maximum is 40 − 3 = 37. The network must hold the identity of the first marker across that gap while ignoring 13–37 distractor symbols.

What it demonstrates

A vanilla recurrent net with tanh activations cannot bridge the gap and stays at chance accuracy (≈ 0.25). An LSTM with the input-gate/output-gate cell of the 1997 paper (no forget gate, pure constant-error carousel) solves it to 100 %. Inspecting the trained net shows the input gate firing only on the two X/Y positions and the cell state encoding their order in the sign of two different cells.

Files

FilePurpose
temporal_order_3bit.pyDataset generator, LSTM with BPTT, vanilla-RNN baseline, training loops, gradient check, CLI.
visualize_temporal_order_3bit.pyReads results.json + snapshots.npz, writes static PNGs into viz/.
make_temporal_order_3bit_gif.pyBuilds the cell-state animation temporal_order_3bit.gif from the snapshot tensor.
temporal_order_3bit.gifCell-state heatmap evolving through training, one frame per ≈ snapshot.
viz/training_curves.pngLSTM vs RNN loss + accuracy.
viz/confusion_matrix.pngLSTM 4×4 confusion matrix on validation set.
viz/example_sequences.pngOne example sequence per class as a token-time heatmap.
viz/input_gate_activity.pngMax input-gate activation per time step on those examples.
viz/hidden_trajectories.pngCell state c_t and hidden state h_t per time step, per class.
viz/cell_state_heatmap.pngFinal cell state as a (cell index × time) heatmap.
results.jsonFull training log (steps, loss, accuracy, confusion matrix).
snapshots.npzCaptured hidden-state tensors for the GIF and trajectory plots.

Running

The headline command (≈ 24 s on an M-series laptop, single core):

python3 temporal_order_3bit.py --seed 0 \
    --n_steps 1500 --batch 32 --hidden 4 \
    --val_n 512 --eval_every 50 --record_hidden
python3 visualize_temporal_order_3bit.py
python3 make_temporal_order_3bit_gif.py

Self-test of the analytic LSTM gradient (max relative error vs central differences):

python3 temporal_order_3bit.py --gradcheck
# [gradcheck] max relative error = 2.363e-11

Results

Headline run, seed 0:

MetricValue
LSTM final validation accuracy (512 sequences)1.000 (512 / 512 correct)
LSTM step at first ≥ 95 % validation accuracy100 (= 3 200 sequences at batch 32)
RNN final validation accuracy0.250 (chance)
RNN best-ever validation accuracy0.266
LSTM training wall-clock13.6 s
RNN training wall-clock10.6 s
Total training sequences seen48 000 = 1 500 × 32
Trainable parameters (LSTM)184 (Wi, Wo, Wg ∈ R^{12×4} + biases + Why ∈ R^{4×4} + by)
Trainable parameters (RNN)68 (Wx ∈ R^{8×4}, Wh ∈ R^{4×4}, bh, Why, by)

Hyperparameters used:

HyperparameterValue
Sequence length T50
Hidden / cell count4
Batch size32
OptimiserAdam (lr = 0.02, β₁ = 0.9, β₂ = 0.999)
Gradient clip (global ℓ²)1.0
Steps1500
Input-gate bias init−1.0 (cell starts closed)
Other parameter initN(0, 0.1²)

Multi-seed reliability (--seed 0..4, otherwise identical config):

seedLSTM final accRNN final accfirst-step ≥ 95 %
01.0000.238100
11.0000.293200
21.0000.230100
31.0000.254300
41.0000.258200

5 / 5 seeds solve. Median 200 steps to 95 % (≈ 6 400 sequences). The 1997 paper reports 31 390 sequences for a slightly larger sequence and an LSTM with 156 weights; we converge faster because of Adam (the paper used plain SGD with momentum).

Confusion matrix on 512 validation sequences (seed 0):

pred XXpred XYpred YXpred YY
true XX119000
true XY012800
true YX001340
true YY000131

Visualizations

temporal_order_3bit.gif — Cell state c_t for one held-out sequence per class, animated across training. At step 1 the heatmap is uniformly near zero. As training proceeds, a dark-then-light spike appears at the first X/Y position and a second spike at the second one; by step ≈ 200 the first cell carries the identity of the first marker (positive for X, negative for Y) and the second cell carries the second. Vertical ticks mark X (green) and Y (red) positions on the input.

viz/training_curves.png — Cross-entropy loss and validation accuracy for LSTM (blue) and vanilla RNN (orange). The LSTM curve drops from log 4 ≈ 1.39 to near zero around step 100; the RNN curve plateaus near log 4 and the accuracy line never lifts off the 0.25 chance line.

viz/confusion_matrix.png — A diagonal matrix: every class is recovered without a single confusion on 512 held-out sequences.

viz/example_sequences.png — One example sequence per class rendered as an 8 × 50 binary heatmap. Vertical lines mark the X (red) and Y (blue) positions.

viz/input_gate_activity.png — Max-over-cells input gate max_k i_t^{(k)} plotted as bars for those four sequences. The gate fires only on the two informative time steps and stays near zero on every distractor; the negative bias initialisation matters.

viz/hidden_trajectories.png — Two-row strip of c_t (top) and h_t (bottom) for each class. The cell trajectories show clear stepwise jumps at t1 and t2; h_t only carries information at the moment the output gate opens (the last few steps before the readout).

viz/cell_state_heatmap.pngc at the end of training, plotted as a H × T heatmap per class. The four classes are visually separable in cell space.

Deviations from the original

DeviationWhat the paper usedWhat we usedReason
Sequence length100–110 (and a longer “6b” variant for 4-bit)50Keeps the experiment under 30 s on a CPU laptop; the paper’s lag of ~30 distractors is preserved (t1 ∈ [3,12], t2 ∈ [25,40]).
Marker positionst1 ∈ [10,20], t2 ∈ [50,60]t1 ∈ [3,12], t2 ∈ [25,40]Scaled with the shorter length. The qualitative claim — that the network must integrate information across many distractor symbols — is unchanged.
Cell architecture2 cell blocks of size 2 (4 cells, gated together as 2 blocks)4 independent cells (no block structure)Block sharing of gates only saves parameters; with hidden = 4 the difference is small, and a flat layout is easier to read out and visualise.
OptimiserSGD with momentumAdam (lr = 0.02)Matches what the rest of the wave-6 stubs use; the paper’s optimiser converges in ~31 k sequences, ours converges in ~6 k. The algorithmic claim — long-time-lag credit assignment via a CEC — is what we are testing, not the optimiser.
Forget gatenot in 1997 NCnot present (matches the paper)The paper’s CEC has no forget gate; the forget gate was added by Gers, Schmidhuber & Cummins (2000). We follow the 1997 formulation.
Output activationsoftmax over 4 classessoftmax over 4 classesMatch.
Losscross-entropy at end of sequencecross-entropy at end of sequenceMatch.
Validation set sizeunspecified in the paper512 sequences, fresh seedOurs is reused across the whole run for fair comparison between LSTM and RNN.
Baseline“RTRL fully recurrent net”BPTT vanilla tanh-RNN with the same hidden size and the same Adam settingsBoth fail; the failure mode is qualitatively the same (cannot push gradient through 30+ distractor steps). RTRL would be slower per step but no more capable on this task.
Sequence-end markerB end-of-sequence symbolE (chose a distinct token to avoid colliding with the start-marker B used elsewhere in the alphabet)Cosmetic.

Open questions / next experiments

  • Block-structured cells. The paper shares gate weights inside a “memory block.” Sharing should make the input gate fire even more cleanly on the X/Y positions because all cells in a block see the same gate decision. Worth a five-minute follow-up.
  • Length scaling. The current experiment uses T = 50. Does the same hidden size still solve T = 100 (paper’s setting), T = 200, T = 500? The CEC has no decay, so in principle yes — the limiting factor is the optimiser, not the architecture. A length sweep would confirm.
  • Forget-gate ablation. Adding a forget gate (Gers 2000) speeds up the noise-free long-lag and adding-problem stubs but is not needed here. Worth a side-by-side once the wave-6 family is in place.
  • Citation gap. The 1997 NC paper’s “31 390 sequences” figure is reported in the literature but is not split by seed or by reset; we cannot tell whether their median or worst-case run is the headline. Our number (≈ 6 400 sequences, median over 5 seeds) is not directly comparable. If we want a like-for-like number we have to (a) match their architecture exactly, (b) match their optimiser, (c) report a 30-seed median with their stopping criterion. Tracked as a v2 follow-up.
  • DMC instrumentation (v2). Wrap forward + backward in bytedmd and report data-movement cost per training step. Expectation: distractor steps cost almost nothing because the input gate is near zero and the cell state is unchanged, so reads of c_{t-1} are repeats. The 1997 LSTM is a remarkably “data-movement friendly” recurrent architecture.

agent-0bserver07 (Claude Code) on behalf of Yad