
blues-improvisation

Eck & Schmidhuber, Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks, NNSP 2002 (also IDSIA-07-02).

training animation

Problem

A 12-bar bebop blues. The chord progression is fixed:

| C7 | C7 | C7 | C7 | F7 | F7 | C7 | C7 | G7 | F7 | C7 | C7 |

Time is quantised to eighth notes (8 steps per bar × 12 bars = 96 steps per chorus). At each step the network observes two symbolic tokens:

  • chord, one of 3 (C7, F7, G7) — one-hot, 3 dims
  • pitch, one of 8 (C blues scale across two octaves + REST) — one-hot, 8 dims

So the input is an 11-dim multi-hot vector per step. The model is trained for next-step prediction on a small synthesized corpus of 8 hand-constructed choruses (all sharing the canonical chord progression but with different melodies). After training, it runs free-running from a single primer step, sampling one chord/pitch token pair at a time.
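The per-step encoding can be sketched as two concatenated one-hots plus the progression expanded to per-step labels. This is a minimal sketch; the label spellings and helper names are illustrative, not the script's actual API:

```python
import numpy as np

# Hypothetical label sets; the script's actual pitch spellings may differ.
CHORDS = ["C7", "F7", "G7"]                                      # 3-dim one-hot
PITCHES = ["C4", "Eb4", "F4", "G4", "Bb4", "C5", "Eb5", "REST"]  # 8-dim one-hot

def encode_step(chord, pitch):
    """Concatenate the chord and pitch one-hots into the 11-dim multi-hot input."""
    x = np.zeros(len(CHORDS) + len(PITCHES))
    x[CHORDS.index(chord)] = 1.0
    x[len(CHORDS) + PITCHES.index(pitch)] = 1.0
    return x

# Expand the 12-bar progression to one chord label per eighth-note step.
BARS = ["C7", "C7", "C7", "C7", "F7", "F7", "C7", "C7", "G7", "F7", "C7", "C7"]
step_chords = [c for c in BARS for _ in range(8)]  # 96 steps per chorus
```

Bar 5 (the first F7) starts at step 32 and bar 9 (G7) at step 64, which is the period-96 structure the network has to remember.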

The Eck & Schmidhuber 2002 headline claim is that LSTM, unlike vanilla RNNs, keeps the chord-progression structure stable over indefinitely many bars while improvising a new melody on top.

What it demonstrates

After 200 epochs (≈3 s), free-running the trained 2-layer LSTM with deterministic chord (argmax) and sampled pitch (T = 0.85) produces a chorus where:

  • all 12 bar-onset chords match the canonical progression (12/12),
  • 90.6% of step-level chord assignments match the progression,
  • 79.2% of strong-beat steps (positions 0 and 4 of each bar) are non-rest notes (“on-beat hits”),
  • 87.7% of non-rest pitches are chord-tones of the current chord.

That’s the headline: the LSTM has learned both the long-range chord progression (period 96 steps) and a chord-aware pentatonic melody, with no external MIDI dataset.
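These metrics can be computed directly from the generated token sequences. A minimal sketch (helper names are illustrative, not the script's API):

```python
import numpy as np

STEPS_PER_BAR = 8

def bar_onset_match(gen_chords, target_chords):
    """How many bar-onset steps carry the canonical chord (max 12 per chorus)."""
    onsets = range(0, len(target_chords), STEPS_PER_BAR)
    return sum(gen_chords[t] == target_chords[t] for t in onsets)

def step_level_match(gen_chords, target_chords):
    """Fraction of all steps whose chord matches the progression."""
    return float(np.mean([g == t for g, t in zip(gen_chords, target_chords)]))

def on_beat_hit_rate(gen_pitches):
    """Fraction of strong-beat steps (positions 0 and 4 of each bar) not at REST."""
    strong = [p for t, p in enumerate(gen_pitches) if t % STEPS_PER_BAR in (0, 4)]
    return float(np.mean([p != "REST" for p in strong]))
```

The chord-tone rate works the same way, with a lookup from the current chord to its allowed pitch set.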

Files

| File | Purpose |
| --- | --- |
| blues_improvisation.py | Synthesized corpus + 2-layer LSTM + manual BPTT + Adam + free-running generator. CLI. |
| visualize_blues_improvisation.py | Static PNGs into viz/: training curves, weight panels, ground-truth and generated piano rolls. |
| make_blues_improvisation_gif.py | Renders blues_improvisation.gif: training-time evolution of the generated chorus. |
| blues_improvisation.gif | Animation (chord track + piano roll + loss curves) over 21 epoch snapshots. |
| viz/training_curves.png | Total / chord-head / pitch-head loss + per-step argmax accuracy. |
| viz/weight_matrices.png | LSTM input weights (layer 1) and recurrent weights (layer 2), split per gate. |
| viz/corpus_pianoroll.png | One ground-truth training chorus rendered as a piano roll. |
| viz/generated_pianoroll.png | The free-running generated chorus. |

Running

Reproduces the headline number end-to-end:

python3 blues_improvisation.py --seed 0 --epochs 200
python3 visualize_blues_improvisation.py --seed 0 --epochs 200
python3 make_blues_improvisation_gif.py --seed 0 --epochs 200 --snapshot-every 10

Wallclock on an M-series laptop CPU (Python 3.12, numpy 2.4): training ≈ 3 s, viz ≈ 3 s, GIF ≈ 5 s. Total < 15 s.

Numerical gradient check (sanity for the manual BPTT):

python3 blues_improvisation.py --gradcheck
# → max relative error ≈ 1e-5 over 107 sampled weights
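A central-difference check against the analytic BPTT gradient works in the same spirit. A generic sketch, not the script's exact routine:

```python
import numpy as np

def numerical_gradcheck(loss_fn, params, analytic_grads,
                        n_samples=100, eps=1e-5, rng=None):
    """Compare analytic gradients against central differences at
    randomly sampled weight entries; returns the max relative error."""
    rng = rng if rng is not None else np.random.default_rng(0)
    max_rel_err = 0.0
    for _ in range(n_samples):
        k = rng.choice(list(params))
        idx = tuple(rng.integers(0, s) for s in params[k].shape)
        orig = params[k][idx]
        params[k][idx] = orig + eps
        lp = loss_fn(params)
        params[k][idx] = orig - eps
        lm = loss_fn(params)
        params[k][idx] = orig                    # restore the weight
        num = (lp - lm) / (2 * eps)              # central difference
        ana = analytic_grads[k][idx]
        rel = abs(num - ana) / max(1e-12, abs(num) + abs(ana))
        max_rel_err = max(max_rel_err, rel)
    return max_rel_err
```

A max relative error around 1e-5 or below is the usual pass threshold for double-precision BPTT.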

To inspect the synthesized corpus:

python3 blues_improvisation.py --print-corpus --seed 0

Results

| Metric | Value | Notes |
| --- | --- | --- |
| Final teacher-forced chord-prediction acc | 0.993 | per-step argmax over 96 steps |
| Final teacher-forced pitch-prediction acc | 0.372 | upper bound is ≈ 0.55 (training melodies are stochastic) |
| Bar-onset chord match (free-running, det.) | 12 / 12 | structural correctness |
| Step-level chord match (free-running, det.) | 0.906 | |
| On-beat note rate (free-running) | 0.792 | strong-beat steps not REST |
| Chord-tone rate (free-running) | 0.877 | non-REST pitches in current chord’s root palette |
| Total wallclock (training only) | ~3 s | seed 0, M-series laptop |

Hyperparameters (all defaults, all in the CLI):

seed            = 0
h1 (chord)      = 20
h2 (melody)     = 24
n_pieces        = 8
epochs          = 200
batch           = 8
lr              = 8e-3, halved every 80 epochs
optimizer       = Adam, ε=1e-8, β=(0.9, 0.999), grad-norm clip = 2.0
gating          = LSTM with forget gate, forget-bias init = 1.0
loss            = CE(chord) + CE(pitch), mean over (T, B)
sampling        = chord temperature 0 (argmax), pitch temperature 0.85
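The sampling rule in the last row can be sketched as a single temperature-aware draw per head, with T = 0 treated as argmax (a sketch; the function name is illustrative):

```python
import numpy as np

def sample_head(logits, temperature, rng):
    """Temperature 0 -> deterministic argmax; otherwise softmax sampling."""
    if temperature == 0:
        return int(np.argmax(logits))
    z = logits / temperature
    z = z - z.max()                      # stabilise the softmax
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
chord_idx = sample_head(np.array([2.0, -1.0, 0.5]), temperature=0.0, rng=rng)
pitch_idx = sample_head(np.array([0.3, 0.1, -0.2, 1.0]), temperature=0.85, rng=rng)
```

Lowering the pitch temperature below 1 sharpens the distribution toward the model's preferred chord tones without making the melody fully deterministic.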

The pitch-prediction accuracy plateaus around 0.37 because the training melodies are themselves stochastic: on weak beats a step rests with probability 0.20, and passing tones replace chord tones about 40% of the time. 0.37 is well above the chance baseline of 1/8 ≈ 0.125 shown as the dotted line in the accuracy plot.

Multi-seed sweep (200 epochs, 4 seeds):

| seed | det. bar-onset | det. step-level | sampled bar-onset | sampled step-level |
| --- | --- | --- | --- | --- |
| 0 | 12/12 | 0.906 | 12/12 | 0.854 |
| 1 | 8/12 | 0.938 | 12/12 | 0.958 |
| 2 | 7/12 | 0.896 | 7/12 | 0.802 |
| 3 | 12/12 | 1.000 | 8/12 | 0.948 |

Free-running RNN generation has compounding-error sensitivity to the random initialisation, which is why bar-onset match varies across seeds. Step-level chord match is more stable (0.90–1.00). Seed 0 is the headline number.

Reproducibility env (seed 0 run captured above):

python    3.12.7
numpy     2.4.4
platform  macOS-26.3-arm64

Visualizations

viz/training_curves.png — left: cross-entropy loss split by head (chord head converges to ≈ 0.04 by epoch 100; pitch head bottoms at ≈ 1.65, the entropy floor of the stochastic training melody). Right: teacher-forced argmax accuracy. Chord accuracy passes 0.95 around epoch 40 and reaches 0.99 by epoch 200; pitch accuracy climbs from 0.16 (≈ chance) toward ≈ 0.37 (near the achievable ceiling given the corpus’s melody noise).

viz/weight_matrices.png — top row: layer-1 input weights W1x split by gate (input, forget, cell, output). The chord-input columns (the first 3 indices on the x-axis) have larger magnitudes in the input and forget gates: layer 1 is using its chord input strongly to drive its memory. Bottom row: layer-2 recurrent weights W2h. The diagonal-leaning structure in the cell-gate panel shows the melody layer’s self-coupling.

viz/corpus_pianoroll.png — one of the 8 ground-truth training choruses. The chord strip on top alternates blue/orange/green for C7/F7/G7. The piano roll below shows pitch on the y-axis (REST at top), each note as a dark rectangle one timestep wide.

viz/generated_pianoroll.png — the free-running generated chorus, same layout. The chord strip exactly matches the training pattern; the melody emphasises chord tones (notes line up with the chord’s root palette in the roll) on strong beats.

blues_improvisation.gif — 21 frames captured every 10 training epochs. Frame 1 (epoch 1): chord strip is single-coloured (the LSTM hasn’t learned to switch yet); melody is mostly REST. By frame 5 (epoch 50): bar 5 has turned orange (F7), bar 9 turns green (G7) by frame 8 (epoch 80). The piano roll fills in chord tones over time. The bottom panel shows the chord-head loss collapsing while the pitch-head loss declines slowly.

Deviations from the original

  1. Stack instead of partition. Eck & Schmidhuber 2002 partition LSTM memory into a chord block and a melody block (with different time-scale biases) inside a single LSTM layer. We use a 2-layer stacked LSTM: layer 1 (H = 20) predicts chord, layer 2 (H = 24) takes layer 1’s hidden state and predicts pitch. Same intent (separate long-range chord memory from short-range melody memory), simpler implementation. Both variants share the structural property that the chord pathway can update independently of the melody pathway.

  2. Forget-gate LSTM, not vanilla 1997. We use the Gers, Schmidhuber & Cummins 2000 LSTM with a forget gate and bias init = 1.0. The 2002 blues paper used the same LSTM variant, so this is consistent.

  3. Synthetic corpus, not human MIDI. The 2002 paper trained on a small set of 12-bar choruses written by hand (Eck himself). We generate 8 choruses inside synth_corpus(), all sharing the canonical bebop-blues progression but with stochastic chord-tone-biased melodies. No external dataset.

  4. Vocabulary size. We use 3 chords and 8 pitches (C blues scale across two octaves + REST) — coarser than the 12-pitch chromatic vocabulary in the original. The structural property (chord progression has period 96 steps and must be remembered against melody noise) is preserved.

  5. Training schedule. 200 epochs of full-corpus BPTT with Adam, instead of the paper’s online BPTT with momentum. Adam is the standard recipe for these LSTM stubs across the wave (consistent with adding-problem, noise-free-long-lag, etc.); the paper’s exact hyperparameters are not load-bearing for the qualitative claim.

  6. Sampling at generation time. For the headline metric (bar-onset chord match) we sample chord deterministically (argmax) and pitch stochastically (T = 0.85). The paper sampled both stochastically; we report sampled-both metrics in the script’s stdout for comparison (sampled bar-onset match: also 12/12 at seed 0; step-level: 0.854).
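Deviations 1 and 2 together amount to the following per-step forward pass. This is a sketch with hypothetical parameter names (the script's layout may differ), using the stated shapes: input 11-dim, H1 = 20, H2 = 24, forget-gate bias initialised to 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """Forget-gate LSTM step (Gers et al. 2000); gates stacked as [i, f, g, o]."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f = sigmoid(z[:H]), sigmoid(z[H:2*H])
    g, o = np.tanh(z[2*H:3*H]), sigmoid(z[3*H:])
    c = f * c + i * g
    return o * np.tanh(c), c

def init_layer(rng, H, D):
    W = rng.normal(0, 0.1, (4*H, D))
    U = rng.normal(0, 0.1, (4*H, H))
    b = np.zeros(4*H)
    b[H:2*H] = 1.0          # forget-gate bias init = 1.0 (deviation 2)
    return W, U, b

def forward_step(x, state, layers, heads):
    """Stacked variant (deviation 1): layer 1 drives the chord head;
    layer 2 consumes layer 1's hidden state and drives the pitch head."""
    (h1, c1), (h2, c2) = state
    h1, c1 = lstm_step(x, h1, c1, *layers[0])
    h2, c2 = lstm_step(h1, h2, c2, *layers[1])
    chord_logits = heads["Wc"] @ h1      # 3-way chord head off layer 1
    pitch_logits = heads["Wp"] @ h2      # 8-way pitch head off layer 2
    return chord_logits, pitch_logits, ((h1, c1), (h2, c2))
```

Because the chord head reads only layer 1, the chord pathway can settle into its period-96 cycle regardless of what the melody layer does on top, mirroring the original paper's chord/melody memory split.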

Open questions / next experiments

  • Two-mode v1.5: 12-pitch chromatic vocabulary. Expand the pitch alphabet to a full chromatic octave (or two). The qualitative claim should still hold but with worse pitch-accuracy ceiling. Useful for the v2 ByteDMD instrumentation since it inflates the cost of the pitch head.
  • Vanilla RNN baseline. The blues progression has a period of 96 steps. A vanilla RNN at this depth should fail to keep the chord stable beyond a few bars. We did not include the comparison run in this stub (added cost ≈ 2 s); a future PR could add it as a one-flag toggle, in the same shape as adding_problem.py --rnn.
  • Multi-chorus rollout. The 2002 paper reports the LSTM stays on the chord progression for hundreds of bars. The current stub generates one chorus (96 steps); a longer rollout would test long-horizon stability, particularly under chord_temperature > 0.
  • Why pitch-acc plateaus at 0.37. The achievable ceiling depends on the corpus generator (rest_prob_weak, chord_tone_strength, beat-1/5 weighting). A small ablation could confirm pitch-acc tracks the corpus entropy and is not a model-capacity bottleneck.
  • Melody emphasis variation. Eck & Schmidhuber 2002 also describe more melodically-shaped training data. Our hand-coded melodies are pentatonic-flavoured but not phrase-shaped (no anticipation, no resolution to root on bar 12). A v1.5 corpus generator with phrase-level structure would let us test whether the LSTM picks it up.
  • Citation gap on the original IDSIA report. The IDSIA-07-02 PDF is not always retrievable. Our reconstruction follows the published NNSP 2002 abstract and Eck’s later journal pieces.