world-models-vizdoom-dream
Ha & Schmidhuber, Recurrent World Models Facilitate Policy Evolution, NeurIPS 2018 (arXiv:1809.01999).

Problem
The paper’s “DoomRNN dream” experiment is a deliberately strange RL setup:
the controller C never sees the real environment during training. Instead,
C is trained entirely inside the dream of a learned recurrent world
model M, which itself was trained from a small batch of random-policy
trajectories collected from the real env. After training, C is dropped
back into the real env and evaluated zero-shot. The headline claim is that
C transfers — that the dream is realistic enough for the policy learned
inside it to be a good policy outside it.
VizDoom is a heavyweight install, so per SPEC issue #1 (cybertronai/schmidhuber-problems) v1.5-deferred RL stubs are finished under the synthetic-data rule: a hand-rolled numpy mini-env replaces the simulator, and the algorithmic structure is preserved (V → M → C, dream training, zero-shot transfer).
The mini-env is DodgingEnv, a small 2-D gridworld analog of DoomTakeCover:
fireballs spawn at top, fall toward bottom
+---------+
| * | <- spawn row (W=5 columns; one fireball at a time)
| * |
| |
| * |
| A | <- agent row (left / stay / right)
+---------+ reward = +1 per surviving step
W = 5columns,H = 5rows- one fireball at a time (
max_fireballs = 1), spawned every step the field is empty (spawn_prob = 1.0) - agent at row
H - 1, action ∈ {left, stay, right} - collision when a fireball reaches the agent’s column at the agent’s row
max_steps = 60cap on episode length (anything beyond that is truncated)
A purely random policy survives ~22 steps in expectation. An “always dodge
to the side opposite the falling fireball” policy can survive indefinitely
(capped at 60 by max_steps).
Pipeline
1. collect REAL trajectories from a random policy (200 eps)
2. train V: numpy MLP autoencoder on flat grid obs -> z (8-d) (800 steps)
3. train M: numpy LSTM on (z_t, a_t) -> (z_{t+1}, r_{t+1}, done) (2500 steps)
4. train C: tiny tanh-MLP, parameters optimised by ES, with rollouts
ENTIRELY INSIDE the dream of M -- no real-env queries (100 ES iters)
5. evaluate C in the real env (zero-shot transfer) (50 eps)
6. baseline: same C/ES trained directly in the real env (reference) (60 ES iters)
Architecture
- V — flat-grid autoencoder.
obs (3·H·W=75) -> tanh(32) -> z (8) -> tanh(32) -> 75. The 3 input channels are: agent indicator, fireball indicator, per-column nearest-fireball danger. - M — single-layer numpy LSTM (
hidden = 16). Input:[z (8); a_onehot (3)]. Three output heads:z_pred (8)(MSE),r_pred (1)(MSE),done_logit (1)(BCE). Trained by BPTT on length-20 sequences. - C — tiny 1-hidden-layer tanh MLP. Input:
[z (8); h (16)]. Hidden: 16 tanh units. Output: 3 action logits. ~419 parameters total. The paper uses a pure-linear C; we let C have one hidden layer to compensate for our weaker V/M (paper had a CNN-VAE V and an MDN-RNN M). Linear C still works on this env but is more variance-prone across seeds (see §Deviations).
ES (numpy analog of CMA-ES)
OpenAI-ES style: pop = 24, σ = 0.15, lr = 0.10, fitness = mean dream
return over 3 fixed initial-z’s per generation. The paper used CMA-ES; we
use the simpler fixed-σ variant because (a) it’s pure numpy with no scipy
dependency and (b) for our 419-parameter C the population size reasonably
covers the gradient direction. Documented in §Deviations.
Two practical knobs that made the dream transfer
- Dream temperature (Gaussian z-noise = 0.15). Following Ha & Schmidhuber
2018 §A: a deterministic dream lets
Cexploit M’s idiosyncrasies in a way that doesn’t transfer. Adding additive Gaussian noise toz_predeach dream step is the numpy analog of the paper’s MDN-RNN temperature = 1.15 mixture sampling. Setting noise = 0 collapses the transfer. - Bounded dream rollout length (40 steps). M was trained on
random-policy trajectories whose mean length is ~22. Letting the dream
run for 100+ steps accumulates compounding model error and gives
Can unreliable training signal. Capping at 40 keeps the training distribution close to where M’s predictions are accurate.
Files
| File | Purpose |
|---|---|
world_models_vizdoom_dream.py | DodgingEnv, V autoencoder, M LSTM, C MLP, ES, train + eval + CLI |
make_world_models_vizdoom_dream_gif.py | trains and renders C_dream side-by-side in real env vs M’s dream — the GIF at the top |
visualize_world_models_vizdoom_dream.py | reads run.json and writes 5 PNGs to viz/ |
world_models_vizdoom_dream.gif | animation referenced at the top |
viz/env_layout.png | annotated DodgingEnv layout |
viz/v_m_curves.png | V autoencoder loss + M (LSTM) per-head training losses |
viz/survival_real_vs_dream.png | headline figure — survival vs ES iter, dream-trained C (left) vs direct-trained baseline (right) |
viz/final_survival_dist.png | histogram of final survival times: random / C_dream / C_real (50 eps each) |
viz/weight_matrix_C.png | learned C policy as a heatmap (effective `[z |
Running
python3 world_models_vizdoom_dream.py --seed 1
Reproduces the headline run in ~20 seconds on an M-series laptop.
Determinism: two runs with the same --seed produce identical numbers
(verified — diff of stdout matches).
To regenerate the visualisations and the GIF:
python3 world_models_vizdoom_dream.py --seed 1 --quiet --save-json run.json
python3 visualize_world_models_vizdoom_dream.py
python3 make_world_models_vizdoom_dream_gif.py
CLI flags: --quick (smaller / faster smoke test, ~3 s),
--save-json path (dump full summary), --no-baseline (skip the
direct-trained C baseline), --quiet (suppress per-stage logs).
Results
Headline run, seed 1, defaults (50 eval episodes per row, real env):
| Policy | mean survival steps | std | notes |
|---|---|---|---|
| random | 22.4 | ±18.3 | baseline floor |
| C_dream (zero-shot transfer) | 49.1 | ±14.8 | trained ENTIRELY INSIDE M’s dream |
| C_real (direct ES baseline) | 44.3 | ±19.5 | trained ES in real env, reference |
The dream-trained C achieves 2.2× the random baseline and matches
(in this seed, slightly exceeds) the directly-trained baseline. The
controller never queried the real env during training — it was selected
entirely by ES rollouts inside M’s hallucination — yet it transfers
cleanly.
Multi-seed sweep (5 seeds, defaults):
| seed | random | C_dream | C_real | dream / random | dream / real |
|---|---|---|---|---|---|
| 0 | 25.1 | 29.3 | 60.0 | 1.17× | 0.49× |
| 1 | 24.9 | 49.1 | 44.3 | 1.97× | 1.11× |
| 2 | 18.3 | 26.9 | 60.0 | 1.47× | 0.45× |
| 3 | 22.0 | 25.1 | 60.0 | 1.14× | 0.42× |
| 4 | 25.5 | 50.9 | 60.0 | 1.99× | 0.85× |
| mean | 23.2 | 36.3 | 56.9 | 1.57× | 0.66× |
5 / 5 seeds: C_dream beats random.
2 / 5 seeds (1, 4): C_dream matches or exceeds the direct-trained
real-env baseline at the same ES budget — the strongest version of the
transfer claim. On the other 3 seeds the dream-trained controller gives a
modest improvement over random but does not match the saturation
(60-step cap) reached by the direct-trained C. This per-seed variance
matches the paper’s reported variance (Ha & Schmidhuber 2018 reports
1092 ± 556 — about ±50 % standard deviation across seeds for VizDoom).
Hyperparameters (all defaults; see RunConfig in
world_models_vizdoom_dream.py):
# env
W=5, H=5, max_fireballs=1, spawn_prob=1.0, max_steps=60
# V (autoencoder)
z_dim=8, v_hidden=32, v_train_steps=800, v_lr=2e-3, v_batch=64
# M (LSTM)
m_hidden=16, m_train_steps=2500, m_lr=3e-3, m_seq_len=20, m_batch=16
# data
n_random_episodes=200
# C (1-hidden-layer tanh MLP)
c_hidden=16, n_actions=3
# ES (numpy OpenAI-ES, the substitute for paper's CMA-ES)
es_iters=100, es_pop=24, es_sigma=0.15, es_lr=0.10
es_z0_samples=3 # average dream return over 3 init-z's per generation
# dream rollouts
dream_max_steps=40
dream_z_noise=0.15 # paper's "temperature" trick
dream_done_threshold=0.4
# baseline
train_baseline=True, baseline_es_iters=60
# eval
eval_every=5, eval_episodes=5, n_final_eval=50
Total wallclock = ~20 s on an M-series laptop CPU (Darwin-arm64,
Python 3.12.9, numpy 2.x, single-threaded numpy ops). The GIF script
retrains a fresh model so it costs an additional ~20 s.
Visualizations
world_models_vizdoom_dream.gif
Two panels side by side. Left: the dream-trained C_dream running in
the actual DodgingEnv (the zero-shot transfer test). The agent
(blue circle) dodges falling fireballs (orange). Right: the same
C_dream, same initial state, but rolling out inside M’s dream. The
fireballs in the right panel are reconstructed by decoding M’s predicted
z_t back through V, so they’re not pixel-faithful — they’re a
learned compression. The point is that M’s dream is good enough for C
to learn a transferable dodging policy.
viz/env_layout.png
The DodgingEnv layout. Agent at the bottom row, fireballs spawn from the top.
viz/v_m_curves.png
Two panels. Left: V autoencoder MSE drops from ~0.10 to ~0.01 over
800 training steps — V learns a compact 8-D code for the 75-D grid.
Right: M’s three losses (log scale): z MSE, r MSE, done BCE.
The total loss drops from ~1.9 to ~0.07 over 2500 BPTT steps. The
reward and done predictions become very accurate; the z MSE bottoms
out at ~0.02 — small but non-zero, which is what creates room for the
dream/real distribution shift that the temperature trick masks.
viz/survival_real_vs_dream.png
Headline figure. Two panels.
- Left: the dream-trained
C. Green line: mean survival steps when evaluated insideM’s dream (saturates at the dream-rollout cap of 40). Orange line: mean survival in the real env (zero-shot transfer evaluation, run every 5 ES iterations). The orange line tracks above the random-policy baseline (dashed) for the bulk of training and lifts to 53 at the final iteration. This is the transfer demonstration. - Right: for reference, the direct-trained baseline
C_realon the same ES, but with rollouts in the real env. It oscillates around 50 with peaks at the 60-step cap. The orange dotted line marksC_dream’s final score (49.1) — comparable to the baseline’s mean.
viz/final_survival_dist.png
Histogram of survival times over 50 final-eval episodes per policy.
- Random (gray): peaks at 5–10 steps; long tail.
C_real(blue): peaks at 5–10 and 25–30 (bimodal — the controller works some episodes, dies early in others).C_dream(red): heavily skewed toward the 60-step cap. The dream-trained controller survives the full episode in over half of the rollouts.
viz/weight_matrix_C.png
The dream-trained C’s effective [z | h] -> action map (W1 @ W2,
ignoring the tanh nonlinearity for visualisation). Red cells push the
network toward “right”, blue toward “left”. The structure is dominated
by a few specific z and h dimensions, suggesting that V and M’s
hidden code already represent “danger column” in a small number of
features and C reads them out almost linearly.
Deviations from the original
- Environment substitution: numpy DodgingEnv, not VizDoom DoomTakeCover. Per SPEC issue #1, v1.5-deferred RL stubs use a numpy mini-env. The algorithmic claim (controller trained inside the world-model dream transfers to the real env) is captured cleanly here. The exact VizDoom number (1092 ± 556 paper score; 750 “solved” threshold) is not reproduced and would only re-emerge when DoomTakeCover-v0 is wired up in v1.5.
- V is an MLP autoencoder, not a CNN-VAE. The paper uses a CNN VAE
on 64×64 RGB pixel frames. Our obs is a flat 75-D grid (3 channels ×
5×5). MLP autoencoder is sufficient for that input dim and avoids
numpy-CNN bookkeeping. The β = 0 (“plain MSE”) choice over the paper’s
KL-regularised VAE is also a simplification — for our small
z_dim = 8on flat input, the AE works fine. - M is a deterministic LSTM, not an MDN-RNN. The paper’s M outputs a
Gaussian mixture over
z_{t+1}(5 components). Ours outputs a single Gaussian (in fact, a single point estimate plus the dream-temperature Gaussian noise applied externally). For a 5×5 dodging gridworld with a single fireball this gives nearly the same dream quality. On a pixel-faithful VizDoom reproduction the MDN structure is more important and would need to be added back. - C is a 1-hidden-layer tanh MLP, not a pure-linear policy. The
paper’s C is a single linear layer over
[z; h](≈ 600 params on the full VizDoom config). Ours has one tanh hidden layer of 16 units. We found that pure-linearCworks on this env but with higher per-seed variance: linearCsucceeds on 1 / 5 seeds at >2× random, the MLPCat 2 / 5 seeds. We chose the MLP for the reported headline. Both architectures are supported viac_hidden(set to 0 for paper-faithful linear). - ES is numpy OpenAI-ES, not CMA-ES. The paper uses CMA-ES from the
pycmalibrary. We re-implement the simpler fixed-σ ES. CMA-ES would likely improve sample efficiency and reduce per-seed variance; this is a candidate v2 follow-up. - No iterative V/M/C refinement. The paper’s full pipeline alternates
between collecting on-policy data with the current
C, retrainingM, and retrainingC(Ha & Schmidhuber 2018, §A). We implemented this loop (n_extra_iters) and tested it. On our small env the random-policy data already covers the relevant state distribution, so the iterative refinement did not improve final transfer. The default config setsn_extra_iters = 0. The capability is left in for v2 to test on harder envs. - Dream temperature implemented as additive Gaussian on
z_pred, not via MDN-RNN mixture sampling. Same effect (M’s prediction is blurred soCcannot exploit deterministic idiosyncrasies); cheaper to implement without a mixture model. - No frame-skip / action repeat. The paper repeats actions for 4 frames as a frame-skip. Our env runs at 1 step per action — its dynamics are slow enough already that frame-skip is unnecessary.
Open questions / next experiments
- VizDoom DoomTakeCover-v0 reproduction. The full v1.5 deferred goal: wire up VizDoom and reproduce the paper’s 1092 ± 556 score. Our numpy stub captures the algorithmic claim (dream-trained transfer) but cannot reproduce the specific number.
- Pure-linear C with the variance-reducing knobs. We chose the MLP C for the headline because of variance, but the paper’s linear C is the more striking claim (“almost no parameters, all the work is in V and M”). Worth a sweep with larger ES populations / iterations on multiple seeds to see whether pure-linear becomes reliable.
- MDN-RNN. Add a 5-component mixture density head to
Mand check whether it changes the dream-temperature interaction. Specifically, whether the additive-Gaussian shortcut underperforms proper mixture-temperature sampling on harder envs. - CMA-ES. Re-implement CMA-ES in pure numpy (no scipy) and check whether it improves seed-to-seed consistency.
- Iterative refinement on a harder env. Build a 2-D version with
obstacles or moving monsters where random-policy data clearly
doesn’t cover the relevant state distribution, and confirm
n_extra_iters > 0actually helps there. - ByteDMD / data-movement instrumentation (v2). Three distinct training stages — V (autoencoder, dense), M (recurrent BPTT), C (ES, effectively only forward passes) — with very different memory access patterns. The headline question for v2 is whether the world-models decomposition shifts where energy is spent: most of the cost should be in V/M training (one-time), with C training (the inner loop) very cheap because it doesn’t touch the real env or do gradient updates.