highway-networks
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training very deep networks. NIPS 2015 (arXiv:1507.06228).

Problem
A highway layer adds a learned gating mechanism to a feedforward block:
y = H(x) * T(x) + x * (1 - T(x))
H(x) = tanh(W_H x + b_H) is the transform branch and
T(x) = sigmoid(W_T x + b_T) is the transform gate. The complementary
(1 - T(x)) is the carry gate. Initialising b_T negative (we use
-2.0, paper uses -1 to -4) makes a fresh highway block start close
to the identity, so a randomly-initialised stack of N highway layers
behaves at init like an unrolled near-identity chain. Information and
gradients can flow end-to-end through the carry path, sidestepping the
vanishing-gradient pathology that prevents very deep plain feedforward
nets (with saturating nonlinearities) from training.
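The layer above can be sketched in a few lines of numpy (a minimal standalone illustration, not the repo's implementation; the helper name and shapes are ours):

```python
import numpy as np

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway block: y = H(x) * T(x) + x * (1 - T(x))."""
    H = np.tanh(W_H @ x + b_H)                     # transform branch
    T = 1.0 / (1.0 + np.exp(-(W_T @ x + b_T)))     # transform gate (sigmoid)
    return H * T + x * (1.0 - T)

# At b_T = -2 the gate opens only ~12% (sigmoid(-2) ~ 0.12), so a freshly
# initialised layer stays close to the identity map.
rng = np.random.default_rng(0)
d = 50
a = 1.0 / np.sqrt(d)                               # uniform +-1/sqrt(fan_in)
W_H = rng.uniform(-a, a, (d, d))
W_T = rng.uniform(-a, a, (d, d))
x = rng.standard_normal(d)
y = highway_layer(x, W_H, np.zeros(d), W_T, np.full(d, -2.0))
ratio = np.linalg.norm(y - x) / np.linalg.norm(x)  # small: near-identity
```

Note how the near-identity behaviour falls out of the bias init alone, with no special treatment of the weights.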
This stub reproduces the paper’s headline contrast on MNIST: at the same depth, same width, same activation, same optimiser, plain MLPs fail to train past ~5–10 layers, while highway nets train cleanly at depth 50.
Architecture
| Block | Shape | Activation |
|---|---|---|
| input projection | 784 → 50 | tanh |
| N hidden blocks | 50 → 50 (each) | tanh inside H; sigmoid in T |
| output | 50 → 10 | softmax + cross-entropy |
For the plain baseline, each hidden block is tanh(W x + b) with no
skip; otherwise everything (depth, width, init scale, optimiser, batches,
seed, dataset slice) is identical.
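A compact sketch of this architecture, assuming the shapes in the table above (helper names are hypothetical; the real `DeepNet` lives in `highway_networks.py`):

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(fan_in, fan_out):
    a = 1.0 / np.sqrt(fan_in)                 # uniform +-1/sqrt(fan_in) init
    return rng.uniform(-a, a, (fan_out, fan_in)), np.zeros(fan_out)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params, block="highway"):
    (W_in, b_in), hidden, (W_out, b_out) = params
    h = np.tanh(W_in @ x + b_in)              # input projection, 784 -> 50
    for W_H, b_H, W_T, b_T in hidden:         # N hidden blocks, 50 -> 50
        if block == "highway":
            H = np.tanh(W_H @ h + b_H)
            T = sigmoid(W_T @ h + b_T)
            h = H * T + h * (1.0 - T)
        else:                                 # plain baseline: no gate, no skip
            h = np.tanh(W_H @ h + b_H)
    logits = W_out @ h + b_out                # output head, 50 -> 10
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # softmax probabilities

depth, width = 30, 50
hidden = []
for _ in range(depth):
    W_H, b_H = linear(width, width)
    W_T, _ = linear(width, width)
    hidden.append((W_H, b_H, W_T, np.full(width, -2.0)))  # b_T init = -2
params = (linear(784, width), hidden, linear(width, 10))
probs = forward(rng.standard_normal(784), params)
```

Switching `block` between `"highway"` and `"plain"` is the only difference between the two nets; everything else is shared.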
Files
| File | Purpose |
|---|---|
| highway_networks.py | MNIST loader (idx files, cached at ~/.cache/hinton-mnist/), DeepNet class with block ∈ {highway, plain}, manual forward + backward pass, gradient-clipped Adam, headline contrast trainer + depth sweep + multi-seed support. CLI with --seed, --depth, --depths, --quick. |
| visualize_highway_networks.py | Reads run.json and run_sweep.json and writes 5 PNGs to viz/. |
| make_highway_networks_gif.py | Builds highway_networks.gif from per-epoch snapshots in run.json. |
| run.json | Headline result: depth 30, seed 0 (committed). |
| run_sweep.json | Depth sweep over {5, 10, 20, 30, 50}, seed 0 (committed). |
| highway_networks.gif | Training-dynamics animation (12 frames, 106 KB). |
| viz/ | 5 static PNGs (see below). |
Running
Headline run (≈ 7 s on M-series CPU):
python3 highway_networks.py --seed 0
Depth sweep used in §Results table (≈ 60 s):
python3 highway_networks.py --seed 0 --depths 5,10,20,30,50 --out run_sweep.json
Quick smoke (depth 10, 5 epochs, ≈ 0.5 s):
python3 highway_networks.py --seed 0 --quick
Then regenerate viz:
python3 visualize_highway_networks.py
python3 make_highway_networks_gif.py
MNIST is loaded from ~/.cache/hinton-mnist/ if present (idx-format
gzipped files, the same cache layout used by hinton-problems). If
absent, the loader downloads from the public OSSCI MNIST mirror to that
cache; subsequent runs reuse the cache.
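The cache-then-download logic can be sketched roughly as follows (the mirror URL and the idx-parsing details are assumptions based on the standard MNIST distribution, not copied from the loader):

```python
import gzip
import os
import struct
import urllib.request

import numpy as np

CACHE = os.path.expanduser("~/.cache/hinton-mnist")
# Assumed mirror URL; the real loader's constant may differ.
MIRROR = "https://ossci-datasets.s3.amazonaws.com/mnist/"

def fetch(name):
    """Return a cached local path, downloading on first use."""
    os.makedirs(CACHE, exist_ok=True)
    path = os.path.join(CACHE, name)
    if not os.path.exists(path):
        urllib.request.urlretrieve(MIRROR + name, path)
    return path

def read_idx(path):
    """Parse a gzipped idx file: 2 zero bytes, dtype code, ndim, dims, data."""
    with gzip.open(path, "rb") as f:
        _, _, ndim = struct.unpack(">HBB", f.read(4))
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)

# images = read_idx(fetch("train-images-idx3-ubyte.gz"))  # shape (60000, 28, 28)
```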
Results
Single-seed headline (--seed 0 --depth 30 --hidden 50 --epochs 12 --batch 128 --lr 5e-3 --n-train 6000 --n-test 2000):
| Net | Final test acc | Final train loss | Wallclock |
|---|---|---|---|
| highway, depth 30 | 0.926 | 0.189 | 4.9 s |
| plain, depth 30 | 0.124 (≈ chance) | 2.302 ≈ log(10) | 1.9 s |
The plain net’s training loss stays pinned at log(10) ≈ 2.303 (the cross-entropy of a uniform prediction over 10 classes) for the entire run: gradients vanish through 30 saturating tanh layers, and the output never moves off chance.
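The collapse is easy to demonstrate directly: push a gradient backwards through 30 randomly initialised tanh layers at this width and init scale, and compare it with the multiplicative gain of the highway carry path at the b_T = −2 init (a standalone numpy sketch, independent of the repo code):

```python
import numpy as np

rng = np.random.default_rng(0)
D, depth = 50, 30

def init_W():
    a = 1.0 / np.sqrt(D)              # same uniform +-1/sqrt(fan_in) init
    return rng.uniform(-a, a, (D, D))

# Plain stack h <- tanh(W h): backprop a dummy gradient of ones from the top.
Ws = [init_W() for _ in range(depth)]
h, cache = rng.standard_normal(D), []
for W in Ws:
    z = W @ h
    cache.append(z)
    h = np.tanh(z)
g = np.ones(D)                        # dL/dh at the output
for z, W in zip(reversed(cache), reversed(Ws)):
    g = W.T @ (g * (1.0 - np.tanh(z) ** 2))   # chain rule through tanh(W h)
plain_norm = np.linalg.norm(g)        # shrinks geometrically with depth

# Highway carry path: each layer scales the gradient by roughly
# (1 - T) = 1 - sigmoid(-2) ~ 0.88 at init, which survives 30 layers.
carry_gain = (1.0 - 1.0 / (1.0 + np.exp(2.0))) ** depth
```

At this init scale each plain layer contracts the gradient by roughly 1/√3, so 30 layers wipe out the signal, while the carry path keeps a few percent of it, enough for Adam to get traction.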
Depth sweep (same hyperparameters, seed 0):
| Depth | Highway test acc | Plain test acc | Highway train loss | Plain train loss |
|---|---|---|---|---|
| 5 | 0.903 | 0.857 | 0.190 | 0.478 |
| 10 | 0.913 | 0.292 | 0.187 | 1.773 |
| 20 | 0.910 | 0.098 | 0.215 | 2.303 |
| 30 | 0.926 | 0.124 | 0.189 | 2.302 |
| 50 | 0.905 | 0.124 | 0.301 | 2.302 |
The plain MLP holds up at depth 5, only partially trains at depth 10, and fails completely at depth ≥ 20 (test accuracy stuck at chance, loss stuck at log(10)). The highway net is essentially flat across the whole sweep: extra depth costs nothing.
Multi-seed verification at depth 30 (3 seeds, default settings; not saved):
| Seed | Highway test acc | Plain test acc |
|---|---|---|
| 0 | 0.926 | 0.124 |
| 1 | 0.904 | 0.119 |
| 2 | 0.893 | 0.111 |
3/3 seeds produce the same headline ordering with no overlap between highway and plain accuracies.
Hyperparameters
| Parameter | Value |
|---|---|
| optimiser | Adam, β₁=0.9, β₂=0.999, ε=1e-8 |
| learning rate | 5e-3 |
| gradient clip (L2) | 5.0 |
| batch size | 128 |
| epochs | 12 |
| n_train | 6 000 (random subset of 60 k MNIST training set) |
| n_test | 2 000 (random subset of 10 k MNIST test set) |
| hidden width | 50 |
| activation in H | tanh |
| transform-gate bias init | −2.0 |
| weight init | uniform ± 1/√fan_in |
| seed | 0 (CLI flag) |
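For reference, one optimiser update under these settings can be sketched as follows (a minimal standalone version; parameter layout and state handling differ from the trainer in highway_networks.py):

```python
import numpy as np

def adam_step(params, grads, state, lr=5e-3, b1=0.9, b2=0.999,
              eps=1e-8, clip=5.0):
    """One Adam update with global-L2 gradient clipping (values from the
    table above)."""
    gnorm = np.sqrt(sum(np.sum(g * g) for g in grads))
    if gnorm > clip:                              # rescale all grads together
        grads = [g * (clip / gnorm) for g in grads]
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = state["m"][i] / (1 - b1 ** t)     # bias-corrected moments
        v_hat = state["v"][i] / (1 - b2 ** t)
        new_params.append(p - lr * m_hat / (np.sqrt(v_hat) + eps))
    return new_params

params = [np.zeros(3)]
state = {"t": 0, "m": [np.zeros(3)], "v": [np.zeros(3)]}
params = adam_step(params, [np.array([10.0, 0.0, 0.0])], state)
```

The clip is applied to the global L2 norm across all parameter tensors before the moment updates, so a single exploding layer rescales the whole gradient rather than being clipped in isolation.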
Visualizations
| File | What it shows |
|---|---|
| viz/learning_curves.png | Test accuracy per epoch, highway vs plain at depth 30. Highway climbs to 0.93; plain hugs the chance line. |
| viz/plain_loss_collapse.png | Train loss per epoch. Plain loss flat at log(10) (no signal); highway descends from 1.6 to 0.19. |
| viz/depth_sweep.png | Final test accuracy as a function of depth (5 → 50). Highway is roughly flat at ~0.91. Plain crashes from 0.86 (depth 5) to chance (depth 20+). |
| viz/T_gate_evolution.png | Per-layer mean(T) on a held-out batch, plotted over training. Lower layers (input side) develop higher T (more transform); upper layers (output side) keep T low and rely on the carry path. |
| viz/T_gate_final.png | Final per-layer mean(T) at depth 30. Bars vs the init T = sigmoid(−2) ≈ 0.119 baseline. The transform gate has learned a per-layer schedule from data. |
| highway_networks.gif | 12-frame animation: top panel grows the test-accuracy curves frame by frame; bottom panel updates the per-layer T-gate bar chart. Visualises both the headline contrast and the gate’s gradual specialisation. |
Deviations from the original
| What | Paper | Here | Why |
|---|---|---|---|
| Activation in H | mostly Maxout (and ReLU in some figures) | tanh | The paper’s central failure-of-plain-nets demonstration uses saturating nonlinearities (Fig 2 caption uses sigmoid/tanh). Tanh makes the contrast crisp on a laptop budget; ReLU plain nets train at modest depth even without skips, which would obscure the headline. |
| Width | 50–71 units (their MNIST table 1 uses 50) | 50 | Matches the paper’s MNIST setup. |
| Depth | sweep 10/20/50/100 (with 50 the headline FC point) | sweep 5/10/20/30/50; headline 30 | 100-layer manual numpy backprop is feasible but exceeds the wave’s wallclock target. The contrast saturates by depth 20, so 30/50 already make the point. |
| Optimiser | SGD-momentum, hand-scheduled LR | Adam, fixed LR=5e-3 | Faster, no schedule tuning, well within the spec’s pure-numpy + matplotlib constraint. |
| Training set | full 60 k MNIST | random 6 k subset (seeded) | Keeps headline run < 10 s. The contrast (highway trains, plain fails at chance loss) is depth-driven, not data-driven; we verified this on 3 seeds. |
| Test set | full 10 k | random 2 k subset (seeded) | Variance check: 3 seeds give consistent ranking. |
| b_T init | −1 to −4 | −2.0 | Midpoint of the paper’s range. |
| H weight init | small Gaussian | uniform ± 1/√fan_in | Standard for tanh; matches the rest of this catalog. |
| Conv-highway on CIFAR-10/100 | yes (paper Sec 5) | not in v1 | Out of scope for this stub; CIFAR-conv lives in mcdnn-image-bench. |
Open questions / next experiments
- Reproduce the 100-layer claim. The paper’s signature image is the 100-layer FC highway net training on MNIST. We stop at depth 50 to fit the wave budget; a 100-layer run on the full 60 k training set under the paper’s SGD-momentum schedule is the natural follow-up.
- Convolutional highway on CIFAR. Sec 5 of the paper trains 19- and 32-layer conv highways to 7.6 % / 32.24 % error on CIFAR-10/100. Pure-numpy conv is heavy but tractable; v1.5 candidate.
- Block-wise highway vs ResNet vs LSTM. The Srivastava paper notes the link to LSTM gating; a controlled side-by-side of highway, residual (y = x + H(x)), and plain blocks at matched depth on the same task would isolate what the learned gate buys over a fixed identity skip.
- ByteDMD instrumentation (v2). Highway carry paths might trace different memory-access patterns than plain MLPs of the same depth. Whether the carry path saves data movement (vs just gradient flow) is open, and exactly the question wave-9 sets up.
- What does T learn? The paper inspects T-gate activity per example and finds it routes different inputs through different layer-paths. We log mean(T) per layer but not per-example; an extension would dump full T tensors and cluster the routing patterns.