highway-networks

Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training very deep networks. NIPS 2015 (arXiv:1507.06228).

[highway_networks.gif: highway-networks training dynamics]

Problem

A highway layer adds a learned gating mechanism to a feedforward block:

y = H(x) * T(x)  +  x * (1 - T(x))

H(x) = tanh(W_H x + b_H) is the transform branch and T(x) = sigmoid(W_T x + b_T) is the transform gate; the complement (1 - T(x)) is the carry gate. Initialising b_T negative (we use -2.0; the paper uses -1 to -4) makes a fresh highway block start close to the identity, so a randomly initialised stack of N highway layers behaves at init like a near-identity chain. Information and gradients can flow end-to-end through the carry path, sidestepping the vanishing-gradient pathology that prevents very deep plain feedforward nets (with saturating nonlinearities) from training.
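
A minimal numpy sketch of one block under these definitions, using the batch convention x: (B, d) (the real implementation in highway_networks.py also carries the matching backward pass):

```python
import numpy as np

def highway_block(x, W_H, b_H, W_T, b_T):
    """y = H(x)*T(x) + x*(1 - T(x)) for a batch x of shape (B, d)."""
    H = np.tanh(x @ W_H + b_H)                    # transform branch
    T = 1.0 / (1.0 + np.exp(-(x @ W_T + b_T)))    # transform gate (sigmoid)
    return H * T + x * (1.0 - T)                  # (1 - T) is the carry gate

# At init with b_T = -2.0, T ≈ sigmoid(-2) ≈ 0.12 everywhere,
# so y ≈ 0.88*x + 0.12*H(x): each fresh block is nearly the identity.
```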

This stub reproduces the paper’s headline contrast on MNIST: at the same depth, same width, same activation, same optimiser, plain MLPs fail to train past ~5–10 layers, while highway nets train cleanly at depth 50.

Architecture

| Block | Shape | Activation |
|---|---|---|
| input projection | 784 → 50 | tanh |
| N hidden blocks | 50 → 50 (each) | tanh inside H; sigmoid in T |
| output | 50 → 10 | softmax + cross-entropy |

For the plain baseline, each hidden block is tanh(W x + b) with no skip; otherwise everything (depth, width, init scale, optimiser, batches, seed, dataset slice) is identical.
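
To make the baseline comparison concrete, here is a sketch of how such a stack could be assembled and run forward; init_layer and forward are hypothetical helpers, while the real DeepNet class also implements the manual backward pass:

```python
import numpy as np

def init_layer(fan_in, fan_out, rng, b_T_init=-2.0):
    # uniform +/- 1/sqrt(fan_in) weights (see Hyperparameters); gate bias -2.0
    s = 1.0 / np.sqrt(fan_in)
    return (rng.uniform(-s, s, (fan_in, fan_out)), np.zeros(fan_out),
            rng.uniform(-s, s, (fan_in, fan_out)), np.full(fan_out, b_T_init))

def forward(x, proj, blocks, out, kind="highway"):
    h = np.tanh(x @ proj[0] + proj[1])            # input projection, 784 -> 50
    for W_H, b_H, W_T, b_T in blocks:             # N hidden blocks, 50 -> 50
        H = np.tanh(h @ W_H + b_H)
        if kind == "highway":
            T = 1.0 / (1.0 + np.exp(-(h @ W_T + b_T)))
            h = H * T + h * (1.0 - T)             # gated carry path
        else:
            h = H                                 # plain baseline: no skip
    return h @ out[0] + out[1]                    # logits, 50 -> 10
```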

Files

| File | Purpose |
|---|---|
| highway_networks.py | MNIST loader (idx files, cached at ~/.cache/hinton-mnist/), DeepNet class with block ∈ {highway, plain}, manual forward + backward pass, gradient-clipped Adam, headline contrast trainer + depth sweep + multi-seed support. CLI with --seed, --depth, --depths, --quick. |
| visualize_highway_networks.py | Reads run.json and run_sweep.json and writes 5 PNGs to viz/. |
| make_highway_networks_gif.py | Builds highway_networks.gif from per-epoch snapshots in run.json. |
| run.json | Headline result: depth 30, seed 0 (committed). |
| run_sweep.json | Depth sweep over {5, 10, 20, 30, 50}, seed 0 (committed). |
| highway_networks.gif | Training-dynamics animation (12 frames, 106 KB). |
| viz/ | 5 static PNGs (see below). |

Running

Headline run (≈ 7 s on M-series CPU):

python3 highway_networks.py --seed 0

Depth sweep used in the §Results table (≈ 60 s):

python3 highway_networks.py --seed 0 --depths 5,10,20,30,50 --out run_sweep.json

Quick smoke (depth 10, 5 epochs, ≈ 0.5 s):

python3 highway_networks.py --seed 0 --quick

Then regenerate viz:

python3 visualize_highway_networks.py
python3 make_highway_networks_gif.py

MNIST is loaded from ~/.cache/hinton-mnist/ if present (idx-format gzipped files, the same cache layout used by hinton-problems). If absent, the loader downloads from the public OSSCI MNIST mirror to that cache; subsequent runs reuse the cache.
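
A sketch of that cache-or-download logic for the image files, assuming the commonly used OSSCI mirror URL https://ossci-datasets.s3.amazonaws.com/mnist/ (the actual loader in highway_networks.py may differ in detail):

```python
import gzip, os, urllib.request
import numpy as np

CACHE = os.path.expanduser("~/.cache/hinton-mnist")
MIRROR = "https://ossci-datasets.s3.amazonaws.com/mnist/"  # assumed mirror URL

def load_idx_images(name):  # e.g. "train-images-idx3-ubyte.gz"
    path = os.path.join(CACHE, name)
    if not os.path.exists(path):
        os.makedirs(CACHE, exist_ok=True)
        urllib.request.urlretrieve(MIRROR + name, path)
    with gzip.open(path, "rb") as f:
        data = np.frombuffer(f.read(), dtype=np.uint8)
    # idx3 header: magic, count, rows, cols — four big-endian int32s (16 bytes)
    return data[16:].reshape(-1, 784).astype(np.float64) / 255.0
```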

Results

Single-seed headline (--seed 0 --depth 30 --hidden 50 --epochs 12 --batch 128 --lr 5e-3 --n-train 6000 --n-test 2000):

| Net | Final test acc | Final train loss | Wallclock |
|---|---|---|---|
| highway, depth 30 | 0.926 | 0.189 | 4.9 s |
| plain, depth 30 | 0.124 (≈ chance) | 2.302 ≈ log(10) | 1.9 s |

The plain net’s training loss stays pinned at log(10) ≈ 2.303 (the cross-entropy of a uniform guess over 10 classes) for the entire run: gradients vanish through 30 saturating tanh layers, so the output never moves off chance.
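
That pinned value is exactly the cross-entropy of a uniform prediction; a one-line check:

```python
import numpy as np
print(-np.log(1.0 / 10))  # 2.302585…, the plain net's flat training loss
```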

Depth sweep (same hyperparameters, seed 0):

| Depth | Highway test acc | Plain test acc | Highway train loss | Plain train loss |
|---|---|---|---|---|
| 5 | 0.903 | 0.857 | 0.190 | 0.478 |
| 10 | 0.913 | 0.292 | 0.187 | 1.773 |
| 20 | 0.910 | 0.098 | 0.215 | 2.303 |
| 30 | 0.926 | 0.124 | 0.189 | 2.302 |
| 50 | 0.905 | 0.124 | 0.301 | 2.302 |

The plain MLP holds at depth 5, partially trains at depth 10, and fails completely at depth ≥ 20 (test accuracy stuck at chance; loss stuck at log(10)). The highway net is essentially flat across the whole sweep: depth costs nothing.

Multi-seed verification at depth 30 (3 seeds, default settings; not saved):

| Seed | Highway test acc | Plain test acc |
|---|---|---|
| 0 | 0.926 | 0.124 |
| 1 | 0.904 | 0.119 |
| 2 | 0.893 | 0.111 |

3/3 seeds produce the same headline ordering with no overlap between highway and plain accuracies.

Hyperparameters

| Parameter | Value |
|---|---|
| optimiser | Adam, β₁=0.9, β₂=0.999, ε=1e-8 |
| learning rate | 5e-3 |
| gradient clip (L2) | 5.0 |
| batch size | 128 |
| epochs | 12 |
| n_train | 6 000 (random subset of the 60 k MNIST training set) |
| n_test | 2 000 (random subset of the 10 k MNIST test set) |
| hidden width | 50 |
| activation in H | tanh |
| transform-gate bias init | −2.0 |
| weight init | uniform ± 1/√fan_in |
| seed | 0 (CLI flag) |
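
A sketch of one gradient-clipped Adam update with the settings above (hypothetical structure; in highway_networks.py this is fused into the training loop):

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=5e-3, clip=5.0,
              b1=0.9, b2=0.999, eps=1e-8):
    # global L2-norm clipping at 5.0, applied before the Adam update
    norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    if norm > clip:
        grads = [g * (clip / norm) for g in grads]
    for p, g, mi, vi in zip(params, grads, m, v):
        mi[...] = b1 * mi + (1 - b1) * g          # first-moment EMA
        vi[...] = b2 * vi + (1 - b2) * g * g      # second-moment EMA
        m_hat = mi / (1 - b1 ** t)                # bias correction, step t >= 1
        v_hat = vi / (1 - b2 ** t)
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)  # in-place parameter update
```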

Visualizations

| File | What it shows |
|---|---|
| viz/learning_curves.png | Test accuracy per epoch, highway vs plain at depth 30. Highway climbs to 0.93; plain hugs the chance line. |
| viz/plain_loss_collapse.png | Train loss per epoch. Plain loss flat at log(10) (no signal); highway descends from 1.6 to 0.19. |
| viz/depth_sweep.png | Final test accuracy as a function of depth (5 → 50). Highway is roughly flat at ~0.91; plain crashes from 0.86 (depth 5) to chance (depth 20+). |
| viz/T_gate_evolution.png | Per-layer mean(T) on a held-out batch, plotted over training. Lower layers (input side) develop higher T (more transform); upper layers (output side) keep T low and rely on the carry path. |
| viz/T_gate_final.png | Final per-layer mean(T) at depth 30, as bars against the init baseline T = sigmoid(−2) ≈ 0.119. The transform gate has learned a per-layer schedule from data. |
| highway_networks.gif | 12-frame animation: the top panel grows the test-accuracy curves frame by frame; the bottom panel updates the per-layer T-gate bar chart. Visualises both the headline contrast and the gate’s gradual specialisation. |
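
The per-layer mean(T) statistic behind the two T-gate plots is cheap to compute on a held-out batch; a sketch (hypothetical helper, assuming post-projection activations h of shape (B, 50)):

```python
import numpy as np

def mean_gate_activity(h, blocks):
    """Mean transform-gate activation per hidden layer on one batch."""
    means = []
    for W_H, b_H, W_T, b_T in blocks:
        T = 1.0 / (1.0 + np.exp(-(h @ W_T + b_T)))
        means.append(float(T.mean()))             # one scalar per layer
        h = np.tanh(h @ W_H + b_H) * T + h * (1.0 - T)
    return means  # init baseline: sigmoid(-2) ≈ 0.119 at every layer
```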

Deviations from the original

| What | Paper | Here | Why |
|---|---|---|---|
| Activation in H | mostly Maxout (and ReLU in some figures) | tanh | The paper’s central failure-of-plain-nets demonstration uses saturating nonlinearities (the Fig. 2 caption uses sigmoid/tanh). Tanh makes the contrast crisp on a laptop budget; ReLU plain nets train at modest depth even without skips, which would obscure the headline. |
| Width | 50–71 units (their MNIST Table 1 uses 50) | 50 | Matches the paper’s MNIST setup. |
| Depth | sweep 10/20/50/100 (with 50 the headline FC point) | sweep 5/10/20/30/50; headline 30 | A 100-layer manual numpy backward pass is feasible but exceeds the wave’s wallclock target. The contrast saturates by depth 20, so 30/50 already make the point. |
| Optimiser | SGD-momentum, hand-scheduled LR | Adam, fixed LR = 5e-3 | Faster, no schedule tuning, well within the spec’s pure-numpy + matplotlib constraint. |
| Training set | full 60 k MNIST | random 6 k subset (seeded) | Keeps the headline run < 10 s. The contrast (highway trains, plain fails at chance loss) is depth-driven, not data-driven; we verified this on 3 seeds. |
| Test set | full 10 k | random 2 k subset (seeded) | Variance check: 3 seeds give a consistent ranking. |
| b_T init | −1 to −4 | −2.0 | Middle of the paper’s range. |
| H weight init | small Gaussian | uniform ± 1/√fan_in | Standard for tanh; matches the rest of this catalog. |
| Conv-highway on CIFAR-10/100 | yes (paper Sec. 5) | not in v1 | Out of scope for this stub; CIFAR conv lives in mcdnn-image-bench. |

Open questions / next experiments

  • Reproduce the 100-layer claim. The paper’s signature image is the 100-layer FC highway net training on MNIST. We stop at depth 50 to fit the wave budget; a 100-layer run on the full 60 k training set under the paper’s SGD-momentum schedule is the natural follow-up.
  • Convolutional highway on CIFAR. Sec 5 of the paper trains 19- and 32-layer conv highways to 7.6 % / 32.24 % on CIFAR-10/100. Pure-numpy conv is heavy but tractable; v1.5 candidate.
  • Block-wise highway vs ResNet vs LSTM. The Srivastava paper notes the link to LSTM gating; a controlled side-by-side of (highway, residual y = x + H(x), plain) at matched depth on the same task would isolate what the gate buys you over a fixed identity skip.
  • ByteDMD instrumentation (v2). Highway carry paths might trace different memory access patterns than plain MLPs of the same depth. Whether the carry path saves data movement (vs just gradient flow) is open and exactly the question wave-9 sets up.
  • What does T learn? The paper inspects T-gate activity per example and finds it routes different inputs through different layer-paths. We log mean(T) per layer but not per-example; an extension would dump full T tensors and cluster the routing patterns.