Muon Optimizer Literature Review¶
You are contributing to the Sutro Group, a research lab studying energy-efficient AI training. Your task is a literature review of the Muon optimizer.
Project context¶
Read these files first:
- DISCOVERIES.md — what's already known
- LAB.md — lab rules (important: do NOT modify measurement code, tracker.py, cache_tracker.py, data.py, config.py, harness.py)
- findings/_template.md — reference for expected report format
- We optimize for energy efficiency, not just accuracy or speed
- Our benchmark: sparse parity (n=20 bits, k=3 secret, 17 noise)
- Our metric: ByteDMD (https://github.com/cybertronai/ByteDMD) — successor to DMC. Tracks memory at byte level (not per-element), using integer arithmetic. Reference implementation has been hardened against escape hatches (agents were bypassing TrackedArray via np.asarray() → Python ints). See also: https://github.com/cybertronai/ByteDMD-examples for test cases.
- Baseline SGD: ~0.12s wall-clock, high DMD (exact ByteDMD score unmeasured — you'd need to run the ByteDMD tracer)
- Best known: GF(2) Gaussian Elimination (~509us, low byte-level access count), KM-min (~1ms)
Task: Research the Muon optimizer¶
1. Read the Muon paper¶
Primary source: https://kellerjordan.github.io/posts/muon/
Key facts to extract: - The algorithm: Newton-Schulz iteration on SGD momentum updates to orthogonalize gradients - How it replaces Adam's element-wise adaptation with matrix orthogonalization - Performance claims (CIFAR-10, NanoGPT, 1.5B parameter models) - Computational overhead (<1% FLOP overhead claimed) - Scope: only for 2D hidden-layer weights, paired with AdamW for embeddings/biases
2. Analyze relevance to energy efficiency¶
Study the ByteDMD metric: https://github.com/cybertronai/ByteDMD
Investigate: - Does Newton-Schulz iteration reduce byte-level data movement compared to Adam's moment tracking (first + second moment buffers)? - What's the FLOP/memory tradeoff? Does it reduce byte accesses at the cost of more compute? - Does orthogonalization change reuse distance patterns compared to Adam or SGD? - Would Muon help on small networks (hidden=200, our sparse parity setup) or only large LLMs?
3. Write findings¶
Create findings/exp_muon_review.md following this structure:
# Muon Optimizer — Literature Review
## Hypothesis
[What we expect Muon to help with (or not) for energy efficiency]
## Key Facts from Paper
- [5-10 bullets on algorithm, results, limitations]
## Relevance to Sutro Group
[Would Muon improve energy efficiency on sparse parity? Why/why not?]
## Comparison to Our Methods
[How Muon relates to our best methods: SGD, GF(2), KM-min]
## Open Questions
[What we'd need to test experimentally to know for sure]
## References
- [Links to paper, code, follow-ups]
Rules¶
- DO NOT modify any experiment/run code (tracker.py, cache_tracker.py, data.py, config.py, harness.py, fast.py, src/)
- DO NOT run experiments — this is a literature review only
- Write findings to
findings/exp_muon_review.md - Check
DISCOVERIES.mdfirst to avoid repeating known results - Do NOT change LAB.md, DISCOVERIES.md, CLAUDE.md, CODEX.md, or any other project configuration