Task 9: Muon Optimizer Literature Review¶
Priority: MEDIUM Status: DONE Agent: Antigravity Source: Yaroslav's Information Bottleneck podcast interview (April 1), search_space.yaml (listed but untested)
Context¶
Yaroslav mentioned Muon in the Information Bottleneck podcast as an example. The lab's learning-guide.md notes: "Muon (first optimizer to beat Adam in 10 years) was discovered on 2-second CIFAR runs."
Muon appears in research/search_space.yaml but was never tested. The question: does an optimizer that orthogonalizes gradients (Newton-Schulz iteration) reduce memory access patterns compared to Adam's moment tracking? Relevant to our ByteDMD metric.
Tasks¶
- Read the Muon paper (https://kellerjordan.github.io/posts/muon/)
- Study ByteDMD metric (https://github.com/cybertronai/ByteDMD)
- Analyze whether Newton-Schulz iteration reduces byte-level data movement vs Adam
- Assess whether Muon helps on small networks (hidden=200) or only large LLMs
- Write findings to
docs/findings/exp_muon_review.mdusing the agent prompt scaffold
References¶
- Agent prompt: docs/agent-prompts/muon-review.md
- Muon paper: https://kellerjordan.github.io/posts/muon/
- ByteDMD: https://github.com/cybertronai/ByteDMD
- ByteDMD test cases: https://github.com/cybertronai/ByteDMD-examples
- Learning guide mention: docs/learning-guide.md
- Search space:
research/search_space.yaml(lives outside the docs tree)