Context¶
What is this?¶
A research environment for the Sutro Group's work on energy-efficient AI training. The group's thesis: go back to 1960s-era AI problems and reinvent learning algorithms using modern tools (AI agents, compute), with energy efficiency as the optimization target.
Why Sparse Parity?¶
Sparse parity is the "drosophila" of learning tasks:
- Simplest non-trivial learning problem (XOR was the example Minsky and Papert used in *Perceptrons* (1969), which helped trigger the first AI winter)
- Easy to scale difficulty (add noise bits)
- Fast to iterate (0.12s with numpy, <2s even in pure Python)
- Exposes memory access patterns in backprop
- Well-studied in theory (Barak et al. 2022, Kou et al. 2024)
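For reference, the task itself fits in a few lines of numpy. This is a minimal sketch of the standard formulation (the label is the XOR of k hidden "relevant" bits among n input bits); the dimensions, function name, and seed here are illustrative, not the group's exact setup:

```python
import numpy as np

def sparse_parity_data(n_samples, n_bits=20, k=3, seed=0):
    """Generate a sparse parity dataset: labels are the XOR (sum mod 2)
    of k secret 'relevant' bit positions among n_bits input bits."""
    rng = np.random.default_rng(seed)
    relevant = np.sort(rng.choice(n_bits, size=k, replace=False))
    X = rng.integers(0, 2, size=(n_samples, n_bits))
    y = X[:, relevant].sum(axis=1) % 2
    return X, y, relevant

X, y, relevant = sparse_parity_data(1024)
```

Scaling difficulty is then just a matter of growing `n_bits` (more noise bits) while holding `k` fixed.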
What We Found¶
33 experiments across two phases. Full ranked results in the Practitioner's Field Guide.
Phase 1: SGD optimization (16 experiments)¶
The 20-bit problem was "unsolvable" at LR=0.5; at LR=0.1 it solves in 5 epochs. From there we optimized ARD within the SGD framework, hitting a ceiling at ~10% because W1 accounts for 75% of all float reads. Forward-Forward has 25x worse ARD than backprop for 2-layer networks. Curriculum learning broke the scaling wall (n=50). The cache simulator showed that the L2 cache eliminates all misses.
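As a back-of-the-envelope illustration of why W1 dominates, one can tally the weight-matrix reads in a single forward+backward pass of a dense 2-layer network. This is not the group's actual ARD accounting (which also covers activations and other tensors); the layer sizes and the read-counting convention here are assumptions:

```python
def weight_reads(n_in=20, n_hidden=64, n_out=1):
    """Fraction of per-example weight-matrix float reads attributable to
    each layer, assuming each matmul reads its weight matrix once in the
    forward pass and once again in the backward pass."""
    reads = {
        "W1": 2 * n_in * n_hidden,   # forward x @ W1, backward grad @ W1.T
        "W2": 2 * n_hidden * n_out,  # forward h @ W2, backward grad @ W2.T
    }
    total = sum(reads.values())
    return {k: v / total for k, v in reads.items()}
```

With a narrow output layer, W1's reads swamp W2's regardless of the hidden width, which is why per-layer tweaks elsewhere buy so little.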
Phase 2: Broad search (17 experiments)¶
Parity is linear over GF(2): GF(2) Gaussian elimination solves the task in 509 microseconds, 240x faster than SGD. Kushilevitz-Mansour influence estimation achieves ARD 1,585 (724x better than Fourier). All four local learning rules tested (Hebbian, Predictive Coding, Equilibrium Propagation, Target Propagation) fail at chance level, because parity requires detecting k-th order interactions.
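The linearity observation can be sketched directly: each labeled sample is one linear equation `x · a = y (mod 2)`, so row reduction over GF(2) recovers the parity mask exactly. A minimal numpy implementation (not the group's benchmarked solver; XOR row operations stand in for a bit-packed version):

```python
import numpy as np

def solve_parity_gf2(X, y):
    """Recover the parity mask a from samples (X, y) by Gaussian
    elimination over GF(2) on the augmented matrix [X | y]."""
    A = (np.hstack([X, y[:, None]]) % 2).astype(np.uint8)
    m, n = X.shape
    row, pivots = 0, []
    for col in range(n):
        hits = np.nonzero(A[row:, col])[0]
        if len(hits) == 0:
            continue                      # free column, no pivot here
        p = row + hits[0]
        A[[row, p]] = A[[p, row]]         # swap pivot row into place
        clear = A[:, col].copy()
        clear[row] = 0
        A[clear.astype(bool)] ^= A[row]   # XOR pivot row into all others
        pivots.append(col)
        row += 1
        if row == m:
            break
    a = np.zeros(n, dtype=np.uint8)
    for r, col in enumerate(pivots):      # read solution off the RREF
        a[col] = A[r, n]
    return a
```

With a few hundred random samples over 20 bits the system is full rank with overwhelming probability, so the recovered mask is unique and exact, with no training loop at all.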
Key insight¶
For small k, sparse parity is a search problem, not a learning problem. The neural network was solving an easy problem the hard way.
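The "search, not learning" framing is concrete: if k is known and small, exhaustively checking every candidate subset is trivial, e.g. C(20, 3) = 1140 candidates. An illustrative sketch (assumes k is given, which the learning setting does not):

```python
from itertools import combinations
import numpy as np

def search_parity_mask(X, y, k):
    """Brute-force search over all C(n, k) subsets of bit positions;
    return the subset whose XOR matches the labels on every sample."""
    n = X.shape[1]
    for subset in combinations(range(n), k):
        if np.array_equal(X[:, list(subset)].sum(axis=1) % 2, y):
            return subset
    return None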
The Bigger Picture¶
Yaroslav's roadmap defines three axes of progress:
- Process (orange): improve how agents find better algorithms. Multiple members built independent harnesses (Claude Code, Replit Research OS, plain Claude). The process itself is the product.
- Metric (green): make the energy proxy more realistic. Started with ARD, added DMC (Data Movement Complexity, Ding et al.). Next step: actual GPU measurement on an H100.
- Problem (blue): make the task harder. Sparse parity is practice. The final exam is energy-efficient training of nanoGPT.
Take small steps along one axis at a time to keep complexity manageable. The group explicitly avoids premature partitioning (optimizing training but not inference, or math but not kernels).
Timeline¶
```mermaid
gantt
    title Sutro Group Timeline
    dateFormat YYYY-MM-DD
    section Meetings
    Meeting 1 - Energy Intro       :m1, 2026-01-19, 1d
    Meeting 2 - Forward-Forward    :m2, 2026-01-26, 1d
    Meeting 3 - Joules Measuring   :m3, 2026-02-02, 1d
    Meeting 4 - Beauty to Joules   :m4, 2026-02-09, 1d
    Meeting 5 - Intelligence/Joule :m5, 2026-02-16, 1d
    Meeting 6 - Presentations      :m6, 2026-02-23, 1d
    Meeting 7 - Sparse Parity      :m7, 2026-03-02, 1d
    Meeting 8 - Demos + Roadmap    :m8, 2026-03-09, 1d
    section Research
    Sprint 1 (ARD baseline)        :s1, 2026-03-02, 1d
    Sprint 2 (solve 20-bit)        :s2, 2026-03-03, 2d
    Phase 1 (16 experiments)       :p1, 2026-03-04, 2d
    Phase 2 (17 parallel agents)   :p2, 2026-03-06, 1d
    Survey written                 :sv, 2026-03-06, 1d
    DMC metric + task triage       :dm, 2026-03-11, 1d
```
People¶
| Name | Role / Focus |
|---|---|
| Yad | Created this repo (SutroYaro), built the Claude Code autonomous research lab: parallel agent teams, experiment templates, DISCOVERIES.md knowledge accumulation |
| Yaroslav | Sutro Group founder, technical sprints, algorithm work, cybertronai/sutro |
| Emmett | Aster agentic loop framework, 2x energy improvement on microgpt |
| G B | Architecture experiments (depth-1/hidden-64, ARD ~33-35) |
| Germaine | Presentations, implementations |
| Andy Zhang | ML consultant, GitHub contributor (zh4ngx), GF(2) noise experiment, TODO cleanup |
| Michael Keating | Former energy tech CEO (Scoot), Claude-based sparse parity approach |
| Seth | Healthcare AI, satisficing concepts |
| Barak | Modal workflow |
| Jamie Simon | Forward-Forward implementation |
| Jonathan Belay | Deterministic methods, spectral graph theory |
| Anish Tondwalkar | Former Google hardware engineer (inference chips), RL environments startup |
| Uliana Popov | Applied AI, temperature tuning suggestions |
| Josh (Joshua Marks) | Hardware engineer, SRAM/DRAM properties, circuit diagrams |
| Jack Schenkman | Research scientist, EE background, ASIC design |
| Preston Schmittou | 500-parameter transformers, message passing research |
| Caleb Sirak | DIY AI supercomputer ("Howard") |