SutroYaro: system overview
A structured workspace where coding agents run energy-efficiency research. You point a coding agent at the repo, it reads the context files, runs experiments, and accumulates findings. Multiple people do this independently. The results merge through PRs.
This page describes how the pieces connect. Each piece has its own docs page. This is the map.
What the system does
The Sutro Group wants to find training methods that use less energy. The toy problem is sparse parity (learn XOR from random bits). The real goal is nanoGPT. The workspace lets agents explore the method space, measure energy cost, and record what they find.
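As a concrete picture of the toy problem, here is a minimal sketch of a sparse parity dataset. It assumes each sample is a vector of random bits and the label is the XOR of a small secret subset of them. The function name, parameters, and sizes are illustrative and are not taken from the repo.

```python
import random

def make_sparse_parity(n_bits, secret, n_samples, seed=0):
    """Return (X, y): random bit vectors and the XOR of the bits at the secret indices."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_samples)]
    # XOR of a set of bits equals their sum modulo 2.
    y = [sum(row[i] for i in secret) % 2 for row in X]
    return X, y

# Hypothetical sizes: 20 bits, of which 3 determine the label.
X, y = make_sparse_parity(n_bits=20, secret=[2, 7, 11], n_samples=5)
for row, label in zip(X, y):
    print("".join(map(str, row)), "->", label)
```

The learner sees only `X` and `y`; the difficulty lies in finding which few positions matter among many irrelevant ones.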
Three things happen in the workspace:
Agents run experiments. A coding agent reads CLAUDE.md and DISCOVERIES.md, picks a method, runs it against the locked harness, measures DMC (data movement complexity), and writes up the finding. The two-phase protocol separates raw numbers (results.json) from interpretation (findings.md). LAB.md defines the rules. The run-experiment skill defines the steps.
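The two-phase split can be sketched as two separate writes, where the interpretation step refuses to run until the numbers exist. This is a sketch under assumptions: the helper names and the fields inside results.json are hypothetical, and the real templates are defined in LAB.md.

```python
import json
import tempfile
from pathlib import Path

def write_phase1(exp_dir, metrics):
    """Phase 1: record raw numbers only, with no interpretation."""
    exp_dir = Path(exp_dir)
    exp_dir.mkdir(parents=True, exist_ok=True)
    path = exp_dir / "results.json"
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return path

def write_phase2(exp_dir, analysis):
    """Phase 2: write the interpretation, only after the numbers exist."""
    exp_dir = Path(exp_dir)
    if not (exp_dir / "results.json").exists():
        raise FileNotFoundError("run phase 1 first: results.json is missing")
    path = exp_dir / "findings.md"
    path.write_text(analysis)
    return path

with tempfile.TemporaryDirectory() as tmp:
    # Made-up values, for illustration only.
    write_phase1(tmp, {"method": "example", "dmc": 12345, "seed": 0})
    write_phase2(tmp, "# Findings\n\nInterpretation goes here.\n")
    print(sorted(p.name for p in Path(tmp).iterdir()))
```

Keeping the two files apart means a reviewer can re-read the numbers without inheriting the author's conclusions.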
Agents get evaluated. The eval environment (Gymnasium) wraps the same experiments into a benchmark. An agent picks from 16 methods, observes DMC, and gets graded on 12 categories (72 points). Did it find GF(2)? Did it notice local learning fails? Did it observe the disagreement between the ARD (average reuse distance) and DMC rankings? The answer key has 36 experiments as ground truth.
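To make the pick-and-observe loop concrete, here is a toy stand-in that mimics the shape of a Gymnasium environment (`reset` returns an observation and info, `step` returns five values) without importing the library. The class, the observation layout, and the reward are assumptions for illustration; only the two DMC figures come from this page, and the real SparseParity-v0 exposes 16 methods.

```python
import math

# DMC figures quoted on this page; everything else here is hypothetical.
METHOD_DMC = {"gf2": 8_607, "fourier": 78_000_000_000}

class ToyParityEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step shape."""

    def __init__(self, max_steps=3):
        self.methods = sorted(METHOD_DMC)
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, seed=None):
        self.steps = 0
        return {"methods": self.methods}, {}

    def step(self, action):
        method = self.methods[action]
        dmc = METHOD_DMC[method]
        self.steps += 1
        obs = {"method": method, "dmc": dmc}
        reward = -math.log10(dmc)  # assumed reward: lower DMC scores higher
        terminated = False
        truncated = self.steps >= self.max_steps
        return obs, reward, terminated, truncated, {}

env = ToyParityEnv()
obs, info = env.reset()
best = None
for action in range(len(obs["methods"])):
    result, reward, terminated, truncated, _ = env.step(action)
    if best is None or result["dmc"] < best["dmc"]:
        best = result
    if terminated or truncated:
        break
print(best)
```

This loop tries every method, which is exactly what the efficiency category is meant to discourage; a stronger agent would use prior findings to choose its first few actions.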
Humans review and merge. Contributors submit PRs. The reviewing agent checks locked files, findings format, and log classification. If the result is significant, it gets a changelog entry. If not, it still merges. The weekly catch-up summarizes what happened across Telegram, Google Docs, and GitHub.
How the pieces connect
CLAUDE.md ──────────────────────────── Agent reads this first
│                                      (problem context, methods table,
│                                      current best results)
│
├── LAB.md ─────────────────────── Experiment protocol
│   └── Two-phase results          Phase 1: results.json (numbers)
│       Reproducibility rules      Phase 2: findings.md (analysis)
│       Metric isolation           Don't edit tracker.py/harness.py
│
├── DISCOVERIES.md ─────────────── What's proven (read before every experiment)
│   └── 36 experiments             DMC rankings, failure modes, scaling walls
│
├── AGENT.md ───────────────────── Autonomous loop protocol
│   └── Pick hypothesis            From TODO.md or questions.yaml
│       Run against harness        search_space.yaml bounds
│       Classify result            WIN / PARTIAL / LOSS
│       Log to log.jsonl           Append-only
│
├── .claude/ ───────────────────── Claude Code specific
│   ├── hooks/                     session-start (status), security-guard
│   │                              (locked files), session-end (summary)
│   ├── rules/                     Reproducibility, agent coordination
│   ├── skills/                    run-experiment, weekly-catchup,
│   │                              prepare-meeting
│   └── settings.json              Hook configuration
│
├── src/sparse_parity/eval/ ────── Eval environment
│   ├── env.py                     Gymnasium: SparseParity-v0
│   ├── grader.py                  12 categories, 72 points
│   ├── registry.py                Add methods without editing env.py
│   ├── backends.py                Local / Modal / Remote
│   ├── answer_key.json            36 experiments as ground truth
│   └── adapters/                  Anthropic, PrimeIntellect, HuggingFace
│
└── Automation ─────────────────── Sync and reporting
    ├── sync_telegram.ts           Pull Telegram messages
    ├── sync_google_docs.py        Pull Google Docs to markdown
    ├── docs/catchups/             Weekly summaries
    └── docs/sessions/             Video transcripts and chapters
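The last two steps of the AGENT.md loop in the diagram (classify the result, then append it to log.jsonl) can be sketched as follows. The classification rule and the record fields are hypothetical, since the actual WIN / PARTIAL / LOSS criteria are not given on this page; only the append-only, one-object-per-line shape is taken from the diagram.

```python
import json
import tempfile
from pathlib import Path

def classify(solved, dmc, best_dmc):
    """Hypothetical rule: the real criteria are not stated on this page."""
    if not solved:
        return "LOSS"
    return "WIN" if dmc < best_dmc else "PARTIAL"

def append_log(log_path, record):
    """Append one JSON object per line. Mode "a" never rewrites earlier lines."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")

def read_log(log_path):
    """Parse the log back into a list of dicts, skipping blank lines."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "log.jsonl"
    # Made-up runs, for illustration only.
    for solved, dmc in [(True, 900), (True, 1500), (False, 400)]:
        append_log(log, {"dmc": dmc, "result": classify(solved, dmc, best_dmc=1000)})
    entries = read_log(log)
    print([e["result"] for e in entries])
```

An append-only log keeps losses on record alongside wins, so later agents can see which directions have already been tried.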
The two costs
Yaroslav's constraint: experiments should run within 1980s compute budgets. Under 1 second, ideally under 10ms.
The workspace measures two costs:
Experiment cost. How much data does the method move? DMC (data movement complexity) captures this. GF(2) at DMC 8,607 is about 9 million times cheaper than Fourier at 78 billion. Four of the 16 methods run in under 10 ms. The eval environment tracks this per step.
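The ratio between the two quoted DMC figures can be checked directly. The Fourier value is taken as exactly 78 billion here, which is an assumption, since the page gives it only in rounded form.

```python
gf2_dmc = 8_607
fourier_dmc = 78_000_000_000  # "78 billion" as quoted; the exact figure may differ

ratio = fourier_dmc / gf2_dmc
print(f"Fourier / GF(2) DMC ratio: {ratio / 1e6:.2f} million")
```

The result is a little over 9 million, which matches the figure quoted in the text.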
Agent cost. How many experiments did the agent need? How many tokens of reasoning? The eval environment's efficiency category awards 5 points for finding the best method in 3 steps, but we don't yet track token cost or the agent's thinking time between steps. The goal is to measure both: can an agent find GF(2) with few tokens and little wall-clock time, without trying all 16 methods?
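One possible shape for the missing agent-cost tracking is a small per-episode ledger. Everything in this sketch is hypothetical: the class, its fields, and the idea of measuring time as the wall-clock gap between recorded steps are assumptions, not something the eval environment does today.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentCost:
    """Hypothetical per-episode ledger; not part of the eval environment."""
    steps: int = 0
    tokens: int = 0
    seconds_between_steps: float = 0.0
    found_best_at: Optional[int] = None
    _last: float = field(default_factory=time.perf_counter, repr=False)

    def record_step(self, tokens_used, is_best):
        """Add one experiment: tokens spent, elapsed time, and whether it hit the best method."""
        now = time.perf_counter()
        self.seconds_between_steps += now - self._last
        self._last = now
        self.steps += 1
        self.tokens += tokens_used
        if is_best and self.found_best_at is None:
            self.found_best_at = self.steps

cost = AgentCost()
cost.record_step(tokens_used=1200, is_best=False)  # made-up token counts
cost.record_step(tokens_used=800, is_best=True)
print(cost.steps, cost.tokens, cost.found_best_at)
```

Recording the step at which the best method first appears keeps the existing step-based efficiency score recoverable while adding the token and time dimensions.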
Who does what
The coding agent reads CLAUDE.md, runs experiments, writes findings. It doesn't need to know about the eval environment or the hooks. Those are layers on top.
The hooks (Claude Code only) fire automatically. Session-start shows status so the agent doesn't ask "catch me up." The security guard blocks edits to measurement code. Session-end shows what changed. Other coding agents (Gemini, Codex) skip hooks and read CLAUDE.md directly.
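The core check behind a security guard of this kind can be sketched as a pure function over file paths. The two locked names come from the diagram's "Don't edit tracker.py/harness.py"; the function names are hypothetical, the real guard may cover more paths, and how the check is wired into a hook is left out.

```python
from pathlib import PurePosixPath

# Locked names taken from the diagram; the real guard may cover more files.
LOCKED_NAMES = {"tracker.py", "harness.py"}

def is_locked(path):
    """True if an edit to `path` should be refused."""
    # Normalize Windows separators so the file name is extracted consistently.
    return PurePosixPath(path.replace("\\", "/")).name in LOCKED_NAMES

def guard(paths):
    """Return the subset of proposed edits that touch locked files."""
    return [p for p in paths if is_locked(p)]

# Hypothetical paths, for illustration only.
proposed = ["some/dir/tracker.py", "docs/findings.md", "harness.py"]
print(guard(proposed))
```

Matching on file name alone is deliberately strict: it also blocks a same-named file in another directory, which is the safer failure mode for measurement code.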
The eval environment is a different mode. Instead of the agent writing experiment code, it picks from 16 existing methods and observes results. The grading checks whether it made specific discoveries. This can run locally, through Anthropic tool calls, on PrimeIntellect, or as a HuggingFace leaderboard.
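The diagram notes that registry.py lets you add methods without editing env.py. A common way to get that property is a decorator-based registry, sketched below. This is a generic pattern under assumptions: the decorator, the registry dict, and both example methods are hypothetical and may not match the repo's actual interface.

```python
METHODS = {}

def register(name):
    """Decorator that adds a method to the registry under `name`."""
    def wrap(fn):
        if name in METHODS:
            raise ValueError(f"method {name!r} already registered")
        METHODS[name] = fn
        return fn
    return wrap

@register("always_zero")  # hypothetical baseline that ignores its input
def always_zero(bits):
    return 0

@register("xor_all")  # hypothetical method: parity of every bit
def xor_all(bits):
    return sum(bits) % 2

def run(name, bits):
    """Look a method up by name and call it, as an environment would."""
    return METHODS[name](bits)

print(sorted(METHODS))
print(run("xor_all", [1, 0, 1, 1]))
```

Because the environment only looks names up in the registry, a new method is a new decorated function in its own module rather than a change to the environment code.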
The weekly catch-up syncs Telegram, Google Docs, and GitHub into a summary page. The prepare-meeting skill compiles recent experiments into a report.
Contributors fork the repo, run experiments, submit PRs. The reviewing agent checks the PR against the experiment protocol. See the changelog for merged contributions.
What each file is for
| File | Who reads it | What it does |
|---|---|---|
| CLAUDE.md | All agents | Problem context, methods table, best results |
| LAB.md | Agents running experiments | Protocol, rules, templates |
| AGENT.md | Autonomous agent loop | Pick hypothesis, run, classify, log |
| AGENT_EVAL.md | Agents using the eval env | How to add methods, run evals, read grading |
| DISCOVERIES.md | Everyone, before every experiment | Proven facts, open questions |
| TODO.md | Agents looking for work | Hypothesis queue |
| .claude/settings.json | Claude Code | Hook configuration |
| .claude/rules/*.md | Claude Code | Reproducibility, coordination constraints |
| .claude/skills/*/SKILL.md | Claude Code | Workflow definitions |
What still needs to happen
This workspace handles sparse parity. The path to nanoGPT requires:
- Actual GPU energy measurement (Issue #6: compare DMC vs ARD vs real joules on an H100)
- A harder problem (nanoGPT character-level training, Issue #9)
- Agent cost tracking (tokens per discovery, thinking time between steps)
- PR review automation (Issue #54: webhook + Claude Code + Telegram approval)
The workspace itself is the product. Improving how agents find better algorithms is progress along Yaroslav's "process" axis, independent of which problem they're solving.
Docs map
| Page | What it covers |
|---|---|
| Eval environment | Gymnasium env, grading, adapters, baselines |
| Agent infrastructure | Hooks, rules, skills, V2 diagram |
| Adding a challenge | Step-by-step for new problems |
| Adding an eval challenge | Registry, answer key, baselines |
| Survey | All 33 experiments ranked |
| Context | Group history, timeline, people |
| Peer research protocol | Multi-researcher workflow |