How we built 58 numpy stubs in 40 hours with one orchestrator and 58 throwaway agents

By Yad Konrad — @0bserver07

On 2026-05-21, Mark Saroufim mentioned this build in his MLSys keynote. This page is the structured map of what actually happened — the orchestrator session, the 58 worker sessions, the 12 waves, the dollar numbers, the two prompts that reshaped the protocol mid-build, the things that went wrong.

“learnings, what worked well, what didn’t, how to repro, push to hacker news and linkedin” — Cosmin Negruseri, who suggested writing this up.

The headline

Between 2026-05-06 23:03 and 2026-05-08 16:16 UTC, one Claude Code session orchestrated 58 worker sessions to implement 58 papers from Jürgen Schmidhuber’s experimental archive (1989–2025). That is about 41 hours from session start to the wave-11 PR merge, including two overnight idle gaps of ~10 hours each, which leaves ~21 active hours of attention. Every stub is pure numpy + matplotlib, deterministic, and runs in under 5 minutes per seed on a laptop. (The orchestrator session stayed open until 2026-05-09 18:58 UTC for follow-up writeup work; that’s the 67.9-hour session span shown on the Sessions page.)
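The spans quoted above follow directly from the three timestamps. A minimal sketch of that arithmetic; the variable names are illustrative, and the two 10-hour idle gaps are the approximate figures from the text, not measured values.

```python
from datetime import datetime, timezone


def hours_between(start: datetime, end: datetime) -> float:
    """Return the elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


session_start = datetime(2026, 5, 6, 23, 3, tzinfo=timezone.utc)
wave11_merge = datetime(2026, 5, 8, 16, 16, tzinfo=timezone.utc)
session_end = datetime(2026, 5, 9, 18, 58, tzinfo=timezone.utc)

build_hours = hours_between(session_start, wave11_merge)
session_hours = hours_between(session_start, session_end)
# Two overnight gaps of roughly 10 hours each (approximate, from the text).
active_hours = build_hours - 2 * 10

print(f"build span:   {build_hours:.1f} h")    # 41.2 h
print(f"session span: {session_hours:.1f} h")  # 67.9 h
print(f"active:       ~{active_hours:.0f} h")  # ~21 h
```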

Metric	Value
Worker sessions	58
Numbered waves	12 (wave 0 sanity + waves 1–10 v1 + wave 11 v1.5)
PRs merged	13 (one per wave) + 1 meta + 1 token-math fix = 15
Total estimated cost	$3,879 at Opus 4.x public pricing
Total tokens	1.13 billion (94.5% cache_read)
Yad-typed prompts to the orchestrator	40 (across ~21 active hours)
Of those, direction-changing	8 (~20%)
Orchestrator assistant turns	1,026
Turns per Yad-typed prompt (orchestrator)	~25.7 : 1
Combined assistant turns (orch + 58 workers)	7,265
Subagent dispatches (Agent tool calls from orchestrator)	73 (58 `general-purpose` builders + 15 `Explore` audits)
Inter-team messages (SendMessage from orchestrator)	69
Bash invocations	190

Note on the prompt count. Earlier drafts of this page said “192 user prompts.” That number was the count of every type=user record in the orchestrator’s JSONL — but 142 of those were worker sessions reporting back to the orchestrator (their summary + idle messages arrive as type=user records in the lead’s transcript). The actual Yad-typed prompts to the lead were 40. The math: 40 Yad + 142 worker-replies + 6 slash commands + 2 skill-loader outputs + 2 redacted = 192 records. Sessions page has the breakdown; Human in the loop has the classification of the 40.
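The record breakdown and the ratios in the table can be re-derived in a few lines. This is a sketch of the arithmetic only; the dictionary keys are illustrative labels, not field names from the JSONL.

```python
# Breakdown of type=user records in the orchestrator transcript,
# using the counts stated in the note above.
user_records = {
    "yad_typed": 40,
    "worker_replies": 142,
    "slash_commands": 6,
    "skill_loader_outputs": 2,
    "redacted": 2,
}

total_records = sum(user_records.values())
human_prompts = user_records["yad_typed"]

orchestrator_turns = 1026   # from the metrics table
direction_changing = 8      # from the metrics table

turns_per_prompt = orchestrator_turns / human_prompts
direction_share = direction_changing / human_prompts

print(total_records)              # 192
print(f"{turns_per_prompt:.2f}")  # 25.65
print(f"{direction_share:.0%}")   # 20%
```

Only about one in five type=user records was typed by a human, which is why the raw record count overstated the prompt count.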

The catalog is at cybertronai.github.io/schmidhuber-problems. The repo is github.com/cybertronai/schmidhuber-problems.

How to read this site

This Build internals section is the receipt: the actual data, not a writeup of it. Ten entry points, depending on what you want:

If you want…	Start here
The narrative arc of the build	This page, then Pivot moments
Specifically what worked and what didn’t	What worked, what didn’t
To run a similar build yourself	How to reproduce
The research finding about manual nudges	Human-in-the-loop as local-minima escape
The 1-orchestrator-to-58-workers mapping	Orchestration map
Per-session numbers (cost, tokens, hops, turns)	Sessions
Where the money went	Cost rollup
The worker prompt template, annotated	Worker prompt anatomy
Drill into a specific wave	Per-wave details
What’s still open	Next phase

What’s load-bearing

Six properties carried the build. Each is documented with its discovery moment and the data in What worked, what didn’t.

One SPEC issue as the contract — issue #1. Every worker prompt links to it.
Pure numpy + matplotlib as the forcing constraint. No torch shortcuts. The constraint did most of the algorithmic-faithfulness work.
One persistent team (TeamCreate × 1), 58 throwaway teammates. Worker prompts inherit the team description; per-worker prompts stay short.
LOCAL-ONLY per-stub branches, one PR per wave (not per stub). After wave 1 we deleted 72 spam branches; from wave 2 on, no per-stub branches were pushed.
Explore audit subagent per wave, read-only, before opening the PR. Cheap; catches inconsistencies (orphan files, format drift) that humans would miss in review.
Batch-merge at the end as the human approval gate. All 13 PRs merged in a 90-second burst on 2026-05-08 15:49–15:50 UTC.
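The control flow these properties imply (build a wave, audit it read-only, emit one PR per wave, leave merging to the human) can be sketched as below. This is a hypothetical illustration, not the orchestrator's actual code. `run_wave`, `run_build`, `WaveReport`, the callables, and the stub names are all invented for the example, and no Claude Code tool is called.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class WaveReport:
    """Stands in for the single PR opened for a wave (hypothetical type)."""
    wave: int
    stubs: list[str]
    findings: list[str]  # audit notes that would go into the PR description


def run_wave(
    wave: int,
    stubs: Iterable[str],
    build_stub: Callable[[str], None],
    audit: Callable[[int, list[str]], list[str]],
) -> WaveReport:
    """Build every stub in the wave, then audit before reporting."""
    names = list(stubs)
    for name in names:
        build_stub(name)           # assumed: one throwaway worker, local-only branch
    findings = audit(wave, names)  # assumed: read-only audit before the PR is opened
    return WaveReport(wave, names, findings)


def run_build(waves, build_stub, audit) -> list[WaveReport]:
    """Run waves in order; the reports wait for a human batch-merge."""
    return [run_wave(i, stubs, build_stub, audit) for i, stubs in waves.items()]


# Usage with placeholder stub names and no-op stand-ins for worker and auditor.
built: list[str] = []
reports = run_build(
    {0: ["sanity-stub"], 1: ["stub-a", "stub-b"]},
    build_stub=built.append,
    audit=lambda wave, names: [],
)
print(len(reports), built)
```

Keeping the worker and the auditor as injected callables mirrors the split described above: the loop itself never decides whether to merge.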

What broke

Six things went wrong during the build. Each is fixable; each is documented in What worked, what didn’t with the timestamp it was caught and the change that landed.

Branch-per-stub got pushed to origin on wave 1 — branch spam. PR #2 closed and reissued as PR #5; protocol switched to LOCAL ONLY.
Workers committed locally and went silent. The orchestrator had to nudge them with explicit “request summary” messages sent via SendMessage. Three workers, in waves 3, 10, and 11, triggered this.
Wave 6 and 7 left orphan problem.py stub files. Caught by the audit agent; lead added a cleanup commit on top of each wave merge.
One commit was authored as agent-pomdp-flag-maze-builder <agent@anthropic.com> — the per-worktree git config was overridden by Claude Code’s session default. Resolved post-merge with a git filter-branch rewrite (74 commits → Yad Konrad).
GitHub Pages deploy failed first try. One API call (gh api -X POST repos/.../pages -F build_type='workflow') and a workflow rerun fixed it.
The first BUILD_NOTES was written from memory and had fabricated counts. PR #20 reissued it from the actual JSONL session log.

The Yad-on-loop pattern

The build ran on a rhythm that’s worth naming. Yad’s own summary from 2026-05-08T16:44:

“we did it again, 780k token, took a little longer, since only paid attention every 18 hour window while i have other things going on”

The 40 Yad-typed prompts were spread across roughly two days. Only ~21 of the 41 wall-clock hours were active attention; the rest was overnight idle. The autonomous loop had to survive two ~10-hour gaps when Yad was asleep. It mostly did.

Of those 40 prompts, 8 were direction-changing (the “Type A” hops in Human in the loop). The other 32 were status checks, small clarifications, and acks. The build was carried by a handful of 1-sentence interventions:

“why are u doing a branch per impl, should it be per waves?? why the branch spam. THIS IS WRONG PRACTICE COURSE CORRECT!” — 2026-05-07 01:31 UTC

“I need you to not rely on me anymore until you finish it all, basically, do wave into 1 per, audit, post to pr then trigger next wave” — 2026-05-07 02:11 UTC

“have we verified thse things to be truely done or left over?” — 2026-05-08 15:42 UTC

Cosmin Negruseri put a name on this pattern about a week after the build:

“I have the feeling I was useful by pinging some wave of agents to do diagnostics and that got the solution out of a local minima.” — 2026-05-14

Sung Jae Bae sharpened it on the day of the MLSys keynote:

“Seems to enter the local minima fairly quickly so adding some skills to creatively explore different directions.” — 2026-05-21

A long-running autonomous agent loop builds internal consistency fast and protects it. Outside perspective is the rare commodity. See Human-in-the-loop for the data-backed version of this claim.

The cost story

Pool	Tokens	$ share
`cache_read`	1,064 M	41%
`cache_write_1h`	47 M	36%
`output`	11 M	21%
`input`	0.2 M	0.1%
`cache_write_5m`	4 M	2%

cache_write_1h was 36% of the bill despite being 4% of the tokens: every cache invalidation cost $30 per million tokens to re-cache. Output, the conventional cost driver, was third. That is a real surprise. If you’re optimizing a long-running orchestration session, watch the 1h cache writes before you worry about output volume. See Cost rollup for the per-pool and per-wave breakdown.
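The shares in the table can be approximated from the token counts and a per-pool price. In this sketch only the $30/M rate for 1h cache writes comes from the text; the other four rates are assumptions chosen to be consistent with it, so treat the output as illustrative.

```python
# Token counts per pool, in millions, from the table above.
tokens_m = {
    "cache_read": 1064,
    "cache_write_1h": 47,
    "output": 11,
    "input": 0.2,
    "cache_write_5m": 4,
}

# USD per million tokens. Only cache_write_1h ($30/M) is stated in the text;
# the remaining rates are assumed for illustration.
price_per_m = {
    "cache_read": 1.50,
    "cache_write_1h": 30.00,
    "output": 75.00,
    "input": 15.00,
    "cache_write_5m": 18.75,
}

cost = {pool: tokens_m[pool] * price_per_m[pool] for pool in tokens_m}
total_cost = sum(cost.values())
total_tokens = sum(tokens_m.values())

# Print pools from most to least expensive.
ranked = sorted(cost.items(), key=lambda kv: kv[1], reverse=True)
for pool, usd in ranked:
    print(
        f"{pool:<15} ${usd:>8,.0f}"
        f"  {usd / total_cost:6.1%} of cost"
        f"  {tokens_m[pool] / total_tokens:6.1%} of tokens"
    )
```

Because the token counts in the table are rounded, the total from this sketch lands near the reported $3,879 but does not match it exactly. The ranking is the point: 1h cache writes come second, ahead of output.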

Per stub: median $41, range $21–$122. The outlier was pipe-6-bit-parity — it hit a tricky LSTM training issue and needed extra turns.

How this catalog will be used

The deliverable isn’t the 58 stubs. The stubs are easy. The deliverable is a reusable agent-orchestration recipe:

A SPEC issue is enough contract for 58 parallel implementers.
LOCAL-ONLY worktrees + one-PR-per-wave eliminates branch spam.
A per-wave audit subagent costs ~3-8% overhead and catches what humans miss in review.
Pure-numpy as a forcing constraint surfaces honest non-replications instead of hiding them.
One human attention window every ~18 hours is sufficient if the autonomous loop is well-specified.

How to reproduce is the recipe in 8 steps. The first run was hinton-problems (53 stubs, the week before, same machinery). This was the second run. The recipe survived both.

What’s still open

v2: ByteDMD instrumentation. Re-measure every stub under data-movement complexity. Tracking issue: #17.
v1.5 follow-ups. Paper-scale reruns + original-simulator follow-ups for the heavyweight-env stubs. Tracking issue: #18.
One honest non-replication — hq-learning-pomdp. The paper’s HQ-vs-flat headline does not reproduce on the 29-cell maze. Implementation faithful; queued for v1.5 with the 62-cell maze.
Trace export + language scrub + autonomy classification. Detailed roadmap at Next phase.

Credits and lineage

Yad Konrad (0bserver07): the orchestrator-driver of this build. The 40 hand-typed prompts in analysis/data/sessions.jsonl are his. Of the other 152 type=user records in the orchestrator’s transcript, 142 are workers reporting back via SendMessage; the remaining 10 are slash commands, skill-loader outputs, and redacted records.
Yaroslav Bulatov: proposed implementing Schmidhuber’s experimental archive at the start of this build; the SPEC’s algorithmic-faithfulness rule is his framing.
Cosmin Negruseri: the local-minima-escape observation; the writeup ask.
Mark Saroufim: referenced this work in his 2026-05-21 MLSys keynote, which prompted Cosmin’s “you should write this up” message.
Hinton-problems builders (2026-05-01 → 2026-05-03): first run of the same recipe. Stubs at cybertronai/hinton-problems.
