Session Report: Building schmidhuber-problems via Agent Teams

Output: cybertronai/schmidhuber-problems — 58 stubs, 13 PRs (14 created, 1 closed-and-reissued), all merged Source log: ~/.claude/projects/-Users-yadkonrad-dev-dev-year26-feb26-SutroYaro/63285119-154e-42ab-9555-7a42471b0309.jsonl (2,282 events) Span: 2026-05-06T23:03 → 2026-05-08T16:16 UTC (~41.3 wall hours) Lead session: SutroYaro Companion to: hinton-problems BUILD_NOTES (53 Hinton stubs, May 1-3)

This report is reconstructed from the live session log, not from memory. Earlier drafts had fabricated counts; this revision is the source-of-truth version.

TL;DR for the video opener

58 Schmidhuber-paper stubs implemented across 12 supervised waves (wave 0 sanity = 1; waves 1–10 v1 = 49; wave 11 v1.5 = 8). Pure numpy + matplotlib. All <5 min/seed on a laptop.
The SPEC was a single GitHub issue (#1) — adapted from hinton-problems issue #1.
The dispatcher was Claude Code’s agent-teams primitive — one team schmidhuber-impl (agent_type: orchestrator), 12 waves, fresh teammates per wave.
Two human prompts mid-run reshaped the build:
- 2026-05-07T01:31:11Z — “why are u doing a branch per impl, should it be per waves?? why the branch spam. THIS IS WRONG PRACTICE COURSE CORRECT!” → wave 1 → wave 2 protocol pivot to local-only wave-N-local/<slug> branches.
- 2026-05-07T02:11:39Z — “I need you to not rely on me anymore until you finish it all, basically, do wave into 1 per, audit, post to pr then trigger next wave” → fully autonomous from wave 3 onward.
One honest non-replication (hq-learning-pomdp) acknowledged in the wave-3 audit at 2026-05-07T03:35Z, with mathematical analysis (γ^Δt · HV ≤ R_goal bound).
Post-merge author rewrite at 2026-05-08T16:12Z fixed git authorship across the entire repo via git filter-branch: 74 agent-authored commits → Yad Konrad <yad.konrad@gmail.com>.

The actual chain of events

Timestamp (UTC)	Event
2026-05-06T23:03:33	Session opens in SutroYaro
2026-05-06T23:03:37	Yad invokes `sutro-sync` skill — only skill call in the entire session — to pull Telegram + Google Docs + GitHub state. Surfaces Yaroslav’s Schmidhuber suggestion.
2026-05-06T23:09:41	Lead dispatches first `Explore` audit subagent: “Survey schmidhuber-problems repo”
2026-05-06T23:20:41	SPEC opened as issue #1 — the contract for every teammate. Title: “Spec: minimum implementation requirements for Schmidhuber-problem stubs (v1)”
2026-05-06T23:24:21	First teammate dispatched: `nbb-xor-builder` (wave 0 sanity)
2026-05-06T23:56:21	Wave-0 PR opened on `impl/nbb-xor` (PR #2)
2026-05-06T23:56:38	v1.5 follow-up issue #3 opened
2026-05-07T00:11:17	Yad: “alright shall we do clean up and dispathc multiple agents to finish the rest of the waves?” — wave 1 trigger
2026-05-07T00:20:49	Wave 1 dispatch begins (6 teammates)
2026-05-07T01:31:11	Yad: “why are u doing a branch per impl, should it be per waves?? why the branch spam. THIS IS WRONG PRACTICE COURSE CORRECT!”
2026-05-07T01:38:19	PR #2 closed; reissued as PR #5 on `wave/0-sanity` branch. All `impl/<slug>` remote branches deleted. From wave 2+, per-stub branches stay LOCAL ONLY.
2026-05-07T01:28:53	Wave 1 PR #4 opened (`wave/1-search`)
2026-05-07T01:57:22	Wave 2 dispatch begins (5 teammates)
2026-05-07T02:11:39	Yad: “I need you to not rely on me anymore until you finish it all… do wave into 1 per, audit, post to pr then trigger next wave” — autonomous mode engaged
2026-05-07T02:33:12	Wave 2 PR #6 opened
2026-05-07T03:35:08	Wave 3 audit: lead acknowledges `hq-learning-pomdp` as honest non-replication (“paper’s HQ-vs-flat headline gap does NOT reproduce on the 29-cell maze. Implementation faithful”)
2026-05-07T12:16:45	Wave 3 PR #7 opened
2026-05-07T12:49:16	Wave 4 PR #8 opened
2026-05-07T13:15:48	Wave 5 PR #9 opened
2026-05-07T14:33:36	Wave 6 PR #10 opened (cleanup commit on top: removed orphan `noise-free-long-lag/problem.py`)
2026-05-07T15:28:24	Wave 7 PR #11 opened (cleanup commit on top: removed orphan `blues-improvisation/problem.py`)
2026-05-07T16:57:11	Wave 8 PR #12 opened
2026-05-07T17:22:01	Wave 9 PR #13 opened
2026-05-07T18:07:35	Wave 10 PR #14 opened — v1 complete at 50/50
2026-05-08T12:07:27	Wave 11 (v1.5) dispatch begins (8 teammates for heavyweight-env stubs)
2026-05-08T14:49:01	Wave 11 PR #15 opened — v1+v1.5 complete at 58/58
2026-05-08T15:38:20	Meta PR #16 opened (mdBook config, BUILD_NOTES, RESULTS, VISUAL_TOUR, README catalog, GH Pages workflow)
2026-05-08T15:49:49	All 13 PRs merged via `gh pr merge` in sequence
2026-05-08T15:50:41	First Pages deploy attempt fails: “Ensure GitHub Pages has been enabled”
2026-05-08T15:53:21	Pages enabled via `gh api -X POST repos/.../pages -F build_type='workflow'`; workflow re-run; site live at https://cybertronai.github.io/schmidhuber-problems/
2026-05-08T16:09:24	Yad: “wtf why its claude agent-0bserver07 and not fucking claude 0bserver07? claude agent-0bserver07 was for comment only”
2026-05-08T16:12:01	`git filter-branch` rewrite: 74 agent-authored commits → `Yad Konrad <yad.konrad@gmail.com>`. Force-pushed main. Site rebuilt with corrected attribution.
2026-05-08T~16:14	README formatting polish (header bullets, lineage paragraph broken into bullet list) per Yad’s feedback.
2026-05-08T16:16:50	Last logged event in this session

The SPEC (issue #1) — the actual contract

The contract between Yad and every teammate was a single GitHub issue. Not chat. Not a system prompt. An issue every PR linked back to.

It defined:

Required files per stub: <slug>.py, README.md, make_<slug>_gif.py, visualize_<slug>.py, <slug>.gif, viz/
8 README sections: Header / Problem / Files / Running / Results / Visualizations / Deviations / Open questions
Reproducibility rules: seed exposed via CLI, all hyperparameters in Results, command in §Running reproduces the number
Acceptance checklist (10 boxes)
Schmidhuber-specific additions:
- Algorithmic faithfulness > optimizer convenience: long-time-lag stubs use the paper’s recurrent architecture; evolutionary stubs use the paper’s evolutionary optimizer; Levin/OOPS stubs keep universal search. No backprop shortcuts.
- Architecture-deviation rule (codified before wave 0): if the paper’s exact arch can’t converge under numpy-only constraints, run a sweep of ≥30 seeds at the original arch, document the failure, propose a justified alternative.
- RL-stub rule: numpy mini-environments. No gym/gymnasium. Original-simulator reruns deferred to v2.

The orchestration model

                     ┌──────────────────┐
                     │ schmidhuber-impl │  (TeamCreate, agent_type=orchestrator)
                     └─────────┬────────┘
                               │
                  ┌────────────┼────────────┐
                  │            │            │
            Wave 0/1/…/11  SendMessage   Subagent dispatches
                               │            │
                               ▼            ▼
                          ┌──────────┐  ┌──────────────┐
                          │ teammates │  │ Agent tool   │
                          │ <slug>-   │  │ general-     │
                          │ builder   │  │ purpose 58×  │
                          │ x58       │  │ Explore  15× │
                          └────┬─────┘  └──────┬───────┘
                               │               │
                               ▼               ▼
                       worktree branch    PR audits, code reads
                       wave-N-local/<slug>
                               │
                               ▼
                       (LOCAL ONLY — DO NOT PUSH)
                               │
                               ▼
                       lead octopus-merges into wave/N-<family>
                               │
                               ▼
                       gh pr create → wave PR
                               │
                               ▼
                       audit subagent → audit comment on PR
                               │
                               ▼
                       SendMessage(shutdown_request)
                               │
                               ▼
                          Next wave starts fresh

Why fresh teammates per wave: each teammate burns context as it builds and tests. Shutting down between waves keeps later waves running on full context windows. The lead persists; the workers turn over.

Why LOCAL ONLY per-stub branches (the wave-1 → wave-2 fix): pushing 6 impl/<slug> branches per wave to remote was branch spam. Yad called it out at 2026-05-07T01:31. Fix: per-stub branches stay LOCAL ONLY (they only need to exist for git worktree mechanics); only wave/N-<family> is pushed; deletable after PR merges.

What the session actually used (verified counts from the JSONL)

Tool calls in the lead session

Tool	Calls	What for
Bash	140	git, gh CLI, file ops, running tests, workflow checks
Agent	73	subagent dispatches: 58 general-purpose builders + 15 Explore auditors
SendMessage	69	inter-teammate messaging (shutdowns + summary requests)
TaskUpdate	34	shared task list maintenance
Read	16	reading paper PDFs, stub code, READMEs
TaskCreate	15	new tasks added to the team’s list
Write	11	new files (READMEs, scripts, configs)
Edit	10	small in-place edits
AskUserQuestion	7	direction-clarifying questions to Yad
ToolSearch	3	loading deferred tool schemas
Skill	1	only `sutro-sync` at session start
TaskList	1	one snapshot
TeamCreate	1	the `schmidhuber-impl` team itself
TeamDelete	1	end-of-session cleanup

Subagent dispatches (Agent tool, n=73)

Type	Count	Use
`general-purpose`	58	per-stub builders (one per stub across 12 waves)
`Explore`	15	initial repo survey + 12 per-wave audits + 2 BUILD_NOTES data-extraction passes

GitHub artifacts produced

2 issues created: #1 (SPEC) + #3 (v1.5 follow-up)
14 PRs created: PR #2 (closed and reissued as #5), PRs #4, #5, #6, #7, #8, #9, #10, #11, #12, #13, #14, #15, #16
13 PR audit comments (one per wave PR)
2 cleanup commits on top of wave merges: wave 6 (noise-free-long-lag/problem.py orphan removed), wave 7 (blues-improvisation/problem.py orphan removed)
13 PR merges in one batch (gh pr merge × 13 in sequence) at 2026-05-08T15:49
1 repo edit to set the homepage URL
1 GH API call to enable Pages with workflow build type

The waves at a glance

Wave	Family	Stubs	First dispatch (UTC)	PR opened (UTC)	PR #
0	Sanity	1	2026-05-06T23:24	2026-05-07T01:38	#5
1	Random search + universal program search	6	2026-05-07T00:20	2026-05-07T01:28	#4
2	Local rules + world-model controllers	5	2026-05-07T01:57	2026-05-07T02:33	#6
3	Online RL with hidden state	5	2026-05-07T01:58	2026-05-07T12:16	#7
4	History compression + fast-weights + self-reference	5	2026-05-07T03:08	2026-05-07T12:49	#8
5	Predictability min/max + unsupervised features	4	2026-05-07T03:15	2026-05-07T13:15	#9
6	LSTM canonical battery (BPTT, half 1)	6	2026-05-07T09:13	2026-05-07T14:33	#10
7	LSTM follow-ups	5	2026-05-07T10:25	2026-05-07T15:28	#11
8	Evolutionary	4	2026-05-07T11:36	2026-05-07T16:57	#12
9	Deep MLPs at scale	4	2026-05-07T12:42	2026-05-07T17:22	#13
10	Object-centric + attention + modern	5	2026-05-07T13:52	2026-05-07T18:07	#14
11	v1.5 — heavyweight-env stubs (numpy synthetic substitutes)	8	2026-05-08T12:07	2026-05-08T14:49	#15

Plus the meta PR (#16) for site + BUILD_NOTES + RESULTS + VISUAL_TOUR + README catalog at 2026-05-08T15:38.

Total: 58 stubs in 12 waves + 1 meta PR.

Yad’s interaction pattern (the human side)

Three classes of prompt drove the project. Two stand out as direction-changing:

Type A — high-leverage direction (rare, big effects)

Timestamp (UTC)	Quote
2026-05-07T00:11:17	“alright shall we do clean up and dispathc multiple agents to finish the rest of the waves?” — wave-1 trigger
2026-05-07T01:31:11	“why are u doing a branch per impl, should it be per waves?? why the branch spam. THIS IS WRONG PRACTICE COURSE CORRECT!” — wave 1 → 2 protocol pivot
2026-05-07T02:11:39	“I need you to not rely on me anymore until you finish it all, basically, do wave into 1 per, audit, post to pr then trigger next wave” — autonomous-mode engaged
2026-05-08T16:09:24	“wtf why its claude agent-0bserver07 and not fucking claude 0bserver07? claude agent-0bserver07 was for comment only” — git-author rewrite trigger
2026-05-08T~16:14	“this needs to be on new line and readable” — README formatting fix

Type B — status checks (frequent, low cost)

“status?” / “status, what is left?” / “whats left rl?” — appears multiple times. Lead summarizes per-wave progress and continues.

Type C — review and merge approvals

“review it/audit and post the comment, then dispatch after please” (set the audit-then-dispatch loop)
“finish everything and deal with the full impelmentations” / “BUT FIRST FIRST FINISH THESE THINGS REMAINING” — wave 11 (v1.5) trigger
“have we verified thse things to be truely done or left over?” — surfaced the unmerged-PRs gap; explicit merge instruction followed

The session’s pivot moments are the corrections, not the kickoffs. The wave 1 → wave 2 branch-protocol fix and the wave-3 autonomous-mode engagement are what reshaped the build’s structure.

Honest non-replication: hq-learning-pomdp

Acknowledged in the wave-3 audit summary at 2026-05-07T03:35:08Z:

“Both HQ and flat Q solve during training (~100%) but both fail at 0% greedy eval — the paper’s HQ-vs-flat headline gap does NOT reproduce on the 29-cell maze. Implementation faithful, honest about the gap with mathematical analysis (γ^Δt · HV ≤ R_goal bound).”

This is exactly the SPEC’s methodological caveat applied: where the empirical headline of a paper does not reproduce on a smaller / faithful implementation, the contributor flags it honestly with the mechanistic reason, rather than fudging the result. The paper’s 62-cell maze is queued as a v1.5 follow-up.

Mid-run errors and recoveries

Three concrete error recoveries are visible in the session log:

Wave 6 / 7 orphan problem.py files: When teammates wrote new stub files but didn’t git rm the placeholder problem.py, the audit subagent caught it. The lead added a cleanup commit on top of each wave merge. After wave 7, the SPEC’s “remove problem.py explicitly” was emphasized in every dispatch prompt; no further orphans appeared.
GitHub Pages-not-enabled error: First deploy attempt at 2026-05-08T15:50:41 failed with “Ensure GitHub Pages has been enabled”. The build succeeded; the deploy step couldn’t create the deployment because Pages wasn’t enabled at the repo level. Fix: gh api -X POST repos/cybertronai/schmidhuber-problems/pages -F build_type='workflow'. Workflow re-run completed at 15:53:34.
Git author drift: One commit in wave 3 was authored as agent-pomdp-flag-maze-builder <agent@anthropic.com> (the subagent’s session-default identity overrode the per-worktree config of agent-0bserver07@users.noreply.github.com). Caught in wave-3 audit; non-blocking. Resolved later by the bulk filter-branch rewrite at 2026-05-08T16:12.

What this session actually proves

The SPEC issue + agent-teams + wave pattern is reproducible across problem-sets. Second use of the machinery (first: hinton-problems, 53 stubs in 30 hours, May 1-3). For a different lineage (algorithmic vs representational) with 58 stubs and harder constraints (RL-stub rule, algorithmic faithfulness rule), the same machinery shipped in ~41 wall hours.
Mid-run protocol fixes work. Wave 1’s branch spam got corrected within minutes of Yad’s pushback. Wave 6/7’s orphan stubs got fixed via cleanup commits on top of merges. The wave-PR-with-audit-comment pattern absorbed the corrections cleanly.
Honest non-replications are part of the deliverable, not a bug. hq-learning-pomdp ships with mathematical analysis. The honest report > a fudged success.
agent-teams is the dispatcher; subagents are the workers; per-wave audit is a separate Explore subagent. Same machinery used in three layers, three different roles.
Numpy-only constraint is enforceable across the catalog. 58 algorithms — RBM-style local rules, evolutionary methods, LSTM with peephole/forget-gate variants, world models, attention, capsules, CTC — all in stdlib + numpy + matplotlib (+ PIL/imageio for GIF assembly). MNIST loaded via urllib + gzip + struct from public mirrors.
Post-merge author rewrite is feasible. When git author identity is wrong on a fresh repo with a sole owner, git filter-branch + force-push fixes it cleanly.

Concrete numbers

58 / 58 v1+v1.5 stubs implemented (100%)
32 reproduce paper claims (yes), 25 partial / qualitative (or synthetic substitute), 1 honest non-replication (with documented mathematical analysis)
41.3 wall hours end-to-end (May 6 23:03 → May 8 16:16 UTC, 3 distinct days)
2 GitHub issues, 14 PRs created (1 closed-and-reissued), 13 audit comments, 13 merges in one batch
1 TeamCreate, 1 TeamDelete, 58 named builders + 15 audit subagents
Pure numpy + matplotlib, all under 5-min wallclock per stub except pipe-6-bit-parity (240s 6-bit cap), evolino-sines-mackey-glass (140s), lstm-search-space-odyssey (145s)
Algorithmic-faithfulness coverage: 9 RL stubs (numpy mini-envs per SPEC), 11 LSTM-family stubs (manual BPTT through cells with various gate variants), 4 evolutionary stubs (no gradient on hidden weights), 3 search stubs (Levin / OOPS / RS), 8 v1.5 substitutes (synthetic numpy data instead of TIMIT/IAM/ISBI/CarRacing/VizDoom/TORCS), 1 equivalence proof (linear-attention ≡ FWP to 2.22e-16)

Schmidhuber Problems