Status Update: Apr 30 to May 20, 2026

First catch-up in three weeks. The headline: the lab's research activity has moved almost entirely out of SutroYaro into a constellation of sibling repos. Four challenges now run in parallel, two of them launched inside this window, and a new energy-efficient language-modeling task got its own repo. SutroYaro itself stayed quiet (one merged PR), which matches its scoped role as lab memory rather than research front.

Sync Status

Source | Result
Telegram | All 11 forum topics now synced (was 6). 385 new messages. Five topics backfilled for the first time: challenge #2, challenge #3, wikitext, Pitch / Talking Points, makemore task results.
Google Docs | 17 tracked docs refreshed, all current. No new meeting docs since Meeting #9 (16 Mar).
GitHub | SutroYaro: 6 open issues, 0 open PRs. The work is in the sibling repos (table below).

The Telegram sync had been missing five topics because both sync scripts carried a hardcoded six-topic list. I updated src/telegram/sync.ts and sync_telegram.ts to cover all 11 forum topics. This was a flagged action item from the last catch-up.
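
One way to keep a hardcoded list from going stale silently is to compare it against the topics the forum actually reports and fail loudly on a mismatch. The sketch below is in Python rather than the scripts' TypeScript. Every name in it (OLD_TRACKED, find_untracked, check_coverage) is hypothetical, the use of topic titles as keys is an assumption, and it does not show how the real scripts enumerate topics.

```python
# Hypothetical sketch, not the contents of src/telegram/sync.ts.
# Assumes the sync can obtain the set of topic titles the forum currently exposes.

# The six topics that were already synced before this catch-up.
OLD_TRACKED = frozenset({
    "chat-yad",
    "General",
    "chat-yaroslav",
    "In-person meetings",
    "challenge #1",
    "Introductions",
})

# The five topics that had never been synced.
BACKFILLED = frozenset({
    "challenge #2",
    "challenge #3",
    "wikitext",
    "Pitch / Talking Points",
    "makemore task results",
})


def find_untracked(tracked, discovered):
    """Return discovered topics that are missing from the tracked list."""
    return sorted(set(discovered) - set(tracked))


def check_coverage(tracked, discovered):
    """Raise instead of silently skipping topics the list does not know about."""
    missing = find_untracked(tracked, discovered)
    if missing:
        raise RuntimeError(f"{len(missing)} forum topics are not synced: {missing}")


discovered_now = OLD_TRACKED | BACKFILLED
untracked = find_untracked(OLD_TRACKED, discovered_now)
```

With the old six-topic list, untracked holds exactly the five backfilled topics, and check_coverage would have raised on the first run after a new topic appeared.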

The Shape of the Lab Now

Six weeks ago the work was one challenge (sparse parity) in two repos. It is now four challenges across a repo constellation:

  • Challenge #1, sparse parity (original, ByteDMD metric): dormant.
  • Challenge #2, energy-efficient matmul (launched Apr 30): sutro-problems/matmul.
  • Challenge #3, sparse parity on the grid (launched May 8): sutro-problems/sparse-parity.
  • wikitext, energy-efficient language modeling (effectively Challenge #4): cybertronai/wikitext, its own repo.

Alongside the challenges, two stub catalogs shipped: hinton-problems (53 stubs) and schmidhuber-problems (58 stubs). Both are inputs to a planned filtering process that picks the next competition task.

Per-Channel Breakdown

chat-yad (117 messages, the most active channel)

Three threads ran here.

  1. hinton-problems build. Yaroslav asked for reference implementations of all of Hinton's learning problems, on the hypothesis that minimizing data movement (reuse distance) would push agents toward methods other than backprop. Yad ran a multi-agent wave build and shipped all 53 stubs to cybertronai/hinton-problems on May 3 across 10 wave-level PRs. 27 reproduce paper claims, 25 partial, 1 non-replication. Catalog (RESULTS.md), build notes, a video overview, and a writeup followed.
  2. schmidhuber-problems build. Yaroslav created a matching stub set; Yad ran the same swarm and shipped 58 stubs on May 8 across 12 wave PRs. 32 reproduce, 25 partial, 1 non-replication, roughly 41 hours wall time. The token count was corrected in public from an initial 780k (a UI display, not a meter) to roughly 1.15 billion actual tokens.
  3. The reshuffle. SutroYaro issue #96 proposed stripping the repo to its lab-memory role. Seth's agent concluded that only sparse-parity-challenge needed changes: PR #40 there migrated the research code in, and PR #97 here removed the duplicates. PR #97 merged May 14.

Recurring meta-point from Yaroslav: research can be run in batches of ideas rather than one at a time, and because agents produce unlimited text while human input stays sparse, the human contribution should be captured and prioritized in every PR.

wikitext (new channel, 127 messages since May 6)

The newest challenge: energy-efficient language modeling on WikiText-103, led by new member Armins.

  • Infrastructure moved from Lambda Labs to Modal, with GPU energy measured via NVML. The baseline submission, modded_nanogpt, runs in 322.7 s and uses 54,784 J at 0.7285 character accuracy.
  • A task-scale ladder emerged: matmul as the fast-iteration end, WikiText-103 as the "final" task, with Shakespeare and TinyStories as intermediate steps toward FineWeb.
  • Two tracks: applied (real GPUs, NVML energy, leaderboard) and theory (local learning on a 2D grid). Apple Silicon / MPS came up as a possible applied target.
  • Objective framing: hold accuracy and wall-clock time fixed, shrink Joules.
  • The "wip" prefix was dropped on May 11, and the task moved to its own repo, cybertronai/wikitext, on May 12, the day of Yaroslav's AI Council talk.
  • Open as of today: new member Gabriel found the energy meter only counts GPU energy through NVML, so CPU-heavy submissions can hide work. The proposal is to add CPU package energy (RAPL) and measure total system energy. Yaroslav's position is no device constraints, just total system energy under time and accuracy bounds.
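
The objective and the metering gap above can be made concrete with a small scoring sketch. Total energy is a GPU term plus a CPU term, and a run only counts if it stays within the time and accuracy bounds. The GPU figures for modded_nanogpt come from the baseline above. The CPU figure, the bounds, the second run, and all names (Run, counter_delta_joules, rank_by_energy) are hypothetical, and this is not the leaderboard's scorer. The wraparound handling assumes a RAPL-style cumulative microjoule counter that returns to zero after reaching a maximum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    wall_seconds: float
    gpu_joules: float  # e.g. read from an NVML energy counter
    cpu_joules: float  # e.g. read from a RAPL package counter
    accuracy: float

    @property
    def total_joules(self) -> float:
        return self.gpu_joules + self.cpu_joules

    @property
    def mean_watts(self) -> float:
        return self.total_joules / self.wall_seconds


def counter_delta_joules(start_uj: int, end_uj: int, max_range_uj: int) -> float:
    """Energy between two readings of a cumulative microjoule counter.

    Assumes the counter wrapped at most once between the two readings.
    """
    delta = end_uj - start_uj
    if delta < 0:
        delta += max_range_uj
    return delta / 1_000_000


def rank_by_energy(runs, max_seconds: float, min_accuracy: float):
    """Keep runs inside the time and accuracy bounds, lowest total energy first."""
    valid = [
        r for r in runs
        if r.wall_seconds <= max_seconds and r.accuracy >= min_accuracy
    ]
    return sorted(valid, key=lambda r: r.total_joules)


# GPU numbers for the baseline are from the text; cpu_joules is made up.
baseline = Run("modded_nanogpt", 322.7, 54_784.0, cpu_joules=6_000.0, accuracy=0.7285)
# Entirely hypothetical run that shifts work from the GPU to the CPU.
cpu_heavy = Run("cpu_heavy_example", 320.0, 30_000.0, cpu_joules=40_000.0, accuracy=0.73)

ranking = rank_by_energy([baseline, cpu_heavy], max_seconds=330.0, min_accuracy=0.72)
gpu_only_winner = min([baseline, cpu_heavy], key=lambda r: r.gpu_joules)
```

In this made-up example the CPU-heavy run wins on GPU energy alone but loses once CPU energy is counted, which is the failure mode the RAPL proposal targets.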

General (67 messages)

  • Anastasiia Zhiboedova shared SutroAna, an agent harness for a focused "improve this one problem" loop (github.com/adotzh/SutroAna). Yaroslav confirmed it works out of the box.
  • sutro-problems is positioned as the "mess around" repo, where people experiment freely and the useful parts get distilled later.
  • Armins ran a sweep of alternative LM architectures for wikitext (forward-forward, Mamba, Hyena/SSM). Forward-forward reaches 0.39 accuracy at roughly 10x fewer Joules than modded_nanogpt; the state-space models have not converged yet.
  • Getting-started directions posted for wikitext: scale forward-forward toward 70%, get an SSM to converge, find the ceiling of prediction-by-partial-matching, parallelize Decoupled Greedy Learning.

challenge #2: energy-efficient matmul (35 messages, launched Apr 30)

  • Task: minimize data-movement energy for matmul, expressed as an intermediate representation (IR) on Bill Dally's 2D grid. Repo: sutro-problems/matmul.
  • The 16x16 matmul record ground down over the window from 68,452 (Sung Jae Bae, May 5) to 68,392 (Cosmin Negruseri, May 13) to 67,821 (wave 15, May 14). Cosmin ran a Codex-driven auto-research loop producing a steady stream of small PRs.
  • Lower bounds are hard. An agent-generated lower bound turned out to be wrong on inspection; Yaroslav notes even AlphaTensor could not get a lower bound for 4x4.
  • Submission protocol: the scorer (matmul.py) must not be modified.

challenge #3: energy-efficient sparse parity (new channel, 36 messages, May 8 to 12)

  • A grid-model version of sparse parity: solve it using only about 9 instructions on the Dally grid. Distinct from Challenge #1, which used the ByteDMD element-level metric. Repo: sutro-problems/sparse-parity.
  • Sung Jae Bae found that tiling does not help at this problem size, but precomputing intermediate XORs plus bit-packing works well.
  • Yaroslav considers this format hard to cheat on because the op set is tiny, and notes no agent-cheating results have appeared with the IR approach.
  • Open question raised here: whether continuous floating-point numbers are needed at all, or whether integer arithmetic and small (8-bit) instruction sets are preferable, since fewer ops means fewer hidden "free" optimizations for agents to exploit.
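
Bit-packing, one of the two tricks reported above, can be illustrated in ordinary Python. The idea is to store each input feature as one integer whose bit i holds that feature's value in sample i, so a single XOR of two such integers updates the parity of every sample at once. This is a minimal sketch under that assumption, not the submitted solution and not Dally-grid IR. The function names and the tiny dataset are hypothetical.

```python
def pack_columns(samples):
    """Pack 0/1 rows into one integer per feature (bit i = value in sample i)."""
    n_features = len(samples[0])
    cols = [0] * n_features
    for i, row in enumerate(samples):
        for j, bit in enumerate(row):
            if bit:
                cols[j] |= 1 << i
    return cols


def parity_packed(cols, secret):
    """XOR the packed columns of the secret features: one XOR per feature."""
    acc = 0
    for j in secret:
        acc ^= cols[j]
    return acc


def unpack(word, n_samples):
    """Expand a packed word back into a list of per-sample bits."""
    return [(word >> i) & 1 for i in range(n_samples)]


def parity_naive(samples, secret):
    """Reference implementation: one sum per sample."""
    return [sum(row[j] for j in secret) % 2 for row in samples]


# Hypothetical data: 3 samples, 4 features, secret subset {0, 2, 3}.
samples = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
]
secret = (0, 2, 3)

packed_result = unpack(parity_packed(pack_columns(samples), secret), len(samples))
naive_result = parity_naive(samples, secret)
```

Both paths give the same labels. The packed path does len(secret) XORs in total, where the naive path does that much work per sample.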

chat-yaroslav (24 messages)

  • The two-track framing again: theory (solving learning problems abstraction-free on the Dally grid) and applied (the same, using PTX or real hardware).
  • Extending sparse parity to the 2D grid model, possibly by adding a few instructions to simplified-dally-model.
  • wikitext cold-start work: Modal memory snapshots, and the observation that importing torch costs roughly 20,000 file reads.
  • Yaroslav connected with Niv AI, who offer finer-grained GPU power monitoring than Nvidia SMI, with a follow-up planned.

In-person meetings (16 messages)

  • May 4 (meeting #16): guest talk by Russ Pantone, formerly of Rain.AI, on the company's history building AI chips and the lessons learned.
  • May 11 and May 18 meetings continued the weekly Monday cadence. The May 18 meeting was at 380 Brannan St, where Yaroslav walked through benchmark and records progress.

Dormant channels (now synced, no activity in the window)

  • challenge #1: sparse parity (last message Mar 26)
  • Pitch / Talking Points (last Mar 28)
  • Introductions (last Apr 11)
  • makemore task results (last Feb 17)

The v2 / v3 Metric Roadmap

A thread running across chat-yad and chat-yaroslav. The plan is two filter passes over the hinton-problems and schmidhuber-problems stub catalogs:

  • v2: which stubs can the ByteDMD metric instrument at all.
  • v3: which of those can the Bill Dally 2D-grid model instrument.

The survivors become the source of the next hill-climbing competition. Yaroslav has floated skipping straight to v3. The matmul and grid sparse-parity challenges are early instances of the v3 model in practice. SutroYaro Task 015 (added May 11) is the cross-repo index for this instrumentation work.
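
Both filters ask whether a stub's memory accesses can be instrumented. As a toy illustration of the reuse-distance idea from the hinton-problems thread, the sketch below computes LRU stack depths for an access trace and charges each reuse the ceiling of the square root of its depth. That cost rule is an assumption about the general flavor of a data-movement metric, not the ByteDMD implementation or the 2D-grid model. The function names and the choice to charge nothing for first-time accesses are hypothetical.

```python
import math


def reuse_depths(trace):
    """LRU stack depth of each access (1 = most recently used, None = first use)."""
    stack = []  # least recently used first, most recently used last
    depths = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            depths.append(len(stack) - idx)
            stack.pop(idx)
        else:
            depths.append(None)
        stack.append(addr)  # the accessed address becomes most recently used
    return depths


def movement_cost(trace):
    """Sum of ceil(sqrt(depth)) over reuses; first uses are free in this toy."""
    total = 0
    for depth in reuse_depths(trace):
        if depth is not None:
            total += math.isqrt(depth - 1) + 1  # integer ceil(sqrt(depth))
    return total


interleaved = list("abab")  # each reuse reaches past one other address
grouped = list("aabb")      # each reuse hits the most recent address

cost_interleaved = movement_cost(interleaved)
cost_grouped = movement_cost(grouped)
```

The two traces touch the same addresses the same number of times, but the grouped order costs less because every reuse sits at the top of the stack.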

GitHub State Across Repos

Repo | State | Recent
SutroYaro | 6 issues, 0 PRs | PR #97 merged May 14 (migrated research code out to sparse-parity-challenge). Reshuffle #96 still open.
sutro-problems | 1 open PR (#7) | Home of the matmul, sparse-parity, and symmetry challenges. Active.
hinton-problems | 2 open PRs (#60, #61) | 53 stubs shipped May 3. v2 ByteDMD instrumentation underway.
schmidhuber-problems | 0 open PRs | 58 stubs shipped May 8.
wikitext | 0 open PRs | New repo, split from sutro-problems May 12. Very active.
sparse-parity-challenge | 1 open PR (#43) | Received the SutroYaro migration (PR #40). Submission issue #41 pending.
ByteDMD | 2 issues, 1 PR, all old | Quiet. Last default-branch commit Apr 30.

New and Returning People

  • Armins (handle falsefalsenottrue): leading the wikitext challenge.
  • Gabriel Nakajima An: wikitext energy-metering work.
  • Cosmin Negruseri: matmul auto-research, now attending in person.
  • Sung Jae Bae: matmul and grid sparse-parity contributor.
  • Louka Ewington-Pitsos: new, reading into forward-forward.
  • Anastasiia Zhiboedova: SutroAna harness, matmul work.
  • Russ Pantone: May 4 guest speaker (ex-Rain.AI).

What's Open and What's Next

For SutroYaro specifically

  • Reshuffle issue #96 is partly done. PR #97 removed the migrated research code. Decide what the "strip to lab memory" plan still requires and whether #96 can be narrowed or closed.
  • docs_config.json does not track the two Yaroslav strategy docs (higher-level thinking, bounds summary) or any meeting notes after Meeting #9. Decide which to add so future Google Docs syncs pick them up.
  • The CLAUDE.md "Current Best Methods" table is still pre-ByteDMD. Task 015 is the natural home for re-measuring it.

Decisions in flight elsewhere

  • Whether wikitext counts an explicit CPU energy term (RAPL) or stays GPU-only. Live in the wikitext channel as of today.
  • Whether the grid sparse-parity challenge keeps floating-point ops or moves to an integer / 8-bit instruction set.

Cadence

Last catch-up was Apr 30. Next one should land before the Monday May 25 meeting to keep the weekly rhythm.