# Changelog
All notable changes to this research workspace.
## [0.29.0] - 2026-04-25
### ByteDMD floor-gap survey (PR #87 Seth + PR #88 Yad)
Answers Yaroslav's Apr 20 question: how far are current solutions from the ByteDMD floor?
| Method | ByteDMD | vs read-n floor (n=20: 70) | Geometric LB (0.3849×) | Correct |
|---|---|---|---|---|
| KM-min (1 sample/bit) | 268 | 3.8× | 103 | ✓ |
| GF(2) Gauss elim | 101,501 | 1,450× | 39,068 | ✓ |
| Fourier (Walsh, N=50) | 5,156,954 | 73,671× | 1,984,912 | ✓ |
| SGD demo (tiny) | 84,592 | 1,208× | 32,559 | ✗ (chance level) |
- Pure-Python implementations so ByteDMD tracks every read (no numpy escape hatches). KM-min and GF(2) shipped first via Seth's PR #87; Fourier + SGD-demo + the geometric LB column shipped via PR #88 stacked on top. Both squashed into `main` together.
- Geometric lower bound factor (0.3849) wired in per Yaroslav's 2026-04-23 clarification — 0.3849 × measured ByteDMD lower-bounds the actual VLSI allocation cost of the "best" oracle allocator. Requires live-byte counting (current ByteDMD post-PR #80). Proof: Gemini DeepThink, reviewed with Toranosuke Ozawa at `cybertronai/ByteDMD/.../tarjan-detailed-part1.pdf`.
- Headline finding: KM-min sits at 3.8× the floor (with oracle-paired inputs); algebraic methods span 3 orders of magnitude. Fourier's cost is dominated by O(C(n,k)) subset enumeration. SGD-demo with a converging config would be 3-4 orders higher than the chance-level number shown.
- Open question on PR #87: Q2 (oracle-query KM-min as fair reference floor even though not benchmark-submittable) — still awaiting Yaroslav.
### Nix devShell + Task 11 docs (PR #84, PR #85, Andy)
- PR #84: `pkgs.sqlite` added to the Nix devShell so the `sqlite3` CLI is one `nix develop` away. Used by the `sutro-sync` / `weekly-catchup` / `prepare-meeting` skills for `telegram.db` queries.
- PR #85: DeepSeek Engram offload observation promoted from issue #77 into durable docs at `docs/tasks/Task 11/` (task spec + reusable agent prompt + findings path), following the #73 / Muon pattern. Discoverable to non-GitHub-indexed agents (Gemini, Qwen, Kimi).
### Issue cleanup pass (9 more closed)
- Closed against PR #82 evidence (sprint work landed earlier but issues weren't auto-closed): #7, #8, #27, #30, #43, #56, #61.
- Closed in favor of Task 11 promotion: #77 (now lives at `docs/tasks/Task 11/`).
- Closed as superseded with explanatory comment: #60 (philoengineer's Telegram MTProto CLI scripts, replaced by the SQLite-backed `bin/tg-{sync,post,auth}` from v0.26.0).
### `follow-up` label introduced
Yellow #FBCA04, description "Spec posted, awaiting follow-up implementation." Applied to remaining open issues that have substantive specs but haven't been built yet: #5, #14, #54.
### Auto-generated repo diagrams (PR #91)
Both `docs/research/repo-tree.md` (D3 interactive tree) and `docs/research/repo-layout.md` (Mermaid graph) now regenerate from `docs/research/_diagrams.yaml` via `bin/regen-diagrams`. Pan/zoom HTML and JS are untouched — only the data blocks (between BEGIN_AUTOGEN / END_AUTOGEN markers) get rewritten.
CI workflow `diagram-staleness.yml` runs `bin/regen-diagrams --check` on PRs that touch the YAML or generated files; it fails the build if a regen would change anything, with the fix in the error message.
Drift caught on first regen:
- findings_count: hardcoded 38 → actual 41 (auto-counted via glob)
- task_count: hardcoded "specs 1-10 + INDEX" → actual 11
- Tree was missing GEMINI.md (added Apr 21 via PR #72)
- Tree was missing challenges/ (added Apr 20 via PR #82)
- bin/ was missing complexity-check, score-all, regen-diagrams
Auto-counts computed at regen time:
- findings_count: glob findings/exp_*.md
- experiments_jsonl_count: line count of research/log.jsonl
- task_count: glob docs/tasks/[0-9]*-*.md
Contributor workflow: edit `_diagrams.yaml` → run `bin/regen-diagrams` → commit the YAML + regenerated .md files. CI catches PRs that edit the YAML without regenerating.
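The three auto-counts above can be sketched in a few lines of Python (illustrative only — `bin/regen-diagrams` is the actual implementation; the globs are the ones listed above, and the `auto_counts` helper name is hypothetical):

```python
from pathlib import Path

def auto_counts(root: str = ".") -> dict:
    """Recompute the diagram data counts at regen time instead of hardcoding them.

    Mirrors the three globs listed above; `root` is the repo checkout.
    """
    r = Path(root)
    log = r / "research" / "log.jsonl"
    return {
        # one findings file per experiment
        "findings_count": len(list(r.glob("findings/exp_*.md"))),
        # one JSONL line per logged experiment
        "experiments_jsonl_count": sum(1 for _ in log.open()) if log.exists() else 0,
        # numbered task specs only (INDEX etc. excluded by the pattern)
        "task_count": len(list(r.glob("docs/tasks/[0-9]*-*.md"))),
    }
```

Counting at regen time is what catches the drift listed above: a hardcoded 38 can go stale, a glob cannot.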
### Documentation hygiene
- Broken mkdocs link fixed in `docs/tasks/009-muon-review.md` — `../../research/search_space.yaml` (lives outside the docs tree) replaced with a GitHub URL. `mkdocs build --strict` now passes clean (only the unrelated MkDocs 2.0 framework deprecation banner remains).
- Google Docs sync committed (PR #89) — 13 doc files refreshed (+937 lines across meeting notes, knowledge sprints, bigger-picture); `docs/references_auto.md` link harvest +184 entries; `docs/superpowers/plans/2026-03-16-egd-sparse-parity.md` (a March plan that was untracked); root `package-lock.json` checked in next to its `package.json`.
### Pull requests landed
| PR | Title | Author |
|---|---|---|
| #84 | Add sqlite to devShell for telegram.db queries | @zh4ngx |
| #85 | docs: Task 11 — DeepSeek Engram offload ByteDMD verification | @zh4ngx |
| #87 | exp: ByteDMD floor-gap survey — KM-min 268 vs GF(2) 101,501 | @SethTS |
| #88 | exp: extend floor-gap survey with Fourier + SGD-demo + geometric LB | @0bserver07 |
| #89 | docs(sync): Google Docs refresh + untracked plans/lockfile cleanup | @0bserver07 |
| #91 | diagrams: single source of truth + bin/regen-diagrams + CI staleness check | @0bserver07 |
## [0.28.0] - 2026-04-20
### Issue sprint and repo hygiene
- 14 open issues closed against repo evidence in a single triage pass: #4, #6, #21, #24, #25, #26, #28, #29, #31, #58, #64, #67, #69, #71. Each close comment links to the file, PR, or finding that resolved it.
- Branch protection enabled on `main`: 1 required approval, admin override, no force-push, no deletion. Closes #71.
- 5 stale branches deleted after merge or intentional close: `YAD/adopt-bytedmd`, `yaroslav/tracked-numpy`, `v-alpha`, `YAD/telegram-sqlite`, `fix/claude-md-people-handles`. Closes #69.
### Parallel-agent sprint (PR #82, Yad)
- `bin/complexity-check`: repo size + file-count snapshot, appends JSONL to `.complexity-log.jsonl`, flags >10% Python-line growth. Closes #8.
- `bin/score-all`: discovers `solve_*.py` submissions, runs each under `bytedmd()`, writes a sorted TSV to `results/scoreboard.tsv`. Closes #61 (the GF(2) undercount piece was moot post-ByteDMD).
- LAB.md two-phase results section: evidence bundle in `results/<exp_id>/`, narrative in `findings/<exp_id>.md`. Closes #27.
- Onboarding pass: first-experiment walkthrough in `docs/getting-started.md`, skills table, `--dangerously-skip-permissions` explanation, ByteDMD note, `AGENT.md` redirect to `LAB.md` for humans. Closes #56.
- Skills scaffolding: `examples/` subdirs added to the 4 skills missing them, `references/` subdirs on all 6 skills pointing at canonical docs (`LAB.md`, `AGENT.md`, `DISCOVERIES.md`, `sync-runbook.md`, `telegram-setup.md`, `bytedmd.md`). Closes #30.
- Pip entry points: `[project.optional-dependencies]` (`eval`, `modal`, `dev`, `all`) and `[project.entry-points."gymnasium.envs"]` wired through `sparse_parity.eval:register_all`. Closes #43.
- Three new challenges: `majority-vote`, `threshold`, `noisy-parity` in `src/sparse_parity/challenges/` using the existing registry pattern. LAB.md rule #9 respected (`src/harness.py` untouched). `measure_*` signatures reject unknown kwargs (no silent `**kwargs` typo swallowing). Closes #7.
### Repo diagrams with pan/zoom (PR #83)
- `docs/research/repo-layout.md`: Mermaid graph of the workspace layers (Docs / Code / Ops / Artifacts / Site) with `svg-pan-zoom` controls (+ − ⤾ buttons, scroll to zoom, drag to pan).
- `docs/research/repo-tree.md`: interactive D3.js click-to-expand tree with native `d3-zoom` and the same control scheme.
- `docs/diagrams/`: dropped stale v1 agent-workflow files, promoted v2 to canonical.
### Agent compatibility (PR #72, Andy)
- `GEMINI.md` added for Gemini CLI / Antigravity agents, matching the existing `CLAUDE.md` / `CODEX.md` / `AGENTS.md` set. Closes #13.
- `AGENTS.md` context reference fix (PR #74).
### NoProp + Forward-Forward follow-up (PR #79, Seth)
- Forward-Forward baseline added alongside the NoProp experiment. Denoising beats contrastive in this setting.
- Both experiments flagged as measured under legacy MemTracker, not ByteDMD — pending re-run once metrics stabilise.
### ASI-Evolve research reports (PR #81)
- Three parallel-agent reports mapping the ASI-Evolve evolutionary framework onto SutroYaro's ByteDMD-optimized learning rule search.
- Agent prompts at `docs/agent-prompts/asi-evolve/{algorithms-claude, memory-kimi, execution-*}.md`.
### Muon review prompt (PR #73)
- Reusable review prompt for Muon-related contributions.
- Muon findings moved from `findings/` to the canonical `docs/findings/` path.
### Dev-shell and environment
- `.envrc` restored for direnv auto-loading of the Nix devShell.
- `nodejs` added to the devShell so Claude Code hooks can run.
### Specs posted as follow-up comments
- #77 — ByteDMD microbenchmark spec for DeepSeek Engram's <3% SSD offload claim.
- #9 — Modal nanoGPT energy baseline spec: `bin/gpu_energy_nanogpt.py` + L4 GPU + nvidia-smi integration.
- #14 — Post-task Telegram notify hook spec: `.claude/hooks/post-task-notify.cjs` on the `Stop` event, triple-gated.
- #54 — Pikiclaw evaluation criteria before building a custom PR-review pipeline from scratch.
- #5 — Flagged the `--challenge` flag gap in `bin/run-agent` for the sparse-sum and sparse-and challenges.
## [0.27.0] - 2026-04-14
### ByteDMD adopted as primary metric
- Vendored cybertronai/ByteDMD at `src/bytedmd/`
- 13 tests (10 core + 3 gotchas) all pass
- TrackedArray retained as legacy tracker for existing experiments (30 tests still pass)
- New docs page: `docs/research/bytedmd.md`
- Decision made by Yaroslav after meetings with Wesley Smith and feedback from Bill Dally. Byte granularity rewards smaller dtypes; pure Python eliminates numpy escape hatches.
- Existing challenge submissions stay on legacy DMC. New submissions use ByteDMD.
## [0.26.0] - 2026-03-28
Largest release: auto-instrumented DMD tracking, RL eval environment, agent infrastructure, Telegram SQLite sync, and documentation fixes. 84 commits from 4 contributors.
### Auto-instrumented DMD tracking (Yaroslav)
- TrackedArray: numpy ndarray wrapper that auto-tracks all operations without manual instrumentation. Wrap inputs, run unmodified code, read DMD.
- LRUStackTracker: true per-element LRU stack distances matching Ding et al. Definition 2.1. Writes place data on the stack (free). Only reads cost DMD = sqrt(stack_distance). No cold misses -- inputs arrive pre-loaded.
- GF(2) under-counting fixed: harness reported DMC 8,607 but actual DMD with all row operations tracked is ~203K. The leaderboard was wrong by 24x.
- Verified against known examples: paper example (abbbca, dist=3) and exact (a+b)+a prediction (DMD = 5.146).
- 30 tests organized by concern: wrapper mechanics, indexing, numpy functions, LRU metric, GF(2) integration.
- Docs: `research/tracked-numpy.md` with worked examples and a per-operation DMD breakdown.
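The stack-distance rule above can be illustrated with a minimal sketch (not the vendored LRUStackTracker — just element-level LRU, with cold misses treated as free per the "inputs arrive pre-loaded" convention):

```python
import math

def lru_stack_distances(trace):
    """LRU stack distance per access: the number of distinct elements touched
    since the last access to the same element (inclusive); None = cold miss."""
    stack = []   # most recently used at the end
    dists = []
    for x in trace:
        if x in stack:
            dists.append(len(stack) - stack.index(x))
            stack.remove(x)          # move to top of the LRU stack
        else:
            dists.append(None)       # cold miss: free under this metric
        stack.append(x)
    return dists

def dmd(trace):
    """DMD = sum of sqrt(stack_distance) over reuse accesses only."""
    return sum(math.sqrt(d) for d in lru_stack_distances(trace) if d is not None)

# Paper example from above: in "abbbca" the final 'a' has stack distance 3.
print(lru_stack_distances("abbbca"))  # [None, None, 1, 1, None, 3]
```

The same machinery, applied per array element with writes placed on the stack for free, is what TrackedArray wires into unmodified numpy code.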
### RL evaluation environment (Yad, PR #49)
- Gymnasium environments: `SparseParity-v0` (single challenge) and `MultiChallenge-v0` (all three challenges). The agent picks from 16 methods and gets scored on research quality.
- 12-category grading rubric (72 points): checks whether the agent found GF(2), noticed the ARD/DMC disagreement, observed local learning failing, explored broadly, and found the algebraic solver efficiently.
- 16 methods all runnable: 5 via harness, 9 with live fallback, 2 cached.
- Platform adapters: Anthropic tool-use, PrimeIntellect verifiers, HuggingFace Spaces leaderboard, UK AISI Inspect.
- Registry system: add methods and challenges without editing environment code.
- Demo script, system overview page, eval docs.
### Agent infrastructure (Yad, PR #50)
- 3 hooks: session-start (shows project status), security-guard (blocks edits to measurement code), session-end (session summary).
- 2 rules: experiment reproducibility (seeds, config, environment logging), agent coordination (file ownership, parallel dispatch criteria, post-merge guidance).
- 4 skills: run-experiment (two-phase protocol), weekly-catchup, prepare-meeting, info-defrag.
- LAB.md rules #10 (two-phase output) and #11 (reproducibility) added.
- Docs: `docs/research/agent-infrastructure.md`, workflow diagram.
### Telegram integration (Yad, Issue #58)
- `bin/tg-sync`: incremental Telegram-to-SQLite sync. First run: full backfill. Subsequent runs: only new messages.
- `bin/tg-post`: post to forum topics via the Bot API using per-person bot tokens. Safety guard: disabled by default, requires `TELEGRAM_POST_ENABLED=1`.
- `bin/tg-auth`: standalone interactive MTProto login (replaces the `tg auth login` dependency).
- SQLite schema: `messages(id, topic_id, date, sender, text, reply_to)`. Setup guide: `docs/tooling/telegram-setup.md`.
### Documentation fixes (Yad, Issue #55)
- Experiment counts updated from 33/34 to 36 across 8 files.
- Seth Stafford's bio updated with GrokFast PRs.
- Timeline extended through March 24, system overview and catchup index updated.
## [0.25.0] - 2026-03-24
### GrokFast + Curriculum scaling frontier (PR #53, SethTS)
Maps how far GrokFast + Curriculum scales. k=3 scales effortlessly: n=200 solves in 11 epochs / 95ms, each expansion phase takes 1 epoch. k=5 hits a wall between n=50 and n=100 (60% solve rate, stalls at 94% after expansion). n=200/k=5 fails completely. The 5-way interaction detector learned at small n is too fragile to survive 50+ new noise dimensions.
- Findings: `findings/exp_grokfast_curriculum_scale.md`
## [0.24.0] - 2026-03-23
### GrokFast + Curriculum compounding (PR #52, SethTS)
GrokFast and curriculum attack different axes (k-th order plateau vs n-scaling wall). Combined, they multiply:
- n=20/k=5: 5.8x speedup over SGD (12 epochs, 57ms)
- n=50/k=3: 8.3x speedup (7 epochs, 35ms)
- n=50/k=5: solves in 14 epochs / 77ms where SGD fails completely (0% at 1000 epochs)
Curriculum shields GrokFast from its noise-dimension weakness by keeping n small during the critical learning phase. 60 runs, 5 seeds each, 100% solve rate on all combined configurations.
- Findings: `findings/exp_grokfast_curriculum.md`
## [0.23.0] - 2026-03-23
### GrokFast v2 experiment (PR #51, SethTS)
First external contribution. Seth tested GrokFast across 3 difficulty regimes (75 total runs, 5 seeds each):
- WIN on n=20/k=5: aggressive GrokFast (a=0.98, l=2.0) gives 2.5x fewer epochs (29 vs 73) and 2.3x faster wall time than SGD. The EMA accumulates the exponentially weak k-th order gradient signal.
- LOSS on n=30/k=3: 40% solve rate. More noise dimensions means the EMA amplifies noise.
- NEUTRAL on n=20/k=3: mild settings match SGD. Confirms exp4 finding that GrokFast is counterproductive when hyperparams are already correct.
The critical variable is interaction order (k), not input dimension (n). Mild GrokFast (a=0.95, l=1.0) was never worse than SGD across any regime.
- Findings: `findings/exp_grokfast_v2.md`
- Experiment: `src/sparse_parity/experiments/exp_grokfast_v2.py`
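For reference, the GrokFast filter being swept here is only a few lines. A sketch of the Lee et al. 2024 EMA rule (not the repo's implementation; `alpha`/`lamb` correspond to the a/l settings above):

```python
import numpy as np

def grokfast_step(w, grad, ema, lr=0.1, alpha=0.98, lamb=2.0):
    """One SGD step with the GrokFast EMA filter: track the slow
    (low-frequency) gradient component in an EMA and amplify it."""
    ema = alpha * ema + (1 - alpha) * grad   # accumulate the slow component
    filtered = grad + lamb * ema             # boost the slow direction
    return w - lr * filtered, ema
```

The WIN/LOSS pattern above falls out of this directly: when the k-th order signal is the slow component, the EMA accumulates it; when extra noise dimensions dominate, the same EMA amplifies noise instead.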
## [0.22.0] - 2026-03-22
### DMC baseline sweep, optimization, and infrastructure (Issues #15-#22)
Headline: DMC rankings disagree with ARD. New best method found.
- DMC baseline sweep (#17): Measured all 5 methods across 3 challenges (sparse-parity, sparse-sum, sparse-and). 14 total runs. GF2 wins DMC on parity (8,607) despite KM winning ARD (92). Fourier's DMC is 78 billion (9M times worse than GF2).
- DMC optimization (#22): KM-min achieves DMC of 3,578 -- 58% lower than GF2 baseline. Single influence sample per bit suffices because parity influence is deterministic (exactly 0 or 1). Also discovered GF2's harness-measured DMC is artificially low (true DMC with fine-grained tracking: 189,056).
- fast.py tracker integration (#15): Added an optional `tracker` parameter to fast.py. Zero overhead when disabled. Reports ARD 7,210 / DMC 850,131 for the default config.
- Scoreboard backfill (#16): Filled DMC values for 21 of 35 scoreboard rows. 12 rows marked `needs_measurement`. Added a `dmc_source` column to distinguish measured vs estimated values.
- DMC visualization (#18): Created `src/plot_dmc.py` with 3 plots: DMC-vs-ARD scatter, parity ranking bar chart, cross-challenge comparison. Output in `results/plots/`.
- Weekly catch-up section: Added `docs/catchups/` with a first entry (Mar 16-22). Covers Meeting #9 outcomes, Telegram activity, infrastructure inventory.
- Meeting #9 notes synced: Added internal notes doc to the Google Docs sync pipeline.
- Public Domain license (#20): Added Unlicense to repo root.
- 8 new GitHub issues created: #15-#22 covering DMC infrastructure, optimization, RL env prototype, license, and Mar 30 meeting prep.
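The KM-min trick above (one influence sample per bit) can be sketched as a bit-flip probe (illustrative only — `probe_secret_bits` is a hypothetical name, not the benchmark submission, and no movement tracking is shown):

```python
import numpy as np

def probe_secret_bits(X, label_fn, rng=None):
    """One influence sample per bit: flip bit i of a stored input and check
    whether the label flips. For parity, influence is deterministic (exactly
    0 or 1), so a single probe per bit identifies the secret set."""
    rng = np.random.default_rng(rng)
    secret = []
    for i in range(X.shape[1]):
        row = X[rng.integers(len(X))].copy()
        before = label_fn(row)
        row[i] ^= 1                      # flip bit i
        if label_fn(row) != before:      # label flipped -> bit i is secret
            secret.append(i)
    return secret
```

For noisy or non-parity targets the influence is probabilistic and multiple samples per bit are needed, which is exactly the 5-sample KM variant in the table below.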
### Results
| Method | ARD | DMC | Rank shift |
|---|---|---|---|
| KM-min (new) | 20 | 3,578 | -- (new best) |
| GF2 | 420 | 8,607 | ARD #2, DMC #1 (was) |
| KM (5 samples) | 92 | 20,633 | ARD #1, DMC #2 |
| SGD | 8,504 | 1,278,460 | Same |
| Fourier | 11,980,500 | 78,140,662,852 | Same |
- Findings: `findings/exp_dmc_optimize.md`, `results/dmc_baseline_sweep.md`
- Plots: `results/plots/dmc_vs_ard.png`, `dmc_ranking_parity.png`, `dmc_cross_challenge.png`
## [0.21.0] - 2026-03-16
### Egalitarian Gradient Descent experiment (Issue #4)
- Implemented EGD (arXiv:2510.04930): replaces gradient singular values with 1 via SVD, equalizing learning rates across all directions.
- CPU experiment (`exp_egd.py`): EGD halves the grokking plateau. 14 epochs to 90% vs SGD's 33 (2.6x fewer). Solves in 21 epochs vs 40. Both at lr=0.1.
- GPU experiment (`gpu_egd.py` via Modal L4): EGD is 12% slower in wall time despite 2x fewer epochs. The SVD overhead per batch outweighs the epoch savings.
- Sparse sum comparison: SGD diverges at lr=0.1 (0/5 seeds); EGD solves 5/5. SVD normalization removes gradient magnitude, preventing scale-related divergence.
- Sub-10ms not achievable. The small hidden layer (50) is capacity-limited for both methods.
- Findings: `findings/exp_egd.md`
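The core EGD step is tiny. A numpy sketch of the singular-value flattening described above (not the `exp_egd.py` code):

```python
import numpy as np

def egd_direction(grad):
    """EGD update direction: SVD the gradient matrix and replace every
    singular value with 1, equalizing the learning rate across directions."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return U @ Vt  # same singular vectors, unit singular values
```

Each step then applies something like `W -= lr * egd_direction(grad_W)` per weight matrix; the per-batch SVD here is also why EGD loses on GPU wall time despite fewer epochs.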
## [0.20.0] - 2026-03-15
### SGD speed sweep and research hypotheses (Issue #4)
- Swept SGD configs: standard SGD floors at ~70-116ms (7 grokking epochs). Can't hit 10ms.
- Tested L-BFGS: 35-60ms. Faster but still needs many function evaluations.
- Tested Sign SGD: best single run 7.6ms (h=50, n=500, lr=0.1) but only 3/5 seeds solve. With n=1000 all 5 seeds solve at mean 29ms.
- Added 8 research hypotheses to TODO.md with paper references: EGD, Grokfast (corrected), GrokTransfer, warm start from GF(2), lottery ticket, higher weight decay, curriculum+EGD, L-BFGS.
- Deleted `findings/gpu_energy_baseline.md` (contained inflated results from an earlier pynvml run)
- Cleaned up homepage and tooling page references
- Findings: `findings/exp_sgd_speed.md`
## [0.19.0] - 2026-03-14
### GPU measurement via Modal Labs (Issue #6)
- Added `bin/gpu_energy.py`: runs GF(2), SGD, KM on an NVIDIA L4 via Modal using PyTorch CUDA (matching Yaroslav's gpu_toy.py approach)
- Finding (5 runs): GPU is slower than CPU for all methods at n=20/k=3. SGD mean 1446ms GPU vs 142ms CPU (10x slower). KM mean 869ms vs 1.1ms (790x slower). GF(2) 2.0ms vs 0.5ms (4x slower). The tensors are too small for the CUDA overhead to amortize.
- The ARD vs DMC proxy comparison remains unanswered. Needs nanoGPT-scale workloads.
- Added `findings/_experiment_template.md` with required sections (Question, What was performed, What was produced, Can it be reproduced, Finding)
- Added integrity rules to AGENT.md (don't inflate results, classify honestly)
- Added scripts table to tooling overview page
- Updated README project structure
- Findings: `findings/exp_proxy_comparison.md`
## [0.18.0] - 2026-03-14
### Reproduce-all script
- Added `bin/reproduce-all`: runs all 14 experiments across 3 challenges, verifies results match baselines
- Supports a `--budget MS` flag to skip experiments over a time budget (Yaroslav's "Spark 7 constraint": only run what fits in 1980s compute budgets)
- All 14 experiments reproduce in 0.28 seconds. With `--budget 10` (10ms): 6 pass, 8 skip, 0.08 seconds total
- GF(2) on sum/and marked SKIP (expected fail, not a regression)
## [0.17.0] - 2026-03-14
### Three challenges, adding-a-challenge guide, Antigravity validation
- Added sparse sum (Challenge 2) and sparse AND (Challenge 3) to the harness
- Created `docs/research/adding-a-challenge.md`: step-by-step guide for agents and contributors to add new tasks
- Sparse AND was added by Google Antigravity (agent IDE) following the guide without human help, validating that the guide works for autonomous agents
- Harness now supports a `--challenge` flag: `sparse-parity` (default), `sparse-sum`, `sparse-and`
- Sparse sum baselines: SGD 100% in 1 epoch (ARD 20), KM 100% (ARD 92)
- Sparse AND baselines: SGD 100% in 4 epochs, KM needs 20 samples (not 5) due to 1/2^(k-1) influence signal
- Updated search_space.yaml, questions.yaml, DISCOVERIES.md, TODO.md, baseline_check.py for both challenges
## [0.16.0] - 2026-03-14
### Repo migration, multi-topic Telegram sync, and skills
- Moved repo from `0bserver07/SutroYaro` to `cybertronai/SutroYaro` (updated 15 files, git remote, GitHub Pages URL)
- Telegram sync now pulls 6 topics in priority order (chat-yad, chat-yaroslav, challenge #1, General, In-person, Introductions) to separate JSON files
- Added `sutro-sync` skill: session-start routine for Telegram, Google Docs, GitHub checks
- Added `sutro-context` skill: research context loader (DISCOVERIES.md, open questions, recent discussion)
- Re-synced all 15 Google Docs with upstream changes
- Both of Andy's PRs merged (#2 TODO cleanup, #3 GF(2) noise experiment), Issue #1 closed
- Deploy workflow confirmed working on new org
## [0.15.0] - 2026-03-11
### Peer research protocol and autonomous agent infrastructure
Inspired by analysis of Karpathy's autoresearch and trevin-creator's Tiny-Lab, but built for multi-researcher use with our own naming conventions.
New files:
- AGENT.md -- machine-executable experiment loop for autonomous sessions
- src/harness.py -- locked evaluation harness (GF2, SGD, KM, Fourier, SMT) with CLI
- research/search_space.yaml -- bounded mutation space (16 methods, allowed parameter values)
- research/questions.yaml -- dependency graph of 12 research questions (9 resolved, 6 open)
- research/log.jsonl -- all 33 experiments from DISCOVERIES.md in machine-readable format
- results/scoreboard.tsv -- auto-generated leaderboard from log.jsonl
- results/progress.png -- ARD progress chart over experiment history
- checks/env_check.py -- pre-flight environment verification
- checks/baseline_check.py -- re-establish GF2/SGD/KM baselines per machine
- bin/run-agent -- tool-agnostic launcher with looped mode, circuit breaker, PID lock
- bin/merge-findings -- merge contributor log.jsonl entries via PR
- bin/analyze-log -- progress report and chart generation
- docs/research/peer-research-protocol.md -- full design doc with nanoGPT migration proposal
Design choices:
- Tool-agnostic: bin/run-agent --tool claude|gemini|custom works with any AI CLI
- Looped mode: multiple short cycles with fresh context, resilient to crashes
- Circuit breaker: halts if 5+ INVALID in last 20 experiments
- Harness integrity: SHA256 verified before and after each run
- All state in files: any tool that reads/writes files can participate
- Challenge-agnostic log schema: challenge field supports sparse parity now, nanoGPT later
- Researcher attribution: researcher field in log entries for peer merge
Updated files:
- CLAUDE.md -- added autonomous research infrastructure section, harness to isolation rule
- results/scoreboard.tsv -- generated from full 33-experiment log
## [0.14.0] - 2026-03-11
### Feedback tasks from Meeting #8
- Added Data Movement Complexity (DMC) metric to MemTracker (Ding et al., arXiv:2312.14441). DMC = sum of sqrt(stack_distance) for all float accesses. Baseline: ARD 4,104 / DMC 300,298.
- Confirmed stack distance already implemented (tracker clock advances by buffer size, not instruction count)
- Added metric isolation rule to LAB.md (rule #9): agents cannot modify measurement code
- Created task tracker in `docs/tasks/` with 6 tasks from Meeting #8 feedback
- Updated baselines table in LAB.md with DMC column
- Reproduced Germain's hidden=64 result: ARD drops 68% but ARD/float is identical (0.36). Smaller model, not better locality.
- Reviewed linear classifier paper (arXiv:2309.06979): CoT-based, not applicable to one-shot sparse parity benchmark
## [0.13.0] - 2026-03-10
### Meeting #8 docs and sync runbook
- Synced 6 new Google Docs from Meeting #8 (09 Mar 2026): notes, AI notes, Yaroslav knowledge sprint 2, Yaroslav GF(2) verification, Michael's Claude approach, The Bigger Picture roadmap
- Added all new docs to MkDocs nav with cross-reference headers
- Updated meetings index and notes pages with Meeting #8 summary
- Created sync runbook (`docs/tooling/sync-runbook.md`) with weekly/daily/per-session checklists
- Added GitHub PR/issue checking to the CLAUDE.md "Before Pushing" checklist
- Total synced Google Docs: 15 (up from 9)
## [0.12.0] - 2026-03-09
### GF(2) noise tolerance experiment
- Added exp_gf2_noise experiment testing algebraic solver with label noise
- Key finding: Basic GF(2) fails at 1% noise; robust subset-sampling recovers up to 10-15%
- New experiment: `src/sparse_parity/experiments/exp_gf2_noise.py`
- Findings: `findings/exp_gf2_noise.md`
- Updated DISCOVERIES.md with noise tolerance results
## [0.11.0] - 2026-03-07
### Homepage and documentation refresh
- Rewrote homepage as a proper introduction (was jumping straight to "SOLVED")
- Updated context page with Phase 2 findings and timeline
- Synced 8 Google Docs with upstream changes (new Bookmarks section, Yaroslav link)
- Added "How to find things in Sutro Group" doc to sync config and nav
- Fixed sync script to preserve cross-reference headers across re-syncs
## [0.10.0] - 2026-03-07
### Review fixes and project documentation
- Fixed RL bandit tracker bug (stale loop variable `i` instead of `arm_idx`)
- Re-sorted TL;DR table by verdict tier then speed, ranked Target Propagation as #33
- Updated CLAUDE.md with Phase 2 results, best methods table, Telegram sync reference
- Added AGENTS.md documenting how 17 parallel Claude Code agents were used
- Added `.env.example` for Telegram API credentials
- Gitignored `messages.json` (contains real group messages)
## [0.9.0] - 2026-03-07
### Phase 2: 17 experiments + Practitioner's Field Guide
Phase 2 experiments dispatched 17 independent Claude Code agents in parallel, each testing a different algorithmic approach:
- Algebraic/Exact: GF(2) Gaussian elimination (509 µs, 240x faster than SGD), Kushilevitz-Mansour influence estimation (ARD 1,585, 724x better than Fourier), SMT backtracking
- Information-Theoretic: Mutual Information, LASSO, MDL Compression, Random Projections -- all solve it, none beats Fourier meaningfully
- Local Learning Rules: Hebbian, Predictive Coding, Equilibrium Propagation, Target Propagation -- all failed at chance level (parity requires k-th order interaction detection)
- Hardware-Aware: Tiled W1 (software ARD worsened), Pebble Game (2.2% energy savings), Binary Weights (fails at n=20)
- Alternative Framings: Genetic Programming (exact formula but doesn't scale), RL Bit Querying (ARD of 1 at inference), Decision Trees (greedy splitting can't find secret bits)
Practitioner's Field Guide (`docs/research/survey.md`): 4,500-word survey ranking all 33 experiments with a decision framework, 10 generalized principles, and the full AI research methodology.
Telegram sync tooling: `sync_telegram.ts` pulls messages from Sutro Group topic threads via MTProto. Full setup guide in `docs/tooling/automation.md`.
Updated DISCOVERIES.md, mkdocs nav (survey + 17 findings pages), and research index.
## [0.8.0] - 2026-03-04
### Blank-slate approaches (Round 3, no neural nets, no SGD)
Rounds 1-2 were incremental variations on the same MLP+SGD recipe. Round 3 reframes the problem: sparse parity as a search problem, not a learning problem.
For k=3 parity with secret indices {a,b,c}, the label is x[a] * x[b] * x[c]. You don't need a neural network. You need to search over C(n,k) possible subsets and test which one matches the data.
- Fourier / Walsh-Hadamard solver: Sparse parity IS a Fourier coefficient. For each candidate k-subset S, compute `mean(y * product(x[:, S]))`. The true secret gives correlation ~1.0, everything else gives ~0. No training, no gradients, no iterations. Result: 13x faster than SGD (0.009s vs 0.12s for n=20/k=3), solves n=50/k=3 and n=20/k=5 trivially where SGD struggles. Only needs 20 samples. Scales to k=7 before combinatorial explosion (C(n,k) subsets).
- Evolutionary / random search: Randomly sample k-subsets, test if `product(x[:, subset])` matches all labels. For n=20/k=3 it takes ~881 random tries (0.011s). Evolutionary search with mutation+crossover solves it in fewer evaluations but more wall time. Solves n=50/k=3 in 0.14s, a config SGD fails on entirely.
- Feature selection: Tried to decompose into "find the bits" then "classify." Exhaustive combo search works (178-1203x fewer ops than SGD). But the clever approaches fail: pairwise correlations are provably zero for parity (E[y * x_i * x_j] = 0 for ALL pairs, even correct ones). Greedy forward selection also fails. Parity has zero low-order statistical signatures. You must test the full k-way interaction. Neural nets need so many iterations because they implicitly search the combinatorial space via gradient descent.
When to use what:
- k ≤ 7: Fourier/random search wins (milliseconds, exact, guaranteed)
- k ≥ 10: C(n,k) explodes, SGD's implicit search via gradients becomes the only feasible path
- k = 8-9: hybrid approaches may work (combinatorial search with pruning)
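The Fourier solver above fits in a few lines. A sketch assuming ±1-encoded inputs and labels, matching the `mean(y * product(x[:, S]))` correlation test (the `fourier_solve` name is hypothetical):

```python
import itertools
import numpy as np

def fourier_solve(X, y, k):
    """Walsh/Fourier solver: the secret k-subset S is the one whose parity
    character correlates with the labels (~1.0); all others sit near 0."""
    n = X.shape[1]
    return max(itertools.combinations(range(n), k),
               key=lambda S: np.mean(y * np.prod(X[:, S], axis=1)))

# Tiny demo: n=10, k=3, secret {1, 4, 7}, 40 samples of +/-1 bits.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(40, 10))
y = X[:, 1] * X[:, 4] * X[:, 7]
print(fourier_solve(X, y, 3))  # recovers (1, 4, 7)
```

The `itertools.combinations` loop is also where the k ≥ 10 wall comes from: the subset enumeration itself is C(n,k).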
## [0.7.0] - 2026-03-04
### Research experiments (Round 2, autonomous agent team)
Round 2 explored variations on the working SGD solution: different optimizers, hyperparameter sweeps, and energy measurement improvements.
- Sign SGD: Replace the gradient with its sign: `W -= lr * sign(grad)`. Normalizes gradient magnitudes, helping detect sparse features. Solves k=5 2x faster than standard SGD (7 vs 14 epochs). Standard SGD also solves k=5 with enough data (n_train=5000). The exp_d "failure" at 61.5% was insufficient data, not algorithm limits.
- Curriculum learning: Train on easy configs first, then expand the network. n=10/k=3 (instant) → expand W1 with zero-padded columns → n=30/k=3 (1 epoch) → n=50/k=3 (1 epoch). Result: 14.6x speedup, cracks n=50 which direct training can't. The learned feature detector transfers because new columns start near zero.
- Cache-aware MemTracker: Built CacheTracker with LRU cache simulation to fix the broken ARD metric. Finding: L2 cache (256KB) eliminates ALL misses for both methods. Batch wins on total traffic (13% fewer floats, 16x fewer writes), not cache locality. Single-sample is more L1-friendly than batch.
- Weight decay sweep: Swept WD across [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0]. WD=0.01 already optimal. Only [0.01, 0.05] achieves 100% success rate. The effective LR*WD must be in [0.001, 0.005], an extremely narrow range.
- Per-layer + batch: Combined per-layer forward-backward with mini-batch SGD. Converges but not useful: single-sample SGD is 8x faster in epochs, and the per-layer re-forward pass adds 3.7x wall-time overhead.
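The curriculum expansion trick above (zero-padded W1 columns) is a one-liner per layer. A sketch assuming W1 has shape (hidden, n) with one column per input bit (the `expand_inputs` name is hypothetical):

```python
import numpy as np

def expand_inputs(W1, new_n):
    """Grow the input dimension from n to new_n by zero-padding W1's columns.
    New input dimensions start with zero weight, so the detector learned at
    small n survives the expansion."""
    hidden, n = W1.shape
    W1_big = np.zeros((hidden, new_n), dtype=W1.dtype)
    W1_big[:, :n] = W1   # copy the trained detector; new columns stay at 0
    return W1_big
```

Because the new columns contribute nothing to the forward pass initially, accuracy is preserved at the moment of expansion and only the noise dimensions need to be learned away.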
## [0.6.0] - 2026-03-04
### Research experiments (Round 1, measuring energy on the working solution)
With 20-bit solved, round 1 asked: how energy-efficient is our solution, and can we improve it?
- Exp A: ARD on winning config: Measured energy proxy (ARD) on the working config. Per-layer forward-backward gives 3.8% improvement (17,299 vs 17,976 floats). W1 (the big weight matrix) dominates at 75% of all float reads, capping operation-reordering improvements at ~10%.
- Exp B: Batch ARD: Batch-32 shows 17x higher ARD in our metric, but does 16x fewer parameter writes. The ARD metric doesn't model cache. On real hardware where W1 fits in L2, batch would win. This exposed a limitation of our measurement tool.
- Exp C: Per-layer on 20-bit: Update each layer before proceeding to the next. Converges identically to standard backprop (99.5%, same epoch count). Free 3.8% ARD improvement with zero accuracy cost.
- Exp D: Scaling frontier: Mapped where standard SGD breaks. k=3 works to n~30-45. n=50/k=3 fails (54%). k=5 is impractical at any n (~200,000 epochs for n=20). The boundary is at n^k > 100,000 iterations, matching the theoretical SQ lower bound.
- Exp E: Forward-Forward: Hinton's local learning algorithm (no backward pass). Solves 3-bit but fails 20-bit (58.5%). Has 25x WORSE ARD than backprop, opposite of hypothesis. The locality advantage requires 10+ layer networks; our 2-layer MLP is too small to benefit.
- Exp F: Prompting strategies: Documented the meta-workflow: literature search → compare against published baselines → diagnose the gap → fix → verify. The most effective prompt was asking Claude to compare our hyperparams against Barak et al. 2022.
### Speed
`fast.py`: numpy-accelerated training solves 20-bit in 0.12s average (220x faster than pure Python). hidden=200, n_train=1000, batch=32.
### Infrastructure
- LAB.md: protocol for autonomous Claude Code experiment sessions
- DISCOVERIES.md: accumulated knowledge base from all experiments
- `_template.py`: copy-and-modify experiment starter
## [0.5.0] - 2026-03-04
### Solving 20-bit sparse parity
The pipeline from v0.4.0 got 54% accuracy on 20-bit, a coin flip. Literature review of 6 papers diagnosed the gap.
The problem was hyperparameters, not the algorithm. Our LR=0.5 was 5x too high (literature uses 0.1), we used single-sample SGD instead of mini-batch (batch=32), and had too little training data (200 vs 500+ samples). One arxiv search fixed all of it.
- Exp 1: Fix Hyperparams: Changed LR 0.5→0.1, batch_size 1→32, n_train 200→500. Result: 99% accuracy with classic grokking (stuck at 50% for 40 epochs, then a phase transition to 99% in ~10 epochs). The hidden progress metric ||w_t - w_0||_1 grew steadily throughout, confirming Barak et al. 2022's theory.
- Exp 4: GrokFast: Tested the EMA gradient filter from Lee et al. 2024 (amplify slow gradient components to accelerate grokking). Counterproductive. With correct hyperparams, baseline SGD hits 100% in 5 epochs (22.7s). GrokFast took 12 epochs and never reached 100%. Lesson: don't apply tricks designed for one regime (extended memorization) to another (fast convergence).
- Literature review: 6 papers (Barak 2022, Kou 2024, Merrill 2023, GrokFast, NTK grokking, SLT phase transitions)
- Research plan for autonomous experiment cycle
- Results organized into per-run directories with auto-generated index
## [0.4.0] - 2026-03-03
### Added
- Complete sparse parity pipeline (all 5 phases)
- 3 training variants: standard backprop, fused, per-layer
- MemTracker for ARD measurement
- JSON + markdown + plot output
- 20/20 tests passing
- Per-run results directories with index lookup
## [0.3.0] - 2026-03-03
### Added
- MkDocs site with Material theme
- Mermaid diagrams in context and findings
- Changelog tracking
## [0.2.0] - 2026-03-03
### Added
- `src/sync_google_docs.py`: standalone script to sync Google Docs to markdown
- Auto-extracted references (`docs/references_auto.md`)
- Expanded homework archive covering all 7 meetings
## [0.1.0] - 2026-03-03
### Added
- Initial research environment setup
- Converted 3 Google Docs to local markdown:
- Challenge #1: Sparse Parity (spec)
- Sutro Group Main (meeting index)
- Yaroslav Technical Sprint 1 (sprint log)
- Extracted 30+ hyperlinks into `docs/references.md`
- Directory structure: docs, findings, plans, research, src
- `CLAUDE.md` with project context for AI assistants
- `CONTEXT.md` with project background and timeline
- Sprint 1 findings documented
- Sprint 2 plan drafted
- Meeting notes index with all external links
- Lectures and homework directories
- `.gitignore` for Python/Jupyter/IDE files