# Changelog
All notable changes to this research workspace.
## [0.29.0] - 2026-04-25
### ByteDMD floor-gap survey (PR #87 Seth + PR #88 Yad)
Answers Yaroslav's Apr 20 question: how far are current solutions from the ByteDMD floor?
| Method | ByteDMD | vs read-n floor (n=20: 70) | Geometric LB (0.3849×) | Correct |
|---|---|---|---|---|
| KM-min (1 sample/bit) | 268 | 3.8× | 103 | ✓ |
| GF(2) Gauss elim | 101,501 | 1,450× | 39,068 | ✓ |
| Fourier (Walsh, N=50) | 5,156,954 | 73,671× | 1,984,912 | ✓ |
| SGD demo (tiny) | 84,592 | 1,208× | 32,559 | ✗ (chance level) |
- Pure-Python implementations so ByteDMD tracks every read (no numpy escape hatches). KM-min and GF(2) shipped first via Seth's PR #87; Fourier + SGD-demo + the geometric LB column shipped via PR #88 stacked on top. Both squashed into `main` together.
- Geometric lower bound factor (0.3849) wired in per Yaroslav's 2026-04-23 clarification — 0.3849 × measured ByteDMD lower-bounds the actual VLSI allocation cost of the "best" oracle allocator. Requires live-byte counting (current ByteDMD post-PR #80). Proof: Gemini DeepThink, reviewed with Toranosuke Ozawa at `cybertronai/ByteDMD/.../tarjan-detailed-part1.pdf`.
- Headline finding: KM-min sits at 3.8× the floor (with oracle-paired inputs); algebraic methods span 3 orders of magnitude. Fourier's cost is dominated by O(C(n,k)) subset enumeration. SGD-demo with a converging config would be 3-4 orders higher than the chance-level number shown.
- Open question on PR #87: Q2 (oracle-query KM-min as fair reference floor even though not benchmark-submittable) — still awaiting Yaroslav.
### Nix devShell + Task 11 docs (PR #84, PR #85, Andy)
- PR #84: `pkgs.sqlite` added to the Nix devShell so the `sqlite3` CLI is one `nix develop` away. Used by the `sutro-sync` / `weekly-catchup` / `prepare-meeting` skills for `telegram.db` queries.
- PR #85: DeepSeek Engram offload observation promoted from issue #77 into durable docs at `docs/tasks/Task 11/` (task spec + reusable agent prompt + findings path), following the #73 / Muon pattern. Discoverable to non-GitHub-indexed agents (Gemini, Qwen, Kimi).
### Issue cleanup pass (9 more closed)
- Closed against PR #82 evidence (sprint work landed earlier but issues weren't auto-closed): #7, #8, #27, #30, #43, #56, #61.
- Closed in favor of Task 11 promotion: #77 (now lives at `docs/tasks/Task 11/`).
- Closed as superseded with explanatory comment: #60 (philoengineer's Telegram MTProto CLI scripts, replaced by the SQLite-backed `bin/tg-{sync,post,auth}` from v0.26.0).
### `follow-up` label introduced
Yellow #FBCA04, description "Spec posted, awaiting follow-up implementation." Applied to remaining open issues that have substantive specs but haven't been built yet: #5, #14, #54.
### Auto-generated repo diagrams (PR #91)
Both `docs/research/repo-tree.md` (D3 interactive tree) and `docs/research/repo-layout.md` (Mermaid graph) now regenerate from `docs/research/_diagrams.yaml` via `bin/regen-diagrams`. Pan/zoom HTML and JS are untouched — only the data blocks (between BEGIN_AUTOGEN / END_AUTOGEN markers) get rewritten.
CI workflow `diagram-staleness.yml` runs `bin/regen-diagrams --check` on PRs that touch the YAML or generated files; it fails the build if a regen would change anything, with the fix in the error message.
Drift caught on first regen:
- findings_count: hardcoded 38 → actual 41 (auto-counted via glob)
- task_count: hardcoded "specs 1-10 + INDEX" → actual 11
- Tree was missing GEMINI.md (added Apr 21 via PR #72)
- Tree was missing challenges/ (added Apr 20 via PR #82)
- bin/ was missing complexity-check, score-all, regen-diagrams
Auto-counts computed at regen time:
- findings_count: glob findings/exp_*.md
- experiments_jsonl_count: line count of research/log.jsonl
- task_count: glob docs/tasks/[0-9]*-*.md
Contributor workflow: edit `_diagrams.yaml` → run `bin/regen-diagrams` → commit the YAML + regenerated .md files. CI catches PRs that edit the YAML without regenerating.
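The three auto-counts above can be sketched in a few lines of Python (illustrative only — `bin/regen-diagrams` is the actual implementation; the globs are the ones listed above, and the `auto_counts` helper name is hypothetical):

```python
from pathlib import Path

def auto_counts(root: str = ".") -> dict:
    """Recompute the diagram data counts at regen time instead of hardcoding them.

    Mirrors the three globs listed above; `root` is the repo checkout.
    """
    r = Path(root)
    log = r / "research" / "log.jsonl"
    return {
        # one findings file per experiment
        "findings_count": len(list(r.glob("findings/exp_*.md"))),
        # one JSONL line per logged experiment
        "experiments_jsonl_count": sum(1 for _ in log.open()) if log.exists() else 0,
        # numbered task specs only (INDEX etc. excluded by the pattern)
        "task_count": len(list(r.glob("docs/tasks/[0-9]*-*.md"))),
    }
```

Counting at regen time is what catches the drift listed above: a hardcoded 38 can go stale, a glob cannot.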
### Documentation hygiene
- Broken mkdocs link fixed in `docs/tasks/009-muon-review.md` — `../../research/search_space.yaml` (lives outside the docs tree) replaced with a GitHub URL. `mkdocs build --strict` now passes clean (only the unrelated MkDocs 2.0 framework deprecation banner remains).
- Google Docs sync committed (PR #89) — 13 doc files refreshed (+937 lines across meeting notes, knowledge sprints, bigger-picture); `docs/references_auto.md` link harvest +184 entries; `docs/superpowers/plans/2026-03-16-egd-sparse-parity.md` (a March plan that was untracked); root `package-lock.json` checked in next to its `package.json`.
### Pull requests landed
| PR | Title | Author |
|---|---|---|
| #84 | Add sqlite to devShell for telegram.db queries | @zh4ngx |
| #85 | docs: Task 11 — DeepSeek Engram offload ByteDMD verification | @zh4ngx |
| #87 | exp: ByteDMD floor-gap survey — KM-min 268 vs GF(2) 101,501 | @SethTS |
| #88 | exp: extend floor-gap survey with Fourier + SGD-demo + geometric LB | @0bserver07 |
| #89 | docs(sync): Google Docs refresh + untracked plans/lockfile cleanup | @0bserver07 |
| #91 | diagrams: single source of truth + bin/regen-diagrams + CI staleness check | @0bserver07 |
## [0.28.0] - 2026-04-20
### Issue sprint and repo hygiene
- 14 open issues closed against repo evidence in a single triage pass: #4, #6, #21, #24, #25, #26, #28, #29, #31, #58, #64, #67, #69, #71. Each close comment links to the file, PR, or finding that resolved it.
- Branch protection enabled on `main`: 1 required approval, admin override, no force-push, no deletion. Closes #71.
- 5 stale branches deleted after merge or intentional close: `YAD/adopt-bytedmd`, `yaroslav/tracked-numpy`, `v-alpha`, `YAD/telegram-sqlite`, `fix/claude-md-people-handles`. Closes #69.
### Parallel-agent sprint (PR #82, Yad)
- `bin/complexity-check`: repo size + file-count snapshot, appends JSONL to `.complexity-log.jsonl`, flags >10% Python-line growth. Closes #8.
- `bin/score-all`: discovers `solve_*.py` submissions, runs each under `bytedmd()`, writes a sorted TSV to `results/scoreboard.tsv`. Closes #61 (the GF(2) undercount piece was moot post-ByteDMD).
- LAB.md two-phase results section: evidence bundle in `results/<exp_id>/`, narrative in `findings/<exp_id>.md`. Closes #27.
- Onboarding pass: first-experiment walkthrough in `docs/getting-started.md`, skills table, `--dangerously-skip-permissions` explanation, ByteDMD note, `AGENT.md` redirect to `LAB.md` for humans. Closes #56.
- Skills scaffolding: `examples/` subdirs added to the 4 skills missing them, `references/` subdirs on all 6 skills pointing at canonical docs (`LAB.md`, `AGENT.md`, `DISCOVERIES.md`, `sync-runbook.md`, `telegram-setup.md`, `bytedmd.md`). Closes #30.
- Pip entry points: `[project.optional-dependencies]` (`eval`, `modal`, `dev`, `all`) and `[project.entry-points."gymnasium.envs"]` wired through `sparse_parity.eval:register_all`. Closes #43.
- Three new challenges: `majority-vote`, `threshold`, `noisy-parity` in `src/sparse_parity/challenges/` using the existing registry pattern. LAB.md rule #9 respected (`src/harness.py` untouched). `measure_*` signatures reject unknown kwargs (no silent `**kwargs` typo swallowing). Closes #7.
### Repo diagrams with pan/zoom (PR #83)
- `docs/research/repo-layout.md`: Mermaid graph of the workspace layers (Docs / Code / Ops / Artifacts / Site) with `svg-pan-zoom` controls (+ − ⤾ buttons, scroll to zoom, drag to pan).
- `docs/research/repo-tree.md`: interactive D3.js click-to-expand tree with native `d3-zoom` and the same control scheme.
- `docs/diagrams/`: dropped stale v1 agent-workflow files, promoted v2 to canonical.
### Agent compatibility (PR #72, Andy)
- `GEMINI.md` added for Gemini CLI / Antigravity agents, matching the existing `CLAUDE.md` / `CODEX.md` / `AGENTS.md` set. Closes #13.
- `AGENTS.md` context reference fix (PR #74).
### NoProp + Forward-Forward follow-up (PR #79, Seth)
- Forward-Forward baseline added alongside the NoProp experiment. Denoising beats contrastive in this setting.
- Both experiments flagged as measured under legacy MemTracker, not ByteDMD — pending re-run once metrics stabilise.
### ASI-Evolve research reports (PR #81)
- Three parallel-agent reports mapping the ASI-Evolve evolutionary framework onto SutroYaro's ByteDMD-optimized learning rule search.
- Agent prompts at `docs/agent-prompts/asi-evolve/{algorithms-claude, memory-kimi, execution-*}.md`.
### Muon review prompt (PR #73)
- Reusable review prompt for Muon-related contributions.
- Muon findings moved from `findings/` to the canonical `docs/findings/` path.
### Dev-shell and environment
- `.envrc` restored for direnv auto-loading of the Nix devShell.
- `nodejs` added to the devShell so Claude Code hooks can run.
### Specs posted as follow-up comments
- #77 — ByteDMD microbenchmark spec for DeepSeek Engram's <3% SSD offload claim.
- #9 — Modal nanoGPT energy baseline spec: `bin/gpu_energy_nanogpt.py` + L4 GPU + nvidia-smi integration.
- #14 — Post-task Telegram notify hook spec: `.claude/hooks/post-task-notify.cjs` on the `Stop` event, triple-gated.
- #54 — Pikiclaw evaluation criteria before building a custom PR-review pipeline from scratch.
- #5 — Flagged the `--challenge` flag gap in `bin/run-agent` for the sparse-sum and sparse-and challenges.
## [0.27.0] - 2026-04-14
### ByteDMD adopted as primary metric
- Vendored cybertronai/ByteDMD at `src/bytedmd/`
- 13 tests (10 core + 3 gotchas) all pass
- TrackedArray retained as legacy tracker for existing experiments (30 tests still pass)
- New docs page: `docs/research/bytedmd.md`
- Decision made by Yaroslav after meetings with Wesley Smith and feedback from Bill Dally. Byte granularity rewards smaller dtypes; pure Python eliminates numpy escape hatches.
- Existing challenge submissions stay on legacy DMC. New submissions use ByteDMD.
## [0.26.0] - 2026-03-28
Largest release: auto-instrumented DMD tracking, RL eval environment, agent infrastructure, Telegram SQLite sync, and documentation fixes. 84 commits from 4 contributors.
### Auto-instrumented DMD tracking (Yaroslav)
- TrackedArray: numpy ndarray wrapper that auto-tracks all operations without manual instrumentation. Wrap inputs, run unmodified code, read DMD.
- LRUStackTracker: true per-element LRU stack distances matching Ding et al. Definition 2.1. Writes place data on the stack (free). Only reads cost DMD = sqrt(stack_distance). No cold misses -- inputs arrive pre-loaded.
- GF(2) under-counting fixed: harness reported DMC 8,607 but actual DMD with all row operations tracked is ~203K. The leaderboard was wrong by 24x.
- Verified against known examples: paper example (abbbca, dist=3) and exact (a+b)+a prediction (DMD = 5.146).
- 30 tests organized by concern: wrapper mechanics, indexing, numpy functions, LRU metric, GF(2) integration.
- Docs: `research/tracked-numpy.md` with worked examples and a per-operation DMD breakdown.
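The stack-distance rule above can be illustrated with a minimal sketch (not the vendored LRUStackTracker — just element-level LRU, with cold misses treated as free per the "inputs arrive pre-loaded" convention):

```python
import math

def lru_stack_distances(trace):
    """LRU stack distance per access: the number of distinct elements touched
    since the last access to the same element (inclusive); None = cold miss."""
    stack = []   # most recently used at the end
    dists = []
    for x in trace:
        if x in stack:
            dists.append(len(stack) - stack.index(x))
            stack.remove(x)          # move to top of the LRU stack
        else:
            dists.append(None)       # cold miss: free under this metric
        stack.append(x)
    return dists

def dmd(trace):
    """DMD = sum of sqrt(stack_distance) over reuse accesses only."""
    return sum(math.sqrt(d) for d in lru_stack_distances(trace) if d is not None)

# Paper example from above: in "abbbca" the final 'a' has stack distance 3.
print(lru_stack_distances("abbbca"))  # [None, None, 1, 1, None, 3]
```

The same machinery, applied per array element with writes placed on the stack for free, is what TrackedArray wires into unmodified numpy code.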
### RL evaluation environment (Yad, PR #49)
- Gymnasium environments: `SparseParity-v0` (single challenge) and `MultiChallenge-v0` (all three challenges). The agent picks from 16 methods and gets scored on research quality.
- 12-category grading rubric (72 points): checks whether the agent found GF(2), noticed the ARD/DMC disagreement, observed local learning failing, explored broadly, and found the algebraic solver efficiently.
- 16 methods all runnable: 5 via harness, 9 with live fallback, 2 cached.
- Platform adapters: Anthropic tool-use, PrimeIntellect verifiers, HuggingFace Spaces leaderboard, UK AISI Inspect.
- Registry system: add methods and challenges without editing environment code.
- Demo script, system overview page, eval docs.
### Agent infrastructure (Yad, PR #50)
- 3 hooks: session-start (shows project status), security-guard (blocks edits to measurement code), session-end (session summary).
- 2 rules: experiment reproducibility (seeds, config, environment logging), agent coordination (file ownership, parallel dispatch criteria, post-merge guidance).
- 4 skills: run-experiment (two-phase protocol), weekly-catchup, prepare-meeting, info-defrag.
- LAB.md rules #10 (two-phase output) and #11 (reproducibility) added.
- Docs: `docs/research/agent-infrastructure.md`, workflow diagram.
### Telegram integration (Yad, Issue #58)
- `bin/tg-sync`: incremental Telegram-to-SQLite sync. First run: full backfill. Subsequent runs: only new messages.
- `bin/tg-post`: post to forum topics via the Bot API using per-person bot tokens. Safety guard: disabled by default, requires `TELEGRAM_POST_ENABLED=1`.
- `bin/tg-auth`: standalone interactive MTProto login (replaces the `tg auth login` dependency).
- SQLite schema: `messages(id, topic_id, date, sender, text, reply_to)`. Setup guide: `docs/tooling/telegram-setup.md`.
### Documentation fixes (Yad, Issue #55)
- Experiment counts updated from 33/34 to 36 across 8 files.
- Seth Stafford's bio updated with GrokFast PRs.
- Timeline extended through March 24, system overview and catchup index updated.
## [0.25.0] - 2026-03-24
### GrokFast + Curriculum scaling frontier (PR #53, SethTS)
Maps how far GrokFast + Curriculum scales. k=3 scales effortlessly: n=200 solves in 11 epochs / 95ms, each expansion phase takes 1 epoch. k=5 hits a wall between n=50 and n=100 (60% solve rate, stalls at 94% after expansion). n=200/k=5 fails completely. The 5-way interaction detector learned at small n is too fragile to survive 50+ new noise dimensions.
- Findings: `findings/exp_grokfast_curriculum_scale.md`
## [0.24.0] - 2026-03-23
### GrokFast + Curriculum compounding (PR #52, SethTS)
GrokFast and curriculum attack different axes (k-th order plateau vs n-scaling wall). Combined, they multiply:
- n=20/k=5: 5.8x speedup over SGD (12 epochs, 57ms)
- n=50/k=3: 8.3x speedup (7 epochs, 35ms)
- n=50/k=5: solves in 14 epochs / 77ms where SGD fails completely (0% at 1000 epochs)
Curriculum shields GrokFast from its noise-dimension weakness by keeping n small during the critical learning phase. 60 runs, 5 seeds each, 100% solve rate on all combined configurations.
- Findings: `findings/exp_grokfast_curriculum.md`
## [0.23.0] - 2026-03-23
### GrokFast v2 experiment (PR #51, SethTS)
First external contribution. Seth tested GrokFast across 3 difficulty regimes (75 total runs, 5 seeds each):
- WIN on n=20/k=5: aggressive GrokFast (a=0.98, l=2.0) gives 2.5x fewer epochs (29 vs 73) and 2.3x faster wall time than SGD. The EMA accumulates the exponentially weak k-th order gradient signal.
- LOSS on n=30/k=3: 40% solve rate. More noise dimensions means the EMA amplifies noise.
- NEUTRAL on n=20/k=3: mild settings match SGD. Confirms exp4 finding that GrokFast is counterproductive when hyperparams are already correct.
The critical variable is interaction order (k), not input dimension (n). Mild GrokFast (a=0.95, l=1.0) was never worse than SGD across any regime.
- Findings: `findings/exp_grokfast_v2.md`
- Experiment: `src/sparse_parity/experiments/exp_grokfast_v2.py`
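For reference, the GrokFast filter being swept here is only a few lines. A sketch of the Lee et al. 2024 EMA rule (not the repo's implementation; `alpha`/`lamb` correspond to the a/l settings above):

```python
import numpy as np

def grokfast_step(w, grad, ema, lr=0.1, alpha=0.98, lamb=2.0):
    """One SGD step with the GrokFast EMA filter: track the slow
    (low-frequency) gradient component in an EMA and amplify it."""
    ema = alpha * ema + (1 - alpha) * grad   # accumulate the slow component
    filtered = grad + lamb * ema             # boost the slow direction
    return w - lr * filtered, ema
```

The WIN/LOSS pattern above falls out of this directly: when the k-th order signal is the slow component, the EMA accumulates it; when extra noise dimensions dominate, the same EMA amplifies noise instead.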
## [0.22.0] - 2026-03-22
### DMC baseline sweep, optimization, and infrastructure (Issues #15-#22)
Headline: DMC rankings disagree with ARD. New best method found.
- DMC baseline sweep (#17): Measured all 5 methods across 3 challenges (sparse-parity, sparse-sum, sparse-and). 14 total runs. GF2 wins DMC on parity (8,607) despite KM winning ARD (92). Fourier's DMC is 78 billion (9M times worse than GF2).
- DMC optimization (#22): KM-min achieves DMC of 3,578 -- 58% lower than GF2 baseline. Single influence sample per bit suffices because parity influence is deterministic (exactly 0 or 1). Also discovered GF2's harness-measured DMC is artificially low (true DMC with fine-grained tracking: 189,056).
- fast.py tracker integration (#15): Added an optional `tracker` parameter to fast.py. Zero overhead when disabled. Reports ARD 7,210 / DMC 850,131 for the default config.
- Scoreboard backfill (#16): Filled DMC values for 21 of 35 scoreboard rows. 12 rows marked `needs_measurement`. Added a `dmc_source` column to distinguish measured vs estimated values.
- DMC visualization (#18): Created `src/plot_dmc.py` with 3 plots: DMC-vs-ARD scatter, parity ranking bar chart, cross-challenge comparison. Output in `results/plots/`.
- Weekly catch-up section: Added `docs/catchups/` with a first entry (Mar 16-22). Covers Meeting #9 outcomes, Telegram activity, infrastructure inventory.
- Meeting #9 notes synced: Added internal notes doc to the Google Docs sync pipeline.
- Public Domain license (#20): Added Unlicense to repo root.
- 8 new GitHub issues created: #15-#22 covering DMC infrastructure, optimization, RL env prototype, license, and Mar 30 meeting prep.
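The KM-min trick above (one influence sample per bit) can be sketched as a bit-flip probe (illustrative only — `probe_secret_bits` is a hypothetical name, not the benchmark submission, and no movement tracking is shown):

```python
import numpy as np

def probe_secret_bits(X, label_fn, rng=None):
    """One influence sample per bit: flip bit i of a stored input and check
    whether the label flips. For parity, influence is deterministic (exactly
    0 or 1), so a single probe per bit identifies the secret set."""
    rng = np.random.default_rng(rng)
    secret = []
    for i in range(X.shape[1]):
        row = X[rng.integers(len(X))].copy()
        before = label_fn(row)
        row[i] ^= 1                      # flip bit i
        if label_fn(row) != before:      # label flipped -> bit i is secret
            secret.append(i)
    return secret
```

For noisy or non-parity targets the influence is probabilistic and multiple samples per bit are needed, which is exactly the 5-sample KM variant in the table below.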
### Results
| Method | ARD | DMC | Rank shift |
|---|---|---|---|
| KM-min (new) | 20 | 3,578 | -- (new best) |
| GF2 | 420 | 8,607 | ARD #2, DMC #1 (was) |
| KM (5 samples) | 92 | 20,633 | ARD #1, DMC #2 |
| SGD | 8,504 | 1,278,460 | Same |
| Fourier | 11,980,500 | 78,140,662,852 | Same |
- Findings: `findings/exp_dmc_optimize.md`, `results/dmc_baseline_sweep.md`
- Plots: `results/plots/dmc_vs_ard.png`, `dmc_ranking_parity.png`, `dmc_cross_challenge.png`
## [0.21.0] - 2026-03-16
### Egalitarian Gradient Descent experiment (Issue #4)
- Implemented EGD (arXiv:2510.04930): replaces gradient singular values with 1 via SVD, equalizing learning rates across all directions.
- CPU experiment (`exp_egd.py`): EGD halves the grokking plateau. 14 epochs to 90% vs SGD's 33 (2.6x fewer). Solves in 21 epochs vs 40. Both at lr=0.1.
- GPU experiment (`gpu_egd.py` via Modal L4): EGD is 12% slower in wall time despite 2x fewer epochs. The SVD overhead per batch outweighs the epoch savings.
- Sparse sum comparison: SGD diverges at lr=0.1 (0/5 seeds); EGD solves 5/5. SVD normalization removes gradient magnitude, preventing scale-related divergence.
- Sub-10ms not achievable. The small hidden layer (50) is capacity-limited for both methods.
- Findings: `findings/exp_egd.md`
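The core EGD step is tiny. A numpy sketch of the singular-value flattening described above (not the `exp_egd.py` code):

```python
import numpy as np

def egd_direction(grad):
    """EGD update direction: SVD the gradient matrix and replace every
    singular value with 1, equalizing the learning rate across directions."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return U @ Vt  # same singular vectors, unit singular values
```

Each step then applies something like `W -= lr * egd_direction(grad_W)` per weight matrix; the per-batch SVD here is also why EGD loses on GPU wall time despite fewer epochs.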
## [0.20.0] - 2026-03-15
### SGD speed sweep and research hypotheses (Issue #4)
- Swept SGD configs: standard SGD floors at ~70-116ms (7 grokking epochs). Can't hit 10ms.
- Tested L-BFGS: 35-60ms. Faster but still needs many function evaluations.
- Tested Sign SGD: best single run 7.6ms (h=50, n=500, lr=0.1) but only 3/5 seeds solve. With n=1000 all 5 seeds solve at mean 29ms.
- Added 8 research hypotheses to TODO.md with paper references: EGD, Grokfast (corrected), GrokTransfer, warm start from GF(2), lottery ticket, higher weight decay, curriculum+EGD, L-BFGS.
- Deleted `findings/gpu_energy_baseline.md` (contained inflated results from an earlier pynvml run)
- Cleaned up homepage and tooling page references
- Findings: `findings/exp_sgd_speed.md`
## [0.19.0] - 2026-03-14
### GPU measurement via Modal Labs (Issue #6)
- Added `bin/gpu_energy.py`: runs GF(2), SGD, KM on an NVIDIA L4 via Modal using PyTorch CUDA (matching Yaroslav's gpu_toy.py approach)
- Finding (5 runs): GPU is slower than CPU for all methods at n=20/k=3. SGD mean 1446ms GPU vs 142ms CPU (10x slower). KM mean 869ms vs 1.1ms (790x slower). GF(2) 2.0ms vs 0.5ms (4x slower). The tensors are too small for the CUDA overhead to amortize.
- The ARD vs DMC proxy comparison remains unanswered. Needs nanoGPT-scale workloads.
- Added `findings/_experiment_template.md` with required sections (Question, What was performed, What was produced, Can it be reproduced, Finding)
- Added integrity rules to AGENT.md (don't inflate results, classify honestly)
- Added scripts table to tooling overview page
- Updated README project structure
- Findings: `findings/exp_proxy_comparison.md`
## [0.18.0] - 2026-03-14
### Reproduce-all script
- Added `bin/reproduce-all`: runs all 14 experiments across 3 challenges, verifies results match baselines
- Supports a `--budget MS` flag to skip experiments over a time budget (Yaroslav's "Spark 7 constraint": only run what fits in 1980s compute budgets)
- All 14 experiments reproduce in 0.28 seconds. With `--budget 10` (10ms): 6 pass, 8 skip, 0.08 seconds total
- GF(2) on sum/and marked SKIP (expected fail, not a regression)
## [0.17.0] - 2026-03-14
### Three challenges, adding-a-challenge guide, Antigravity validation
- Added sparse sum (Challenge 2) and sparse AND (Challenge 3) to the harness
- Created `docs/research/adding-a-challenge.md`: step-by-step guide for agents and contributors to add new tasks
- Sparse AND was added by Google Antigravity (agent IDE) following the guide without human help, validating that the guide works for autonomous agents
- Harness now supports a `--challenge` flag: `sparse-parity` (default), `sparse-sum`, `sparse-and`
- Sparse sum baselines: SGD 100% in 1 epoch (ARD 20), KM 100% (ARD 92)
- Sparse AND baselines: SGD 100% in 4 epochs, KM needs 20 samples (not 5) due to 1/2^(k-1) influence signal
- Updated search_space.yaml, questions.yaml, DISCOVERIES.md, TODO.md, baseline_check.py for both challenges
## [0.16.0] - 2026-03-14
### Repo migration, multi-topic Telegram sync, and skills
- Moved repo from `0bserver07/SutroYaro` to `cybertronai/SutroYaro` (updated 15 files, git remote, GitHub Pages URL)
- Telegram sync now pulls 6 topics in priority order (chat-yad, chat-yaroslav, challenge #1, General, In-person, Introductions) to separate JSON files
- Added `sutro-sync` skill: session-start routine for Telegram, Google Docs, GitHub checks
- Added `sutro-context` skill: research context loader (DISCOVERIES.md, open questions, recent discussion)
- Re-synced all 15 Google Docs with upstream changes
- Both of Andy's PRs merged (#2 TODO cleanup, #3 GF(2) noise experiment), Issue #1 closed
- Deploy workflow confirmed working on new org
## [0.15.0] - 2026-03-11
### Peer research protocol and autonomous agent infrastructure
Inspired by analysis of Karpathy's autoresearch and trevin-creator's Tiny-Lab, but built for multi-researcher use with our own naming conventions.
New files:
- AGENT.md -- machine-executable experiment loop for autonomous sessions
- src/harness.py -- locked evaluation harness (GF2, SGD, KM, Fourier, SMT) with CLI
- research/search_space.yaml -- bounded mutation space (16 methods, allowed parameter values)
- research/questions.yaml -- dependency graph of 12 research questions (9 resolved, 6 open)
- research/log.jsonl -- all 33 experiments from DISCOVERIES.md in machine-readable format
- results/scoreboard.tsv -- auto-generated leaderboard from log.jsonl
- results/progress.png -- ARD progress chart over experiment history
- checks/env_check.py -- pre-flight environment verification
- checks/baseline_check.py -- re-establish GF2/SGD/KM baselines per machine
- bin/run-agent -- tool-agnostic launcher with looped mode, circuit breaker, PID lock
- bin/merge-findings -- merge contributor log.jsonl entries via PR
- bin/analyze-log -- progress report and chart generation
- docs/research/peer-research-protocol.md -- full design doc with nanoGPT migration proposal
Design choices:
- Tool-agnostic: bin/run-agent --tool claude|gemini|custom works with any AI CLI
- Looped mode: multiple short cycles with fresh context, resilient to crashes
- Circuit breaker: halts if 5+ INVALID in last 20 experiments
- Harness integrity: SHA256 verified before and after each run
- All state in files: any tool that reads/writes files can participate
- Challenge-agnostic log schema: challenge field supports sparse parity now, nanoGPT later
- Researcher attribution: researcher field in log entries for peer merge
Updated files:
- CLAUDE.md -- added autonomous research infrastructure section, harness to isolation rule
- results/scoreboard.tsv -- generated from full 33-experiment log
## [0.14.0] - 2026-03-11
### Feedback tasks from Meeting #8
- Added Data Movement Complexity (DMC) metric to MemTracker (Ding et al., arXiv:2312.14441). DMC = sum of sqrt(stack_distance) for all float accesses. Baseline: ARD 4,104 / DMC 300,298.
- Confirmed stack distance already implemented (tracker clock advances by buffer size, not instruction count)
- Added metric isolation rule to LAB.md (rule #9): agents cannot modify measurement code
- Created task tracker in `docs/tasks/` with 6 tasks from Meeting #8 feedback
- Updated baselines table in LAB.md with DMC column
- Reproduced Germain's hidden=64 result: ARD drops 68% but ARD/float is identical (0.36). Smaller model, not better locality.
- Reviewed linear classifier paper (arXiv:2309.06979): CoT-based, not applicable to one-shot sparse parity benchmark
## [0.13.0] - 2026-03-10
### Meeting #8 docs and sync runbook
- Synced 6 new Google Docs from Meeting #8 (09 Mar 2026): notes, AI notes, Yaroslav knowledge sprint 2, Yaroslav GF(2) verification, Michael's Claude approach, The Bigger Picture roadmap
- Added all new docs to MkDocs nav with cross-reference headers
- Updated meetings index and notes pages with Meeting #8 summary
- Created sync runbook (`docs/tooling/sync-runbook.md`) with weekly/daily/per-session checklists
- Added GitHub PR/issue checking to the CLAUDE.md "Before Pushing" checklist
- Total synced Google Docs: 15 (up from 9)
## [0.12.0] - 2026-03-09
### GF(2) noise tolerance experiment
- Added exp_gf2_noise experiment testing algebraic solver with label noise
- Key finding: Basic GF(2) fails at 1% noise; robust subset-sampling recovers up to 10-15%
- New experiment: `src/sparse_parity/experiments/exp_gf2_noise.py`
- Findings: `findings/exp_gf2_noise.md`
- Updated DISCOVERIES.md with noise tolerance results
## [0.11.0] - 2026-03-07
### Homepage and documentation refresh
- Rewrote homepage as a proper introduction (was jumping straight to "SOLVED")
- Updated context page with Phase 2 findings and timeline
- Synced 8 Google Docs with upstream changes (new Bookmarks section, Yaroslav link)
- Added "How to find things in Sutro Group" doc to sync config and nav
- Fixed sync script to preserve cross-reference headers across re-syncs
## [0.10.0] - 2026-03-07
### Review fixes and project documentation
- Fixed RL bandit tracker bug (stale loop variable `i` instead of `arm_idx`)
- Re-sorted TL;DR table by verdict tier then speed, ranked Target Propagation as #33
- Updated CLAUDE.md with Phase 2 results, best methods table, Telegram sync reference
- Added AGENTS.md documenting how 17 parallel Claude Code agents were used
- Added `.env.example` for Telegram API credentials
- Gitignored `messages.json` (contains real group messages)
## [0.9.0] - 2026-03-07
### Phase 2: 17 experiments + Practitioner's Field Guide
Phase 2 experiments dispatched 17 independent Claude Code agents in parallel, each testing a different algorithmic approach:
- Algebraic/Exact: GF(2) Gaussian elimination (509 µs, 240x faster than SGD), Kushilevitz-Mansour influence estimation (ARD 1,585, 724x better than Fourier), SMT backtracking
- Information-Theoretic: Mutual Information, LASSO, MDL Compression, Random Projections -- all solve it, none beats Fourier meaningfully
- Local Learning Rules: Hebbian, Predictive Coding, Equilibrium Propagation, Target Propagation -- all failed at chance level (parity requires k-th order interaction detection)
- Hardware-Aware: Tiled W1 (software ARD worsened), Pebble Game (2.2% energy savings), Binary Weights (fails at n=20)
- Alternative Framings: Genetic Programming (exact formula but doesn't scale), RL Bit Querying (ARD of 1 at inference), Decision Trees (greedy splitting can't find secret bits)
Practitioner's Field Guide (`docs/research/survey.md`): 4,500-word survey ranking all 33 experiments with a decision framework, 10 generalized principles, and the full AI research methodology.
Telegram sync tooling: `sync_telegram.ts` pulls messages from Sutro Group topic threads via MTProto. Full setup guide in `docs/tooling/automation.md`.
Updated DISCOVERIES.md, mkdocs nav (survey + 17 findings pages), and research index.
## [0.8.0] - 2026-03-04
### Blank-slate approaches (Round 3, no neural nets, no SGD)
Rounds 1-2 were incremental variations on the same MLP+SGD recipe. Round 3 reframes the problem: sparse parity as a search problem, not a learning problem.
For k=3 parity with secret indices {a,b,c}, the label is x[a] * x[b] * x[c]. You don't need a neural network. You need to search over C(n,k) possible subsets and test which one matches the data.
- Fourier / Walsh-Hadamard solver: Sparse parity IS a Fourier coefficient. For each candidate k-subset S, compute `mean(y * product(x[:, S]))`. The true secret gives correlation ~1.0, everything else gives ~0. No training, no gradients, no iterations. Result: 13x faster than SGD (0.009s vs 0.12s for n=20/k=3), solves n=50/k=3 and n=20/k=5 trivially where SGD struggles. Only needs 20 samples. Scales to k=7 before combinatorial explosion (C(n,k) subsets).
- Evolutionary / random search: Randomly sample k-subsets, test if `product(x[:, subset])` matches all labels. For n=20/k=3 it takes ~881 random tries (0.011s). Evolutionary search with mutation+crossover solves it in fewer evaluations but more wall time. Solves n=50/k=3 in 0.14s, a config SGD fails on entirely.
- Feature selection: Tried to decompose into "find the bits" then "classify." Exhaustive combo search works (178-1203x fewer ops than SGD). But the clever approaches fail: pairwise correlations are provably zero for parity (E[y * x_i * x_j] = 0 for ALL pairs, even correct ones). Greedy forward selection also fails. Parity has zero low-order statistical signatures. You must test the full k-way interaction. Neural nets need so many iterations because they implicitly search the combinatorial space via gradient descent.
When to use what:
- k ≤ 7: Fourier/random search wins (milliseconds, exact, guaranteed)
- k ≥ 10: C(n,k) explodes, SGD's implicit search via gradients becomes the only feasible path
- k = 8-9: hybrid approaches may work (combinatorial search with pruning)
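The Fourier solver above fits in a few lines. A sketch assuming ±1-encoded inputs and labels, matching the `mean(y * product(x[:, S]))` correlation test (the `fourier_solve` name is hypothetical):

```python
import itertools
import numpy as np

def fourier_solve(X, y, k):
    """Walsh/Fourier solver: the secret k-subset S is the one whose parity
    character correlates with the labels (~1.0); all others sit near 0."""
    n = X.shape[1]
    return max(itertools.combinations(range(n), k),
               key=lambda S: np.mean(y * np.prod(X[:, S], axis=1)))

# Tiny demo: n=10, k=3, secret {1, 4, 7}, 40 samples of +/-1 bits.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(40, 10))
y = X[:, 1] * X[:, 4] * X[:, 7]
print(fourier_solve(X, y, 3))  # recovers (1, 4, 7)
```

The `itertools.combinations` loop is also where the k ≥ 10 wall comes from: the subset enumeration itself is C(n,k).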
## [0.7.0] - 2026-03-04
### Research experiments (Round 2, autonomous agent team)
Round 2 explored variations on the working SGD solution: different optimizers, hyperparameter sweeps, and energy measurement improvements.
- Sign SGD: Replace the gradient with its sign: `W -= lr * sign(grad)`. Normalizes gradient magnitudes, helping detect sparse features. Solves k=5 2x faster than standard SGD (7 vs 14 epochs). Standard SGD also solves k=5 with enough data (n_train=5000). The exp_d "failure" at 61.5% was insufficient data, not algorithm limits.
- Curriculum learning: Train on easy configs first, then expand the network. n=10/k=3 (instant) → expand W1 with zero-padded columns → n=30/k=3 (1 epoch) → n=50/k=3 (1 epoch). Result: 14.6x speedup, cracks n=50 which direct training can't. The learned feature detector transfers because new columns start near zero.
- Cache-aware MemTracker: Built CacheTracker with LRU cache simulation to fix the broken ARD metric. Finding: L2 cache (256KB) eliminates ALL misses for both methods. Batch wins on total traffic (13% fewer floats, 16x fewer writes), not cache locality. Single-sample is more L1-friendly than batch.
- Weight decay sweep: Swept WD across [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0]. WD=0.01 already optimal. Only [0.01, 0.05] achieves 100% success rate. The effective LR*WD must be in [0.001, 0.005], an extremely narrow range.
- Per-layer + batch: Combined per-layer forward-backward with mini-batch SGD. Converges but not useful: single-sample SGD is 8x faster in epochs, and the per-layer re-forward pass adds 3.7x wall-time overhead.
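The curriculum expansion trick above (zero-padded W1 columns) is a one-liner per layer. A sketch assuming W1 has shape (hidden, n) with one column per input bit (the `expand_inputs` name is hypothetical):

```python
import numpy as np

def expand_inputs(W1, new_n):
    """Grow the input dimension from n to new_n by zero-padding W1's columns.
    New input dimensions start with zero weight, so the detector learned at
    small n survives the expansion."""
    hidden, n = W1.shape
    W1_big = np.zeros((hidden, new_n), dtype=W1.dtype)
    W1_big[:, :n] = W1   # copy the trained detector; new columns stay at 0
    return W1_big
```

Because the new columns contribute nothing to the forward pass initially, accuracy is preserved at the moment of expansion and only the noise dimensions need to be learned away.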
## [0.6.0] - 2026-03-04
### Research experiments (Round 1, measuring energy on the working solution)
With 20-bit solved, round 1 asked: how energy-efficient is our solution, and can we improve it?
- Exp A: ARD on winning config: Measured energy proxy (ARD) on the working config. Per-layer forward-backward gives 3.8% improvement (17,299 vs 17,976 floats). W1 (the big weight matrix) dominates at 75% of all float reads, capping operation-reordering improvements at ~10%.
- Exp B: Batch ARD: Batch-32 shows 17x higher ARD in our metric, but does 16x fewer parameter writes. The ARD metric doesn't model cache. On real hardware where W1 fits in L2, batch would win. This exposed a limitation of our measurement tool.
- Exp C: Per-layer on 20-bit: Update each layer before proceeding to the next. Converges identically to standard backprop (99.5%, same epoch count). Free 3.8% ARD improvement with zero accuracy cost.
- Exp D: Scaling frontier: Mapped where standard SGD breaks. k=3 works to n~30-45. n=50/k=3 fails (54%). k=5 is impractical at any n (~200,000 epochs for n=20). The boundary is at n^k > 100,000 iterations, matching the theoretical SQ lower bound.
- Exp E: Forward-Forward: Hinton's local learning algorithm (no backward pass). Solves 3-bit but fails 20-bit (58.5%). Has 25x WORSE ARD than backprop, opposite of hypothesis. The locality advantage requires 10+ layer networks; our 2-layer MLP is too small to benefit.
- Exp F: Prompting strategies: Documented the meta-workflow: literature search → compare against published baselines → diagnose the gap → fix → verify. The most effective prompt was asking Claude to compare our hyperparams against Barak et al. 2022.
### Speed
`fast.py`: numpy-accelerated training solves 20-bit in 0.12s average (220x faster than pure Python). hidden=200, n_train=1000, batch=32.
### Infrastructure
- LAB.md: protocol for autonomous Claude Code experiment sessions
- DISCOVERIES.md: accumulated knowledge base from all experiments
- `_template.py`: copy-and-modify experiment starter
## [0.5.0] - 2026-03-04
### Solving 20-bit sparse parity
The pipeline from v0.4.0 got 54% accuracy on 20-bit, a coin flip. Literature review of 6 papers diagnosed the gap.
The problem was hyperparameters, not the algorithm. Our LR=0.5 was 5x too high (literature uses 0.1), we used single-sample SGD instead of mini-batch (batch=32), and had too little training data (200 vs 500+ samples). One arxiv search fixed all of it.
- Exp 1: Fix Hyperparams: Changed LR 0.5→0.1, batch_size 1→32, n_train 200→500. Result: 99% accuracy with classic grokking (stuck at 50% for 40 epochs, then a phase transition to 99% in ~10 epochs). The hidden progress metric ||w_t - w_0||_1 grew steadily throughout, confirming Barak et al. 2022's theory.
- Exp 4: GrokFast: Tested the EMA gradient filter from Lee et al. 2024 (amplify slow gradient components to accelerate grokking). Counterproductive. With correct hyperparams, baseline SGD hits 100% in 5 epochs (22.7s). GrokFast took 12 epochs and never reached 100%. Lesson: don't apply tricks designed for one regime (extended memorization) to another (fast convergence).
- Literature review: 6 papers (Barak 2022, Kou 2024, Merrill 2023, GrokFast, NTK grokking, SLT phase transitions)
- Research plan for autonomous experiment cycle
- Results organized into per-run directories with auto-generated index
## [0.4.0] - 2026-03-03
### Added
- Complete sparse parity pipeline (all 5 phases)
- 3 training variants: standard backprop, fused, per-layer
- MemTracker for ARD measurement
- JSON + markdown + plot output
- 20/20 tests passing
- Per-run results directories with index lookup
## [0.3.0] - 2026-03-03
### Added
- MkDocs site with Material theme
- Mermaid diagrams in context and findings
- Changelog tracking
## [0.2.0] - 2026-03-03
### Added
- `src/sync_google_docs.py`: standalone script to sync Google Docs to markdown
- Auto-extracted references (`docs/references_auto.md`)
- Expanded homework archive covering all 7 meetings
## [0.1.0] - 2026-03-03
### Added
- Initial research environment setup
- Converted 3 Google Docs to local markdown:
- Challenge #1: Sparse Parity (spec)
- Sutro Group Main (meeting index)
- Yaroslav Technical Sprint 1 (sprint log)
- Extracted 30+ hyperlinks into `docs/references.md`
- Directory structure: docs, findings, plans, research, src
- `CLAUDE.md` with project context for AI assistants
- `CONTEXT.md` with project background and timeline
- Sprint 1 findings documented
- Sprint 2 plan drafted
- Meeting notes index with all external links
- Lectures and homework directories
- `.gitignore` for Python/Jupyter/IDE files