Research¶
Research notes and literature review for the Sutro Group.
Research as Navigation¶
Research as Navigation is the thesis behind this project: research is primarily a navigation problem (finding the right question, method, comparison), and coding agents are the first tool that can navigate autonomously because they read state, execute experiments, write results, and loop. The page walks through this idea from ELI5 to PhD level, with examples from our 33 experiments.
Autonomous Research Infrastructure¶
Peer Research Protocol describes how SutroYaro runs autonomous, multi-researcher experiments. Multiple people use different AI tools (Claude Code, Gemini CLI, Codex CLI, OpenCode) on the same challenge. A locked evaluation harness ensures comparable results. A machine-readable experiment log accumulates findings across researchers.
Key infrastructure:
| Tool | What it does |
|---|---|
| AGENT.md | Machine-executable experiment loop for any AI agent |
| src/harness.py | Locked evaluation (5 methods, CLI); agents cannot modify it |
| bin/run-agent | Tool-agnostic launcher with a looped mode for overnight runs |
| bin/analyze-log | Progress report and chart from the experiment log |
| research/log.jsonl | 33 experiments in machine-readable format |
| research/search_space.yaml | What the agent can vary, per challenge |
The protocol is challenge-agnostic: it works for sparse parity now and nanoGPT later. See the full design doc for the nanoGPT migration proposal.
Survey¶
Sparse Parity: A Practitioner's Field Guide ranks all 33 experiments (16 Phase 1 + 17 Phase 2), provides a decision framework for picking methods, and documents the full AI research process including parallel agent dispatch.
Topics¶
- Sparse parity learning theory: literature review
- Average Reuse Distance: theory, measurement, and cache simulation
- Forward-Forward algorithm: tested, 25x worse ARD
- Sign SGD: solves k=5, 2x faster
- Per-layer forward-backward: 3.8% ARD improvement, converges identically
- Curriculum learning: 14.6x speedup on n=50
- Scaling frontier: SGD breaks at n^k > 100K
- Blank-slate approaches: Fourier, evolutionary, feature selection
- 17 proposed approaches: all completed, results in survey
- Deeper networks (5-10 layers) where FF's locality advantage may appear
- Hybrid approaches for k=8-9 (combinatorial search with pruning)
Main Finding¶
For small k, sparse parity is a search problem, not a learning problem.
Fourier/random search over C(n,k) subsets is 13-178x faster than SGD for k ≤ 7. Neural nets only become necessary when k ≥ 10 and C(n,k) explodes.
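The search view can be sketched with plain brute force over C(n, k) index subsets (a deliberately minimal version, not the project's Fourier or random-search variants): accept the first subset whose XOR-parity matches every labeled sample. For a wrong subset, each uniform sample agrees only with probability 1/2, so a few hundred samples rule it out.

```python
import itertools

def parity_search(X, y, k):
    """Brute-force search for the secret index set of a sparse parity.

    X: list of 0/1 input vectors; y: list of 0/1 labels; k: subset size.
    Returns the first k-subset S with XOR of x[i] for i in S matching y
    on all samples, or None if no subset is consistent.
    """
    n = len(X[0])
    for S in itertools.combinations(range(n), k):
        if all(sum(x[i] for i in S) % 2 == yi for x, yi in zip(X, y)):
            return S
    return None
```

For k = 3 and n = 20 this is only C(20, 3) = 1140 candidate subsets, each checked with a handful of additions per sample, which is why search beats SGD so decisively in this regime.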
Papers¶
| Paper | Year | Relevance | Link |
|---|---|---|---|
| Hidden Progress in Deep Learning (Barak et al.) | 2022 | SGD learns sparse parity via a hidden Fourier gap | arXiv |
| Matching SQ Lower Bound with Sign SGD (Kou et al.) | 2024 | Theoretically optimal sparse parity solver | arXiv |
| A Tale of Two Circuits (Merrill et al.) | 2023 | Grokking as sparse vs. dense subnetwork competition | arXiv |
| GrokFast (Lee et al.) | 2024 | EMA gradient filter; counterproductive in our regime | GitHub |
| Feature Learning Dynamics under Grokking | 2024 | NTK eigenfunctions align with the secret indices | OpenReview |
| Bill Dally - Energy in GPUs | 2024 | Memory access dominates energy cost | YouTube |
| DMC4ML (Ding et al.) | 2023 | Data Movement Complexity for ML | arXiv |
| Demmel - Communication-Avoiding Algorithms | 2013 | Lower bounds on data movement | slides |
Other Resources¶
| Resource | Type | Link |
|---|---|---|
| Fitting Larger Networks into Memory | Article | Medium |
| Sparse Parity background | Notebook | NotebookLM |
| Sparse Parity Optimization | Slides | |
| Hinton's Forward-Forward | Paper + Discussion | Group notes |
| ARD Brainstorming | Gemini session | Session |
| parity-nn (minimal codebase) | GitHub | Tsili42/parity-nn |
Concepts¶
Average Reuse Distance (ARD)¶
Proxy metric for energy efficiency. Small ARD means data stays in fast cache. Large ARD means data must be fetched from external memory (HBM). Our CacheTracker extends this with LRU cache simulation for realistic estimates.
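The metric reduces to reuse-distance (stack-distance) bookkeeping over a memory access trace. A minimal sketch; charging first-time "cold" accesses the current working-set size is an assumption of this sketch, and the project's CacheTracker may use a different convention:

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Reuse distance per access: the number of distinct addresses
    touched since the previous access to the same address.

    Cold (first-time) accesses are charged the current number of
    distinct addresses seen so far (an assumed convention).
    """
    stack = OrderedDict()  # LRU order: most recently used at the end
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            # Distinct addresses more recent than this one.
            d = len(keys) - keys.index(addr) - 1
            stack.move_to_end(addr)
        else:
            d = len(stack)
            stack[addr] = True
        dists.append(d)
    return dists

def average_reuse_distance(trace):
    dists = reuse_distances(trace)
    return sum(dists) / len(dists)
```

An immediate re-access scores 0 (the data is still hot), while an access that has to reach past the whole working set scores its size, matching the intuition that small ARD means the data stayed in fast cache.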
Data Movement Complexity (DMC)¶
Added in v0.14.0 based on Yaroslav's Knowledge Sprint #2. DMC = sum of sqrt(stack_distance) for all float accesses (Ding et al., arXiv:2312.14441). Unlike ARD (which averages), DMC penalizes long-distance fetches sub-linearly through the square root, matching the physics of 2D chip layouts. The LRU cache lemma guarantees LRU is within 2x of optimal, so our LRU-based tracker gives realistic estimates.
Baseline (n=20/k=3): ARD 4,104 / DMC 300,298.
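Under the same stack-distance convention, DMC is a one-line change from ARD: sum sqrt of each distance instead of averaging. A sketch (cold accesses again charged the current working-set size, an assumption of this sketch):

```python
import math
from collections import OrderedDict

def data_movement_complexity(trace):
    """DMC = sum of sqrt(stack distance) over all accesses
    (Ding et al., arXiv:2312.14441).

    The square root charges long-distance fetches sub-linearly,
    matching signal-propagation cost on a 2D chip layout.
    """
    stack = OrderedDict()  # LRU order: most recently used at the end
    total = 0.0
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            d = len(keys) - keys.index(addr) - 1
            stack.move_to_end(addr)
        else:
            d = len(stack)  # cold access: assumed convention
            stack[addr] = True
        total += math.sqrt(d)
    return total
```

Note the different aggregation: ARD divides by the number of accesses, so it is scale-free, while DMC grows with trace length and so behaves like a total cost rather than a rate.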
The Roadmap¶
Yaroslav's bigger picture defines three axes: process (agent harnesses), metric (ARD to DMC to GPU), problem (sparse parity to nanoGPT). Take small steps along one axis at a time. The final exam is energy-efficient nanoGPT training.
The Giraffe Nerve Analogy¶
Backpropagation is like the recurrent laryngeal nerve in giraffes: it works, but its global memory access pattern makes it inefficient. The brain runs on roughly 20 watts using local update rules. We want to find the AI equivalent.