Research¶
Research notes and literature review for the Sutro Group.
Research as Navigation¶
Research as Navigation is the thesis behind this project: research is primarily a navigation problem (finding the right question, method, comparison), and coding agents are the first tool that can navigate autonomously because they read state, execute experiments, write results, and loop. The page walks through this idea from ELI5 to PhD level, with examples from our 33 experiments.
Autonomous Research Infrastructure¶
Peer Research Protocol describes how SutroYaro runs autonomous, multi-researcher experiments. Multiple people use different AI tools (Claude Code, Gemini CLI, Codex CLI, OpenCode) on the same challenge. A locked evaluation harness ensures comparable results. A machine-readable experiment log accumulates findings across researchers.
Key infrastructure:
| Tool | What it does |
|---|---|
| AGENT.md | Machine-executable experiment loop for any AI agent |
| src/harness.py | Locked evaluation (5 methods, CLI); agents cannot modify it |
| bin/run-agent | Tool-agnostic launcher with a looped mode for overnight runs |
| bin/analyze-log | Progress report and chart from the experiment log |
| research/log.jsonl | 33 experiments in machine-readable format |
| research/search_space.yaml | What the agent can vary, per challenge |
The protocol is challenge-agnostic: it works for sparse parity now and nanoGPT later. See the full design doc for the nanoGPT migration proposal.
Survey¶
Sparse Parity: A Practitioner's Field Guide ranks all 33 experiments (16 Phase 1 + 17 Phase 2), provides a decision framework for picking methods, and documents the full AI research process including parallel agent dispatch.
Topics¶
- Sparse parity learning theory: literature review
- Average Reuse Distance: theory, measurement, and cache simulation
- Forward-Forward algorithm: tested, 25x worse ARD
- Sign SGD: solves k=5, 2x faster
- Per-layer forward-backward: 3.8% ARD improvement, converges identically
- Curriculum learning: 14.6x speedup on n=50
- Scaling frontier: SGD breaks at n^k > 100K
- Blank-slate approaches: Fourier, evolutionary, feature selection
- 17 proposed approaches: all completed, results in survey
- Deeper networks (5-10 layers) where FF's locality advantage may appear
- Hybrid approaches for k=8-9 (combinatorial search with pruning)
Main Finding¶
For small k, sparse parity is a search problem, not a learning problem.
Fourier/random search over C(n,k) subsets is 13-178x faster than SGD for k ≤ 7. Neural nets only become necessary when k ≥ 10 and C(n,k) explodes.
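The search view can be sketched with plain brute force over C(n, k) index subsets (a deliberately minimal version, not the project's Fourier or random-search variants): accept the first subset whose XOR-parity matches every labeled sample. For a wrong subset, each uniform sample agrees only with probability 1/2, so a few hundred samples rule it out.

```python
import itertools

def parity_search(X, y, k):
    """Brute-force search for the secret index set of a sparse parity.

    X: list of 0/1 input vectors; y: list of 0/1 labels; k: subset size.
    Returns the first k-subset S with XOR of x[i] for i in S matching y
    on all samples, or None if no subset is consistent.
    """
    n = len(X[0])
    for S in itertools.combinations(range(n), k):
        if all(sum(x[i] for i in S) % 2 == yi for x, yi in zip(X, y)):
            return S
    return None
```

For k = 3 and n = 20 this is only C(20, 3) = 1140 candidate subsets, each checked with a handful of additions per sample, which is why search beats SGD so decisively in this regime.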
Papers¶
| Paper | Year | Relevance | Link |
|---|---|---|---|
| Hidden Progress in Deep Learning (Barak et al.) | 2022 | SGD learns sparse parity via a hidden Fourier gap | arXiv |
| Matching SQ Lower Bound with Sign SGD (Kou et al.) | 2024 | Theoretically optimal sparse parity solver | arXiv |
| A Tale of Two Circuits (Merrill et al.) | 2023 | Grokking as sparse vs. dense subnetwork competition | arXiv |
| GrokFast (Lee et al.) | 2024 | EMA gradient filter; counterproductive in our regime | GitHub |
| Feature Learning Dynamics under Grokking | 2024 | NTK eigenfunctions align with the secret indices | OpenReview |
| Bill Dally - Energy in GPUs | 2024 | Memory access dominates energy cost | YouTube |
| DMC4ML (Ding et al.) | 2023 | Data Movement Complexity for ML | arXiv |
| Demmel - Communication-Avoiding Algorithms | 2013 | Lower bounds on data movement | slides |
Other Resources¶
| Resource | Type | Link |
|---|---|---|
| Fitting Larger Networks into Memory | Article | Medium |
| Sparse Parity background | Notebook | NotebookLM |
| Sparse Parity Optimization | Slides | |
| Hinton's Forward-Forward | Paper + Discussion | Group notes |
| ARD Brainstorming | Gemini session | Session |
| parity-nn (minimal codebase) | GitHub | Tsili42/parity-nn |
Concepts¶
Average Reuse Distance (ARD)¶
Proxy metric for energy efficiency. Small ARD means data stays in fast cache. Large ARD means data must be fetched from external memory (HBM). Our CacheTracker extends this with LRU cache simulation for realistic estimates.
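The metric reduces to reuse-distance (stack-distance) bookkeeping over a memory access trace. A minimal sketch; charging first-time "cold" accesses the current working-set size is an assumption of this sketch, and the project's CacheTracker may use a different convention:

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Reuse distance per access: the number of distinct addresses
    touched since the previous access to the same address.

    Cold (first-time) accesses are charged the current number of
    distinct addresses seen so far (an assumed convention).
    """
    stack = OrderedDict()  # LRU order: most recently used at the end
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            # Distinct addresses more recent than this one.
            d = len(keys) - keys.index(addr) - 1
            stack.move_to_end(addr)
        else:
            d = len(stack)
            stack[addr] = True
        dists.append(d)
    return dists

def average_reuse_distance(trace):
    dists = reuse_distances(trace)
    return sum(dists) / len(dists)
```

An immediate re-access scores 0 (the data is still hot), while an access that has to reach past the whole working set scores its size, matching the intuition that small ARD means the data stayed in fast cache.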
Data Movement Complexity (DMC)¶
Added in v0.14.0 based on Yaroslav's Knowledge Sprint #2. DMC = sum of sqrt(stack_distance) for all float accesses (Ding et al., arXiv:2312.14441). Unlike ARD (which averages), DMC penalizes long-distance fetches sub-linearly through the square root, matching the physics of 2D chip layouts. The LRU cache lemma guarantees LRU is within 2x of optimal, so our LRU-based tracker gives realistic estimates.
Baseline (n=20/k=3): ARD 4,104 / DMC 300,298.
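Under the same stack-distance convention, DMC is a one-line change from ARD: sum sqrt of each distance instead of averaging. A sketch (cold accesses again charged the current working-set size, an assumption of this sketch):

```python
import math
from collections import OrderedDict

def data_movement_complexity(trace):
    """DMC = sum of sqrt(stack distance) over all accesses
    (Ding et al., arXiv:2312.14441).

    The square root charges long-distance fetches sub-linearly,
    matching signal-propagation cost on a 2D chip layout.
    """
    stack = OrderedDict()  # LRU order: most recently used at the end
    total = 0.0
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            d = len(keys) - keys.index(addr) - 1
            stack.move_to_end(addr)
        else:
            d = len(stack)  # cold access: assumed convention
            stack[addr] = True
        total += math.sqrt(d)
    return total
```

Note the different aggregation: ARD divides by the number of accesses, so it is scale-free, while DMC grows with trace length and so behaves like a total cost rather than a rate.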
The Roadmap¶
Yaroslav's bigger picture defines three axes: process (agent harnesses), metric (ARD to DMC to GPU), problem (sparse parity to nanoGPT). Take small steps along one axis at a time. The final exam is energy-efficient nanoGPT training.
The Giraffe Nerve Analogy¶
Backpropagation is like the recurrent laryngeal nerve in giraffes: it works, but its global memory access pattern makes it inefficient. The brain runs on roughly 20 watts using local update rules. We want to find the AI equivalent.