Research

Research notes and literature review for the Sutro Group.

Research as Navigation

Research as Navigation is the thesis behind this project: research is primarily a navigation problem (finding the right question, method, and comparison), and coding agents are the first tools that can navigate it autonomously, because they read state, execute experiments, write results, and loop. The page walks through this idea from ELI5 to PhD level, with examples from our 33 experiments.

Autonomous Research Infrastructure

Peer Research Protocol describes how SutroYaro runs autonomous, multi-researcher experiments. Multiple people use different AI tools (Claude Code, Gemini CLI, Codex CLI, OpenCode) on the same challenge. A locked evaluation harness ensures comparable results. A machine-readable experiment log accumulates findings across researchers.

Key infrastructure:

| Tool | What it does |
| --- | --- |
| AGENT.md | Machine-executable experiment loop for any AI agent |
| src/harness.py | Locked evaluation (5 methods, CLI); agents cannot modify it |
| bin/run-agent | Tool-agnostic launcher with looped mode for overnight runs |
| bin/analyze-log | Progress report and chart from the experiment log |
| research/log.jsonl | 33 experiments in machine-readable format |
| research/search_space.yaml | What the agent can vary, per challenge |

The protocol is challenge-agnostic: it works for sparse parity now and nanoGPT later. See the full design doc for the nanoGPT migration proposal.
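Because the log is plain JSON Lines, any tool in the protocol can consume it. A minimal sketch of reading such a log (the field names below are illustrative, not the actual schema of research/log.jsonl):

```python
import json

def load_experiments(path):
    """Parse a JSONL experiment log: one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a tiny synthetic log (hypothetical field names).
with open("demo_log.jsonl", "w") as f:
    f.write('{"id": 1, "method": "fourier_search", "accuracy": 1.0}\n')
    f.write('{"id": 2, "method": "sgd", "accuracy": 0.87}\n')

log = load_experiments("demo_log.jsonl")
print(len(log), log[0]["method"])
```

An append-only line-per-record format is what lets multiple researchers accumulate findings without merge conflicts.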

Survey

Sparse Parity: A Practitioner's Field Guide ranks all 33 experiments (16 from Phase 1, 17 from Phase 2), provides a decision framework for picking methods, and documents the full AI research process, including parallel agent dispatch.

Topics

Main Finding

For small k, sparse parity is a search problem, not a learning problem

Fourier/random search over C(n,k) subsets is 13-178x faster than SGD for k ≤ 7. Neural nets only become necessary when k ≥ 10 and C(n,k) explodes.
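To make the "search, not learning" framing concrete, here is a minimal sketch of subset search for sparse parity. This is not the repo's solver; the data generation and names are illustrative, and the instance sizes mirror the n=20, k=3 baseline:

```python
import itertools
import random

def sparse_parity_search(X, y, k):
    """Brute-force over all C(n, k) index subsets, returning the first
    subset whose XOR-parity reproduces every label. Fast for small k."""
    n = len(X[0])
    for S in itertools.combinations(range(n), k):
        if all(sum(x[i] for i in S) % 2 == yi for x, yi in zip(X, y)):
            return S
    return None

# Toy instance: n=20 bits, secret subset of size k=3, m=200 samples.
random.seed(0)
n, k, m = 20, 3, 200
secret = tuple(sorted(random.sample(range(n), k)))
X = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
y = [sum(x[i] for i in secret) % 2 for x in X]
print(sparse_parity_search(X, y, k) == secret)
```

With 200 samples, a wrong subset matches all labels with probability about 2^-200, so the search recovers the secret essentially always; the cost is C(20,3) = 1,140 cheap parity checks rather than an SGD training run.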

Papers

| Paper | Year | Relevance | Link |
| --- | --- | --- | --- |
| Hidden Progress in Deep Learning (Barak et al.) | 2022 | SGD learns sparse parity via a hidden Fourier gap | arXiv |
| Matching SQ Lower Bound with Sign SGD (Kou et al.) | 2024 | Theoretically optimal sparse parity solver | arXiv |
| A Tale of Two Circuits (Merrill et al.) | 2023 | Grokking = sparse vs. dense subnetwork competition | arXiv |
| GrokFast (Lee et al.) | 2024 | EMA gradient filter; counterproductive in our regime | GitHub |
| Feature Learning Dynamics under Grokking | 2024 | NTK eigenfunctions align with secret indices | OpenReview |
| Bill Dally - Energy in GPUs | 2024 | Memory cost dominates energy | YouTube |
| DMC4ML (Ding et al.) | 2023 | Data Movement Complexity for ML | arXiv |
| Demmel - Communication-Avoiding Algorithms | 2013 | Lower bounds on data movement | slides |

Other Resources

| Resource | Type | Link |
| --- | --- | --- |
| Fitting Larger Networks into Memory | Article | Medium |
| Sparse Parity background | Notebook | NotebookLM |
| Sparse Parity Optimization | Slides | PDF |
| Hinton's Forward-Forward | Paper + discussion | Group notes |
| ARD Brainstorming | Gemini session | Session |
| parity-nn (minimal codebase) | GitHub | Tsili42/parity-nn |

Concepts

Average Reuse Distance (ARD)

Proxy metric for energy efficiency. Small ARD means data stays in fast cache. Large ARD means data must be fetched from external memory (HBM). Our CacheTracker extends this with LRU cache simulation for realistic estimates.

Data Movement Complexity (DMC)

Added in v0.14.0 based on Yaroslav's Knowledge Sprint #2. DMC = sum of sqrt(stack_distance) for all float accesses (Ding et al., arXiv:2312.14441). Unlike ARD (which averages), DMC penalizes long-distance fetches sub-linearly through the square root, matching the physics of 2D chip layouts. The LRU cache lemma guarantees LRU is within 2x of optimal, so our LRU-based tracker gives realistic estimates.

Baseline (n=20/k=3): ARD 4,104 / DMC 300,298.
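The two metrics can be illustrated on a toy access trace. This is a sketch, not the CacheTracker implementation: it computes exact LRU stack distances on an unbounded stack, and the cold-miss and 1-vs-0-based distance conventions are simplifying assumptions:

```python
import math

def lru_stack_distances(trace):
    """Per-access LRU stack distance: number of distinct addresses touched
    since the previous access to the same address (None for a cold miss)."""
    stack, dists = [], []
    for addr in trace:
        if addr in stack:
            d = stack.index(addr) + 1  # 1 = immediate reuse
            stack.remove(addr)
        else:
            d = None
        stack.insert(0, addr)
        dists.append(d)
    return dists

def ard(trace):
    """Average Reuse Distance over the non-cold accesses."""
    finite = [d for d in lru_stack_distances(trace) if d is not None]
    return sum(finite) / len(finite)

def dmc(trace):
    """Data Movement Complexity: sum of sqrt(stack distance), penalizing
    long-distance fetches sub-linearly (cold misses skipped here)."""
    return sum(math.sqrt(d) for d in lru_stack_distances(trace) if d is not None)

trace = ["a", "b", "a", "c", "b", "a"]
print(ard(trace), round(dmc(trace), 3))
```

The square root is what separates the two: doubling a reuse distance doubles its ARD contribution but raises its DMC contribution by only sqrt(2), matching the 2D-layout cost model of Ding et al.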

The Roadmap

Yaroslav's bigger picture defines three axes: process (agent harnesses), metric (ARD → DMC → GPU), and problem (sparse parity → nanoGPT). Take small steps along one axis at a time. The final exam is energy-efficient nanoGPT training.

The Giraffe Nerve Analogy

Backpropagation is like the recurrent laryngeal nerve in giraffes: it works, but its global memory-access pattern makes it inefficient. The brain runs on ~20 watts using local update rules; we want to find the AI equivalent.