Peer Research Protocol¶
How SutroYaro runs autonomous, multi-researcher experiments on shared challenges.
Why this exists¶
At Meeting #8 (09 Mar 2026), four people independently ran agent-driven research on the same sparse parity challenge using different tools (Claude Code, Gemini CLI, Codex CLI, OpenCode, plain Python). Each produced valid results. One researcher's agents rewrote the measurement code to inflate scores instead of improving the algorithm.
This protocol solves three problems:
- Comparability: a locked evaluation harness means everyone measures the same way
- Accumulation: a machine-readable experiment log means results can be queried, merged, and built upon across researchers
- Autonomy: an agent-executable loop means any researcher can run overnight experiments without babysitting
Two layers¶
Layer 1: Single-agent loop¶
Each researcher (or their agent) runs experiments against a locked harness. The loop:
1. Read DISCOVERIES.md (what's known)
2. Read research/questions.yaml (what's open)
3. Pick top unchecked item from TODO.md
4. Design a single-variable experiment
5. Run against src/harness.py (locked, cannot modify)
6. Log result to research/log.jsonl
7. Classify: WIN / LOSS / INVALID / INCONCLUSIVE
8. Update TODO.md checkbox
9. Repeat until queue empty or interrupted
The agent-executable version lives in AGENT.md. The human protocol stays in LAB.md. Both produce the same output format.
Layer 2: Peer merge¶
Multiple researchers submit findings via PR. The merge process:
Yad's machine Germain's machine Yaroslav's machine
+------------------+ +------------------+ +------------------+
| AGENT.md loop | | Codex CLI | | Gemini CLI |
| log.jsonl (local)| | log.jsonl (local)| | log.jsonl (local)|
| harness.py (locked) | harness.py (locked) | harness.py (locked)
+--------+---------+ +--------+---------+ +--------+---------+
| | |
| PR | PR | PR
+----------+---------------+----------------------------+
v
+---------------------+
| Shared repo |
| DISCOVERIES.md | <-- merged findings
| research/log.jsonl | <-- merged experiment records
| results/scoreboard.tsv <-- auto-generated
| research/questions.yaml <-- updated
+---------------------+
Each researcher:
- Forks or branches the repo
- Runs their agent loop (any tool)
- Produces a local
research/log.jsonlusing the shared schema - Submits a PR with log entries + findings docs
bin/merge-findingsdeduplicates and integrates
The locked src/harness.py makes results comparable. Everyone measures ARD/DMC/timing the same way.
File map¶
AGENT.md # Agent-executable experiment loop (new)
LAB.md # Human protocol (unchanged)
DISCOVERIES.md # Shared knowledge base (unchanged)
CONTRIBUTING.md # How to submit results (unchanged)
TODO.md # Hypothesis queue with checkboxes (restructured)
src/
harness.py # Locked evaluation: ARD, DMC, timing, baselines
# Rule #9: agents CANNOT modify this file
research/
search_space.yaml # What the agent can vary, per challenge
questions.yaml # Dependency graph of open questions
log.jsonl # Append-only experiment records
results/
scoreboard.tsv # Human-readable leaderboard (from log.jsonl)
progress.png # ARD progress chart (from bin/analyze-log --plot)
checks/
env_check.py # Pre-flight: imports, fast.py, data generation
baseline_check.py # Re-establish baselines on this machine
bin/
run-agent # Launch autonomous cycle
merge-findings # Import contributor log entries
analyze-log # Progress report + chart generation
Running autonomously¶
Three modes, from simplest to most resilient. All are tool-agnostic (Claude Code, Gemini CLI, Codex CLI, OpenCode, or any CLI).
Single cycle -- one AI session, up to N experiments:
Looped mode (recommended for overnight) -- multiple short cycles with fresh context:
bin/run-agent --loop 10 --max 5 # 10 cycles, 5 experiments each
bin/run-agent --loop 20 --tool gemini # works with any AI CLI
The loop stops automatically when: - The hypothesis queue is empty - The circuit breaker trips (5+ INVALID in last 20) - The harness file was modified - Max cycles reached
Cron -- scheduled recurring runs:
After any run:
Tool compatibility¶
The system is designed to work with any AI tool that can read files, run Python, and write files. The bin/run-agent launcher handles the CLI differences.
| Tool | Headless? | How it works | Notes |
|---|---|---|---|
| Claude Code | Yes (claude -p) |
bin/run-agent --tool claude |
Full tool permissions via --allowedTools. 200 turn limit per cycle. |
| Gemini CLI | Yes (gemini -p --yolo) |
bin/run-agent --tool gemini |
--yolo skips confirmation prompts. Free tier: 60 req/min, 1K req/day. 1M token context. |
| Codex CLI | Yes (codex -q) |
bin/run-agent --tool codex |
OpenAI's agent. Included with ChatGPT Plus/Pro. Sandbox security built in. |
| OpenCode | Yes (arg) | bin/run-agent --tool opencode |
Open source, 75+ providers. No vendor lock-in. Use any model. |
| Antigravity | No (IDE only) | Open project manually, follow AGENT.md | agy opens the IDE. Use Manager View for parallel agents. Cannot be driven by bash loop. |
| Custom CLI | Depends | AI_CMD="my-tool -p" bin/run-agent --tool custom |
Any CLI that accepts a prompt on stdin or via flag. |
For GUI tools (Antigravity, Cursor), the workflow is manual:
- Open the project in the IDE
- Tell the agent: "Read AGENT.md. Follow its protocol. Your researcher ID is [name]."
- The agent reads the same files and follows the same loop
- Results go into the same log.jsonl and are mergeable via PR
The locked harness and file-based state mean results are comparable regardless of which tool produced them. That's the point.
The log.jsonl schema¶
Each line is one experiment. The schema is challenge-agnostic so it works for sparse parity now and nanoGPT later.
{
"id": "yad-034",
"researcher": "yad",
"date": "2026-03-11",
"challenge": "sparse-parity",
"hypothesis": "KM influence with 3 samples per bit instead of 5",
"method": "km",
"changed": {"influence_samples": 3},
"baseline": {"method": "km", "influence_samples": 5, "ard": 1585},
"result": {"ard": 2100, "dmc": 45000, "time_s": 0.004, "accuracy": 1.0},
"delta_pct": {"ard": 32.5},
"class": "LOSS",
"notes": "3 samples gives noisy estimates, wrong bit selected 20% of runs"
}
Field definitions:
id:{researcher}-{sequence}. Globally unique.researcher: who ran it (human or agent identity)challenge: which problem (sparse-parity,nanogpt, etc.)method: which algorithm familychanged: dict of what was varied from baseline. One key = one variable.baseline: the comparison point (method + params + metrics)result: measured metrics from the locked harnessdelta_pct: percentage change from baseline per metric. Positive = worse for minimization targets (ARD, time), better for maximization targets (accuracy).class: one ofWIN(improved primary metric),LOSS(worse),INVALID(couldn't run),INCONCLUSIVE(within noise),BASELINE(reference point)notes: free text, what was learned
The distinction between INVALID and LOSS matters. A crashed experiment is not a disproved hypothesis. INVALID means the experiment didn't produce valid measurements. LOSS means it ran correctly but the hypothesis was wrong.
search_space.yaml¶
Defines the bounded mutation space per challenge. The agent can only vary parameters listed here. To add a new parameter or method, a human edits the file and commits.
This prevents agents from introducing unbounded variation (new architectures, code rewrites, external packages) without human review.
Circuit breakers¶
Safety mechanisms borrowed from Tiny-Lab:
- INVALID rate: if 5 of the last 20 log entries are INVALID, halt and require human review
- Lock file:
research/.agent-lockprevents concurrent agent cycles. Stale locks (>2 hours) are cleaned up. - Max experiments per cycle: default 20. The agent stops after this many, even if hypotheses remain.
- Harness checksum:
bin/run-agentverifiessrc/harness.pyhasn't been modified before starting
Moving to a new challenge¶
When the group moves from sparse parity to nanoGPT (or any new problem):
- Add a new section to
research/search_space.yamlwithchallenge: nanogpt - Add new questions to
research/questions.yaml - Write a new harness function in
src/harness.py(or a separatesrc/harness_nanogpt.py) - Establish baselines (logged as
class: BASELINEentries in log.jsonl) - The rest of the infrastructure stays the same
See the "Adding nanoGPT" section at the bottom for a concrete proposal.
How this differs from autoresearch¶
Karpathy's autoresearch (Mar 2026) uses a similar single-agent loop pattern: one mutable file, one locked harness, one metric, git as ratchet. Our design differs in:
| Aspect | autoresearch | SutroYaro |
|---|---|---|
| Researchers | 1 | Multiple, via PR |
| Scope | Optimize one training script | Compare methods across families |
| Log format | TSV (local only) | JSONL (mergeable across researchers) |
| Git strategy | Ratchet (branch = best, discard = reset) | History (all experiments kept) |
| Agent protocol | program.md (one file) |
AGENT.md + search_space.yaml (structured) |
| Time budget | Fixed 5 min | Per-method (GF2 takes 500us, SGD takes 2min) |
| Challenge scope | Single (LLM pretraining) | Multi-challenge (sparse parity, then nanoGPT) |
We were already doing multi-researcher autonomous research before autoresearch dropped. This protocol formalizes what the group was already doing and adds machine-readable infrastructure.
Adding nanoGPT (proposal)¶
Yaroslav's three-axis roadmap (Meeting #8) defines the final exam: energy-efficient training of Karpathy's nanoGPT. Sparse parity is practice.
What changes¶
The core protocol stays identical. What changes per challenge:
Harness: src/harness.py gets a measure_nanogpt() function alongside measure_sparse_parity(). The nanoGPT harness measures:
- val_bpb (bits per byte on a fixed validation set, vocabulary-independent)
- ard (average reuse distance of one training step)
- dmc (data movement complexity)
- time_s (wall clock for fixed step count)
- peak_memory_mb
Search space: new block in search_space.yaml:
challenge: nanogpt
version: 1
methods:
- standard_sgd
- adam
- muon
- sign_sgd
- curriculum
- per_layer
parameters:
n_layers: [2, 4, 6, 8, 12]
n_heads: [2, 4, 8]
d_model: [64, 128, 256, 512]
lr: [0.0001, 0.0003, 0.001, 0.003]
batch_size: [4, 8, 16, 32]
context_length: [64, 128, 256]
weight_decay: [0.0, 0.01, 0.1]
warmup_steps: [0, 100, 500]
metrics:
primary: val_bpb
secondary: [ard, dmc, time_s, peak_memory_mb]
locked_in: src/harness.py
Baselines: run standard nanoGPT config, log as class: BASELINE, then the loop begins.
Training budget: fixed wall-clock (like autoresearch) makes sense here since model size varies. 5 minutes is reasonable for a tiny GPT on a single machine.
What transfers from sparse parity¶
The sparse parity campaign produced knowledge that directly applies:
- Per-layer updates work without accuracy loss (may reduce ARD for deeper nets)
- Curriculum learning transfers (train small, expand)
- W1 dominance likely transfers (embedding matrix will dominate in nanoGPT)
- Cache model matters more than raw ARD (L2 can eliminate misses for small models)
- Sign SGD converges faster in some regimes (worth testing on nanoGPT)
- Local learning rules fail on parity but may work on language (worth re-testing)
These become starting hypotheses for the nanoGPT queue. The research/questions.yaml would include dependencies like:
- id: nanogpt-Q0.1
question: What is the baseline val_bpb for standard nanoGPT in 5 minutes?
depends_on: []
resolved_by: []
- id: nanogpt-Q1.1
question: Does per-layer update reduce ARD on nanoGPT without hurting val_bpb?
depends_on: [nanogpt-Q0.1]
resolved_by: []
prior: "Per-layer gave 3.8% ARD improvement on sparse parity (exp_a, exp_c)"
Timeline¶
Moving to nanoGPT is a group decision. The protocol is ready whenever the group decides sparse parity has taught what it can teach. The three-axis roadmap suggests taking one step at a time: first improve process (this protocol), then improve metric (actual GPU measurement), then change problem (nanoGPT).