Tooling

How we use Claude Code as a research automation tool for the Sutro Group.

How It Works

Agent Workflow

The human writes specs (CLAUDE.md, DISCOVERIES.md, LAB.md). The lead agent reads those, surveys the problem space, then dispatches isolated sub-agents in parallel. Each sub-agent gets one approach, the experiment template, and shared modules. No sub-agent sees another's results. Outputs feed back into DISCOVERIES.md for the next round.
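The dispatch step can be sketched roughly as follows. This is an illustration only, not the lead agent's actual implementation: `Brief`, `run_sub_agent`, `dispatch`, and `to_discoveries` are hypothetical names, and the sub-agent is a stub that echoes its input instead of launching a real session. The point it shows is isolation: each worker receives only its own brief and never another worker's output.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    """Everything one sub-agent is allowed to see (hypothetical structure)."""
    approach: str
    template: str
    shared_modules: tuple


def run_sub_agent(brief):
    """Stand-in for launching one agent session; it only echoes the brief."""
    return {"approach": brief.approach, "finding": f"result for {brief.approach}"}


def dispatch(approaches, template, shared_modules, max_workers=4):
    """Run one sub-agent per approach in parallel, each with its own brief."""
    briefs = [Brief(a, template, tuple(shared_modules)) for a in approaches]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with approaches.
        return list(pool.map(run_sub_agent, briefs))


def to_discoveries(results):
    """Format findings as lines that could be appended to DISCOVERIES.md."""
    return "\n".join(f"- {r['approach']}: {r['finding']}" for r in results)


if __name__ == "__main__":
    results = dispatch(["approach-a", "approach-b"], "experiment template", ["shared.py"])
    print(to_discoveries(results))
```

Results are merged only after every worker has returned, which mirrors the rule that findings reach DISCOVERIES.md between rounds rather than during one.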

The Stack

| Tool | What it does |
| --- | --- |
| Claude Code | AI coding agent in the terminal: runs experiments, writes findings, manages the MkDocs site |
| CLAUDE.md | Project instructions file that gives Claude context about the repo |
| Superpowers plugin | obra/superpowers: brainstorming, TDD, debugging, parallel agents, code review |
| Custom skills | Anti-slop guide, Ralph Wiggum |
| MCP servers | Extensible tool servers (Google Docs, browser, diagrams) |
| Anti-slop guide | Reference for eliminating AI writing patterns from prose |
| Automation scripts | `sync_google_docs.py` for pulling Google Docs, `sync_telegram.ts` for pulling Telegram threads, session trace export |
| Sync runbook | Weekly/daily/per-session checklists for keeping the knowledge base current |

Scripts (`bin/`)

| Script | What it does | Example |
| --- | --- | --- |
| `bin/run-agent` | Autonomous experiment loop (any AI CLI) | `bin/run-agent --tool claude --max 10` |
| `bin/merge-findings` | Import contributor results via PR | `bin/merge-findings path/to/log.jsonl` |
| `bin/analyze-log` | Progress report from experiment log | `bin/analyze-log --plot` |
| `bin/reproduce-all` | Verify all 14 experiments in 0.28s | `PYTHONPATH=src python3 bin/reproduce-all --budget 10` |
| `bin/gpu_energy.py` | Real GPU energy via Modal Labs (L4) | `modal run bin/gpu_energy.py` |

GPU Energy Measurement

`bin/gpu_energy.py` runs the harness methods on an NVIDIA L4 via Modal Labs and samples the actual power draw (in watts) via pynvml during execution. It reports joules per method alongside the ARD/DMC proxy metrics.

```bash
pip install modal
modal token set          # authenticate (one-time)
modal run bin/gpu_energy.py          # run all defaults
modal run bin/gpu_energy.py --json   # machine-readable
```

Cost: under $0.01 per run. See GPU vs CPU findings for results.
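Turning sampled watts into joules comes down to integrating power over time. The sketch below shows one way to do that with the trapezoidal rule. It is not the code in `bin/gpu_energy.py`: `sample_power` and `energy_joules` are hypothetical names, and the power reader is passed in as a callable so the example runs without a GPU. On real hardware the reader could wrap `pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0`, since that call reports milliwatts.

```python
import time


def sample_power(read_watts, duration_s, interval_s=0.01,
                 clock=time.perf_counter, sleep=time.sleep):
    """Poll read_watts() for about duration_s seconds.

    Returns (timestamps, watts), with timestamps in seconds from the start.
    """
    timestamps, watts = [], []
    start = clock()
    while True:
        elapsed = clock() - start
        timestamps.append(elapsed)
        watts.append(read_watts())
        if elapsed >= duration_s:
            break
        sleep(interval_s)
    return timestamps, watts


def energy_joules(timestamps, watts):
    """Trapezoidal integral of power over time (watts * seconds = joules)."""
    total = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        total += 0.5 * (watts[i] + watts[i - 1]) * dt
    return total


if __name__ == "__main__":
    # Assumption: a constant 50 W reader stands in for the GPU sensor.
    ts, ws = sample_power(lambda: 50.0, duration_s=0.1)
    print(f"{energy_joules(ts, ws):.2f} J over {ts[-1]:.2f} s")
```

This version samples for a fixed duration. A harness that measures a method while it runs would more likely poll from a background thread and stop when the method returns.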

Reproduce All

`bin/reproduce-all` runs every method across all 3 challenges and verifies that the results match the baselines. It supports a `--budget` flag (in milliseconds) that skips experiments over a time ceiling.

```bash
PYTHONPATH=src python3 bin/reproduce-all             # all 14 in 0.28s
PYTHONPATH=src python3 bin/reproduce-all --budget 10  # 6 pass in 0.08s
```
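A budget-aware reproduction loop might look like the sketch below. It rests on assumptions the text does not confirm: that each experiment has a recorded baseline value and a recorded runtime, and that `--budget` skips an experiment when its recorded runtime exceeds the ceiling. The `reproduce` function and its data layout are hypothetical and do not reflect the script's actual internals.

```python
def reproduce(experiments, baselines, budget_ms=None, tol=1e-9):
    """Re-run experiments and compare each result with its baseline.

    experiments: name -> zero-argument callable returning a float
    baselines:   name -> (expected_value, recorded_runtime_ms)
    Returns name -> "pass", "fail", or "skipped".
    """
    report = {}
    for name, run in experiments.items():
        expected, recorded_ms = baselines[name]
        # Assumed rule: skip anything whose recorded runtime is over budget.
        if budget_ms is not None and recorded_ms > budget_ms:
            report[name] = "skipped"
            continue
        value = run()
        report[name] = "pass" if abs(value - expected) <= tol else "fail"
    return report


if __name__ == "__main__":
    experiments = {"fast": lambda: 1.0, "slow": lambda: 2.0}
    baselines = {"fast": (1.0, 2.0), "slow": (2.0, 50.0)}
    print(reproduce(experiments, baselines, budget_ms=10))
```

Deciding from recorded runtimes rather than live timings keeps the skip set stable between runs, at the cost of needing those runtimes stored next to the baselines.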

What Worked

The combination that produced 16 experiments in a few days:

CLAUDE.md as shared context — Every Claude Code session starts by reading the project state, findings, and working style rules
LAB.md as experiment protocol — Enforces one-hypothesis-per-experiment, baselines, and commit discipline
Anti-slop on all prose — Keeps documentation readable by humans, not just LLMs
Parallel agents — Multiple Claude Code instances running independent experiments simultaneously
Sub-2-second iteration — fast.py (numpy) keeps the feedback loop tight enough for hundreds of experiments per hour

What to Try Next

MCP servers for direct Google Docs access (currently using export URLs)
Hooks for auto-running experiments on file save
Custom skills for the Sutro research loop (literature search, hypothesis, experiment, measure)
Memory files for cross-session learning about which hyperparameters work