
Session Findings: 2026-03-11

Post-Meeting #8 sync, metric work, and task triage.

What happened at Meeting #8 (09 Mar)

Four people demoed their agent setups on the sparse parity challenge:

  • Yad: Claude Code harness with parallel sub-agents. Found a GF(2) solver (1000x faster than SGD). Yaroslav cloned the repo, ran it, and verified the result independently using Gemini; Yaroslav also built a web visualizer for the algorithm.
  • Michael: Claude-based approach. His agents preferred methods from the 1990s (evolutionary search, exhaustive enumeration).
  • Germain: Replit-based "Research OS" with supervisor/researcher/verifier agents. Solutions favored 2010s methods. His agents also tried to rewrite the ARD measurement code to get better scores instead of improving the algorithm.
  • Yaroslav: Presented Knowledge Sprint #2 on energy metrics and the "bigger picture" roadmap.

Homework for next Monday: improve Challenge #1 using ARD as the energy proxy, and present both results and process.

Findings from this session

1. DMC metric added to tracker

Data Movement Complexity (Ding et al., arXiv:2312.14441) computes sum(sqrt(stack_distance)) for all float accesses. Unlike ARD (which averages distance), DMC penalizes long-distance fetches sub-linearly through the square root, matching the physics of 2D chip layouts where memory cost scales with sqrt(distance).

Baseline (n=20/k=3, single tracked training step):

Metric                  Value
ARD                     4,104 floats
DMC                     300,298
Total floats accessed   9,646

Our tracker already measured stack distance in floats (clock advances by buffer size, not instruction count). Adding DMC was one line.
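The formula above can be sketched in a few lines. This is a minimal illustration of DMC vs. ARD over a list of stack distances, not the actual tracker.py code; in particular, skipping cold misses (infinite stack distance) is an assumption here, not necessarily what the tracker does.

```python
import math

def dmc(stack_distances):
    """Data Movement Complexity (Ding et al., arXiv:2312.14441):
    sum of sqrt(stack distance) over all float accesses.
    Assumption: cold misses (infinite distance) are skipped."""
    return sum(math.sqrt(d) for d in stack_distances if d != math.inf)

def ard(stack_distances):
    """Average Reuse Distance: mean of the finite stack distances."""
    finite = [d for d in stack_distances if d != math.inf]
    return sum(finite) / len(finite) if finite else 0.0

# Toy trace: sqrt compresses the long-distance accesses,
# so DMC grows sub-linearly where ARD grows linearly.
distances = [1, 4, 9, 16, math.inf]
print(dmc(distances))  # 1 + 2 + 3 + 4 = 10.0
print(ard(distances))  # (1 + 4 + 9 + 16) / 4 = 7.5
```

The sub-linear penalty is the whole point: doubling a reuse distance raises its DMC contribution by only sqrt(2), matching the sqrt(distance) wire-cost model for 2D layouts.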

2. Germain's hidden=64 is not a locality win

Germain's agents found that depth-1/hidden-64 drops ARD from ~48 to ~33 (his numbers) or from 6,589 to 2,129 (our reproduction). Yaroslav flagged that total_accesses also dropped proportionally.

We confirmed: ARD per float accessed is essentially identical (0.367 vs 0.368 across 5 seeds). The "improvement" comes from the model being ~68% smaller, so it touches ~68% fewer floats; the locality of each individual access is unchanged. Both configs solve n=20/k=3 at 100% accuracy.

This matters because it means shrinking the model is not a path to better energy efficiency per unit of computation. You save energy the same way you save time: by doing less work. That's useful but not what the group is after.
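The check we ran reduces to a single ratio. The sketch below uses made-up access counts purely to illustrate the arithmetic (the real per-seed counts are not reproduced here): when ARD and total accesses shrink by the same factor, the per-access ratio is flat.

```python
def ard_per_access(ard_total, total_accesses):
    """Locality per float touched. Separates 'the model got smaller'
    from 'each access got more local'."""
    return ard_total / total_accesses

# Hypothetical access counts for illustration only: a model ~68% smaller
# touches ~68% fewer floats, so the two ratios come out nearly equal.
print(round(ard_per_access(6589, 17900), 3))  # baseline config
print(round(ard_per_access(2129, 5800), 3))   # hidden=64 config
```

A drop in raw ARD with a flat ARD-per-access ratio means less work, not better locality.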

3. Metric isolation rule

Germain's agents rewrote the ARD measurement code to inflate scores. We already had read-only benchmark code for sub-agents, but now it's an explicit rule in LAB.md (#9): agents cannot modify tracker.py, cache_tracker.py, data.py, or config.py.
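One cheap way to enforce a rule like this mechanically is to hash the protected files before and after an agent run. The snippet below is a hypothetical sketch of that idea, not code from the repo; the function names and the checked-file list (taken from rule #9) are assumptions.

```python
import hashlib
from pathlib import Path

# Files protected by LAB.md rule #9 (metric isolation).
PROTECTED = ["tracker.py", "cache_tracker.py", "data.py", "config.py"]

def snapshot(root):
    """SHA-256 digest of each protected file under root."""
    return {f: hashlib.sha256(Path(root, f).read_bytes()).hexdigest()
            for f in PROTECTED if Path(root, f).exists()}

def assert_unmodified(before, after):
    """Raise if any protected file changed between snapshots."""
    changed = [f for f in before if after.get(f) != before[f]]
    if changed:
        raise RuntimeError(f"Metric isolation violated: {changed}")
```

Taking a snapshot before dispatching sub-agents and comparing afterward turns a policy in LAB.md into a hard failure.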

4. Linear classifier paper (arXiv:2309.06979) is not applicable

The Telegram reading group flagged this paper as showing "a linear classifier can solve parity." On reading, it's about Chain-of-Thought auto-regressive prediction with intermediate reasoning tokens. A linear model on raw {-1,+1} inputs cannot solve parity (all pairwise correlations are zero, proven in exp_feature_select). The paper is relevant for the nanoGPT final exam, not for our current benchmark.
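The zero-correlation claim is easy to verify by brute force. The following sketch (not the exp_feature_select code itself) enumerates all inputs in {-1,+1}^n and shows that every single coordinate has zero correlation with the k-sparse parity label, which is why a linear model on raw inputs has no signal to fit.

```python
from itertools import product

def parity_correlations(n=6, k=3):
    """For each coordinate x_i, compute E[x_i * y] over all 2^n inputs
    in {-1,+1}^n, where y = parity of the first k coordinates.
    Every correlation is exactly zero."""
    corrs = []
    for i in range(n):
        total = 0
        for x in product([-1, 1], repeat=n):
            y = 1
            for j in range(k):
                y *= x[j]          # parity as a product of +/-1 bits
            total += x[i] * y
        corrs.append(total / 2 ** n)
    return corrs

print(parity_correlations())  # all zeros
```

The same cancellation holds for any strict subset of the k relevant bits, which is what makes sparse parity hard for gradient methods on raw inputs.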

5. Yaroslav's three-axis roadmap

The "bigger picture" doc defines progress along three axes:

  1. Process (orange): improve how you use agents to find better algorithms. Yad's harness, Germain's Research OS, Michael's Claude approach.
  2. Metric (green): make the energy proxy more realistic. ARD to DMC to actual GPU measurement.
  3. Problem (blue): make the task harder. Sparse parity to nanoGPT.

The final exam: energy-efficient training of Karpathy's nanoGPT. Sparse parity is practice. Take small steps along one axis at a time.

Files changed

  • src/sparse_parity/tracker.py: added DMC computation
  • LAB.md: added rule #9 (metric isolation), DMC baseline in table
  • docs/tasks/: 6 task files tracking Meeting #8 feedback
  • docs/tooling/sync-runbook.md: weekly/daily/per-session checklists
  • 6 new Google Docs synced from Meeting #8
  • mkdocs.yml: nav entries for all new pages