Sprint 1 Findings
Historical, Sprint 1 of 3
This was the first sprint. All candidate algorithms recommended here (Forward-Forward, per-layer updates, local learning) have since been implemented and tested. See Exp E (FF: 25x worse ARD), Exp C (per-layer: 3.8% improvement), and the changelog for the full story.
Date: 02 Mar 2026 | Duration: 2.5 hours
Summary
Claude identified a gradient-fusion strategy that modestly improves energy efficiency: cache reuse improved by 16%. A larger improvement requires a different learning algorithm.
Setup
- 3-bit parity task, tiny neural network
- Pure Python implementation (no PyTorch overhead)
- <1 second total runtime constraint
- Tools: Claude Code, Gemini, Colab
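For reference, a minimal pure-Python sketch of the kind of setup described above: a tiny MLP trained on 3-bit parity with plain backprop. The hidden width, learning rate, and epoch count here are assumptions — the report only says "tiny" and "pure Python".

```python
import math
import random

random.seed(0)

# 3-bit parity dataset: 8 inputs, label = XOR of the three bits.
DATA = [([float((i >> 2) & 1), float((i >> 1) & 1), float(i & 1)],
         float(((i >> 2) ^ (i >> 1) ^ i) & 1)) for i in range(8)]

H = 8  # hidden width (assumption; the report only says "tiny")
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

lr = 0.5
for epoch in range(2000):
    for x, t in DATA:
        h, y = forward(x)
        # Backward pass: W1/b1 are re-read here, long after the forward
        # read -- the large reuse distance flagged in the findings below.
        dy = y - t  # dL/dz2 for sigmoid + cross-entropy
        for j in range(H):
            dh = dy * W2[j] * (1.0 - h[j] * h[j])  # tanh'
            W2[j] -= lr * dy * h[j]
            for i in range(3):
                W1[j][i] -= lr * dh * x[i]
            b1[j] -= lr * dh
        b2 -= lr * dy

acc = sum((forward(x)[1] > 0.5) == (t > 0.5) for x, t in DATA) / len(DATA)
print(acc)
```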
Process
```mermaid
flowchart LR
    A[Prototype 3-bit\nparity network] --> B[Make it fast\n< 1 second]
    B --> C[Add ARD\nmetrics]
    C --> D[Ask Claude to\nimprove ARD]
    D --> E[Gradient fusion\nidentified]
    E --> F[16% improvement\nbut bottleneck remains]
```
Finding: ARD Bottleneck in Backprop
The Bottleneck
Parameter tensors are read twice, with the entire forward+backward pass in between, so their reuse distance spans nearly the whole training step. Gradient fusion addresses only 5% of total memory reads.
| Buffer | Floats Read | % of Total | Reuse Distance | Changed? |
|---|---|---|---|---|
| W1 | 6,000 | 32% | ~15,000 | No |
| b1 | 2,000 | 11% | - | No |
| dW2 | 1,001 | 5% | 16,005 -> 3,002 | Yes |
| db2 | 1,001 | 5% | 18,005 -> 5,002 | Yes |
The improved buffers (dW2 and db2) contribute only 1,001 floats each out of 19,013 total -- about 5% apiece of the weighted sum.
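The reuse-distance metric behind the table can be sketched as follows. The access trace below is hypothetical (it does not reproduce the table's exact numbers); it only illustrates how fusing the optimizer update into the backward pass collapses dW2's reuse distance by consuming the gradient right after it is produced.

```python
from collections import defaultdict

def reuse_distances(trace):
    """Distance between consecutive reads of each buffer, measured in
    floats read in between (the metric the table appears to use)."""
    last_pos = {}
    pos = 0
    dists = defaultdict(list)
    for buf, nfloats in trace:
        if buf in last_pos:
            dists[buf].append(pos - last_pos[buf])
        last_pos[buf] = pos
        pos += nfloats
    return dict(dists)

# Hypothetical trace for one training step, unfused: dW2 is produced in
# the backward pass, then re-read much later by the optimizer update.
unfused = [("W1", 6000), ("b1", 2000), ("W2", 1000), ("act", 9000),
           ("dW2", 1001), ("db2", 1001), ("dW1", 6000), ("dW2", 1001)]
# Fused: the update consumes dW2 immediately after it is written.
fused = [("W1", 6000), ("b1", 2000), ("W2", 1000), ("act", 9000),
         ("dW2", 1001), ("dW2", 1001), ("db2", 1001), ("dW1", 6000)]

print(reuse_distances(unfused)["dW2"])  # large gap
print(reuse_distances(fused)["dW2"])    # gap shrinks to adjacent reads
```

The point the table makes is visible here: fusion shrinks the gradient buffers' reuse distances, but W1 and b1 (the bulk of the reads) are untouched because their two reads bracket the whole step.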
Conclusion
Gradient fusion captured the easy wins (the gradient buffers), but the bottleneck is W1 and b1 being read during the forward pass and again at the end of the backward pass. Fixing that requires one of:
- Per-layer forward-backward -- compute Layer 1's backward pass and update before Layer 2's forward pass (this changes the math)
- Forward-Forward algorithm -- no backward pass at all
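The ordering change behind the per-layer option can be shown schematically. This is a sketch of access order only, not a training implementation; the layer-local loss signal that would make the early backward step well-defined is exactly the part that "changes the math".

```python
# Schematic comparison of memory-access order (not real training code).
# Each entry records which phase touches which layer's parameters; the
# point is that per-layer scheduling re-touches W1 immediately,
# collapsing its reuse distance.

def standard_backprop(layers):
    order = []
    for name in layers:                  # full forward pass...
        order.append(("forward", name))
    for name in reversed(layers):        # ...then full backward pass
        order.append(("backward", name))
    return order

def per_layer(layers):
    order = []
    for name in layers:                  # backward/update follows each
        order.append(("forward", name))  # layer's forward immediately,
        order.append(("backward", name)) # driven by a *local* loss
    return order

layers = ["W1", "W2"]
print(standard_backprop(layers))
print(per_layer(layers))
```

Forward-Forward goes further by dropping the backward phase entirely, replacing it with a second forward pass and a layer-local objective, so no parameter buffer has to survive across the whole step.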
Artifacts
- Code: cybertronai/sutro
- Benchmark: sparse_parity_benchmark.py
- Colab: fast version
- Gemini sessions: runtime estimation, ARD brainstorming, ARD discussion