Sprint 1 Findings
Historical, Sprint 1 of 3
This was the first sprint. All candidate algorithms recommended here (Forward-Forward, per-layer updates, local learning) have since been implemented and tested. See Exp E (FF: 25x worse ARD), Exp C (per-layer: 3.8% improvement), and the changelog for the full story.
Date: 02 Mar 2026 | Duration: 2.5 hours
Summary
Claude identified a gradient-fusion strategy that modestly improves energy efficiency: cache reuse improved by 16%. A larger improvement requires a different learning algorithm.
Setup
- 3-bit parity task, tiny neural network
- Pure Python implementation (no PyTorch overhead)
- <1 second total runtime constraint
- Tools: Claude Code, Gemini, Colab
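For reference, a minimal pure-Python sketch of the kind of setup described above: a tiny MLP trained on 3-bit parity with plain backprop. The hidden width, learning rate, and epoch count here are assumptions — the report only says "tiny" and "pure Python".

```python
import math
import random

random.seed(0)

# 3-bit parity dataset: 8 inputs, label = XOR of the three bits.
DATA = [([float((i >> 2) & 1), float((i >> 1) & 1), float(i & 1)],
         float(((i >> 2) ^ (i >> 1) ^ i) & 1)) for i in range(8)]

H = 8  # hidden width (assumption; the report only says "tiny")
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

lr = 0.5
for epoch in range(2000):
    for x, t in DATA:
        h, y = forward(x)
        # Backward pass: W1/b1 are re-read here, long after the forward
        # read -- the large reuse distance flagged in the findings below.
        dy = y - t  # dL/dz2 for sigmoid + cross-entropy
        for j in range(H):
            dh = dy * W2[j] * (1.0 - h[j] * h[j])  # tanh'
            W2[j] -= lr * dy * h[j]
            for i in range(3):
                W1[j][i] -= lr * dh * x[i]
            b1[j] -= lr * dh
        b2 -= lr * dy

acc = sum((forward(x)[1] > 0.5) == (t > 0.5) for x, t in DATA) / len(DATA)
print(acc)
```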
Process
```mermaid
flowchart LR
    A[Prototype 3-bit\nparity network] --> B[Make it fast\n< 1 second]
    B --> C[Add ARD\nmetrics]
    C --> D[Ask Claude to\nimprove ARD]
    D --> E[Gradient fusion\nidentified]
    E --> F[16% improvement\nbut bottleneck remains]
```
Finding: ARD Bottleneck in Backprop
The Bottleneck
Parameter tensors are read twice, with the entire forward+backward pass in between, so their reuse distance spans nearly the whole training step. Gradient fusion addresses only 5% of total memory reads.
| Buffer | Floats Read | % of Total | Reuse Distance | Changed? |
|---|---|---|---|---|
| W1 | 6,000 | 32% | ~15,000 | No |
| b1 | 2,000 | 11% | - | No |
| dW2 | 1,001 | 5% | 16,005 -> 3,002 | Yes |
| db2 | 1,001 | 5% | 18,005 -> 5,002 | Yes |
The improved buffers (dW2 and db2) contribute only 1,001 floats each out of 19,013 total -- about 5% apiece of the weighted sum.
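The reuse-distance metric behind the table can be sketched as follows. The access trace below is hypothetical (it does not reproduce the table's exact numbers); it only illustrates how fusing the optimizer update into the backward pass collapses dW2's reuse distance by consuming the gradient right after it is produced.

```python
from collections import defaultdict

def reuse_distances(trace):
    """Distance between consecutive reads of each buffer, measured in
    floats read in between (the metric the table appears to use)."""
    last_pos = {}
    pos = 0
    dists = defaultdict(list)
    for buf, nfloats in trace:
        if buf in last_pos:
            dists[buf].append(pos - last_pos[buf])
        last_pos[buf] = pos
        pos += nfloats
    return dict(dists)

# Hypothetical trace for one training step, unfused: dW2 is produced in
# the backward pass, then re-read much later by the optimizer update.
unfused = [("W1", 6000), ("b1", 2000), ("W2", 1000), ("act", 9000),
           ("dW2", 1001), ("db2", 1001), ("dW1", 6000), ("dW2", 1001)]
# Fused: the update consumes dW2 immediately after it is written.
fused = [("W1", 6000), ("b1", 2000), ("W2", 1000), ("act", 9000),
         ("dW2", 1001), ("dW2", 1001), ("db2", 1001), ("dW1", 6000)]

print(reuse_distances(unfused)["dW2"])  # large gap
print(reuse_distances(fused)["dW2"])    # gap shrinks to adjacent reads
```

The point the table makes is visible here: fusion shrinks the gradient buffers' reuse distances, but W1 and b1 (the bulk of the reads) are untouched because their two reads bracket the whole step.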
Conclusion
Gradient fusion captured the easy wins (the gradient buffers), but the bottleneck is W1 and b1 being read during the forward pass and again at the end of the backward pass. Fixing that requires one of:
- Per-layer forward-backward -- compute Layer 1's backward pass and update before Layer 2's forward pass (this changes the math)
- Forward-Forward algorithm -- no backward pass at all
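The ordering change behind the per-layer option can be shown schematically. This is a sketch of access order only, not a training implementation; the layer-local loss signal that would make the early backward step well-defined is exactly the part that "changes the math".

```python
# Schematic comparison of memory-access order (not real training code).
# Each entry records which phase touches which layer's parameters; the
# point is that per-layer scheduling re-touches W1 immediately,
# collapsing its reuse distance.

def standard_backprop(layers):
    order = []
    for name in layers:                  # full forward pass...
        order.append(("forward", name))
    for name in reversed(layers):        # ...then full backward pass
        order.append(("backward", name))
    return order

def per_layer(layers):
    order = []
    for name in layers:                  # backward/update follows each
        order.append(("forward", name))  # layer's forward immediately,
        order.append(("backward", name)) # driven by a *local* loss
    return order

layers = ["W1", "W2"]
print(standard_backprop(layers))
print(per_layer(layers))
```

Forward-Forward goes further by dropping the backward phase entirely, replacing it with a second forward pass and a layer-local objective, so no parameter buffer has to survive across the whole step.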
Artifacts
- Code: cybertronai/sutro
- Benchmark: sparse_parity_benchmark.py
- Colab: fast version
- Gemini sessions: runtime estimation, ARD brainstorming, ARD discussion