Skip to content

Sprint 1 Findings

Historical, Sprint 1 of 3

This was the first sprint. All candidate algorithms recommended here (Forward-Forward, per-layer updates, local learning) have since been implemented and tested. See Exp E (FF: 25x worse ARD), Exp C (per-layer: 3.8% improvement), and the changelog for the full story.

Date: 02 Mar 2026 | Duration: 2.5 hours

Summary

Claude identified a gradient fusion strategy that improves energy efficiency slightly. Cache reuse improved by 16%. A larger improvement requires a different learning algorithm.

Setup

  • 3-bit parity task, tiny neural network
  • Pure Python implementation (no PyTorch overhead)
  • <1 second total runtime constraint
  • Tools: Claude Code, Gemini, Colab

Process

flowchart LR
    A[Prototype 3-bit\nparity network] --> B[Make it fast\n< 1 second]
    B --> C[Add ARD\nmetrics]
    C --> D[Ask Claude to\nimprove ARD]
    D --> E[Gradient fusion\nidentified]
    E --> F[16% improvement\nbut bottleneck remains]

Finding: ARD Bottleneck in Backprop

The Bottleneck

Parameter tensors are read twice with the entire forward+backward pass in between. Gradient fusion only addresses 5% of total memory reads.

Buffer Floats Read % of Total Reuse Distance Changed?
W1 6,000 32% ~15,000 No
b1 2,000 11% - No
dW2 1,001 5% 16,005 -> 3,002 Yes
db2 1,001 5% 18,005 -> 5,002 Yes

The improved buffers (dW2, db2) contribute only 1,001 floats out of 19,013 total -- just 5% of the weighted sum.

Conclusion

Gradient fusion fixed the easy wins (gradient buffers), but the bottleneck is W1 and b1 being read in forward and again at the end of backward. To fix that, you need:

  1. Per-layer forward-backward -- compute Layer 1's backward and update before Layer 2's forward (changes the math)
  2. Forward-Forward algorithm -- no backward pass at all

Artifacts