Cross-references

  • Source: Google Doc · Meeting #8 summary

  • Key concept: 3-axis cube (process, metric, problem)

  • Final exam: energy-efficient nanoGPT training

  • Related: Challenge #1 · Yaroslav Knowledge Sprint 2

Yaroslav Sutro planning sprint #1

Objective

Solve the energy problem of AI. Similar mission to [https://ml.energy/].

Method

Blank slate rewrite. Draw on [experience] to prevent mistakes.

Thoughts

([planning-sprint1.txt], [planning-sprint2.txt], [plannint-spring3.txt], [gemini])

Mistakes

  1. Too simple of an objective

  2. Suboptimization

  3. Risk mismanagement

  4. Building for the wrong future

[embedded image]

Wrong objective

Real-life objectives are hard to measure, so we substitute simpler versions. If the substitute is "too simple" we get the Surrogate Endpoint Fallacy. Example: Google's radiology failure (MIT [review], Andrew Ng's [thebatch]). Other examples ([chatgpt], [deepthink]).

Unlike radiology, we don't need clinical trials, so it's easier to test the real objective periodically. Energy dissipates as heat (a computer has ~100% heating efficiency [experiment]), so we can estimate energy usage by checking whether the device is warm to the touch. Phones and laptops have GPUs, so a sanity check could be to compare two algorithms on a consumer device and check the temperature manually.
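A slightly more precise version of the same sanity check is to sample instantaneous power draw during a run and integrate it into an energy estimate. A minimal sketch, assuming some `read_power_watts` callable is available (on NVIDIA GPUs this could wrap NVML's `nvmlDeviceGetPowerUsage`, which reports milliwatts; here it is left as a stub):

```python
import time

def estimate_energy_joules(read_power_watts, duration_s, interval_s=0.5):
    """Integrate sampled power (W) over time to estimate energy (J).

    read_power_watts: callable returning instantaneous power draw in watts
    (e.g., a wrapper around NVML's nvmlDeviceGetPowerUsage on NVIDIA GPUs,
    or a battery-discharge reading on a laptop).
    """
    samples = []
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        samples.append(read_power_watts())
        time.sleep(interval_s)
    # Rectangle rule: mean sampled power times elapsed time.
    elapsed = time.monotonic() - start
    return sum(samples) / len(samples) * elapsed
```

Running this around two algorithms on the same device gives a rough energy comparison without any lab equipment, which matches the spirit of the warm-to-the-touch test.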

Risk management

Going too far without positive feedback risks losing trust or [sanity].

Vincent's post ([medium], [archive]); Terry Tao on mathematical [disease].

Suboptimization

Once you partition the problem space, optimizing parts can either get stuck or make the overall objective worse.

Examples:

Train / test (mixture of experts made training more efficient, inference less efficient).

Math / kernels design.

Hardware lottery (Sara [hooker]); machine learning systems are [stuck in a rut] (Paul Barham, Michael Isard).

To avoid this, remove as many boundaries as possible. At some point the problem becomes too big to make progress on, and we are forced to partition. If we are motivated by keeping the complexity manageable, we can periodically re-adjust the partitioning, in contrast to the current paradigm, which keeps the problem partitioning fixed:

  • math people vs kernels people

  • training improvement vs inference improvements

  • optimizer design vs architecture design

Building for the wrong future

Building software for the wrong hardware can take a while to fix. Python still uses the GIL, 20 years after multiple cores were introduced. The B-tree to LSM-tree transition was delayed by 10 years after the HDD-to-SSD transition ([notebook]).

The field of deep learning settled on its core algorithms in the 1980s, when we had one thread and computation was the primary bottleneck. Today, the bottleneck is memory.

Designing algorithms that optimize memory energy will work for today's hardware, but they could become obsolete if a different paradigm takes over.

  • The memory wall trend makes this unlikely.

  • Can be mitigated by focusing on the process used to obtain the algorithm rather than the algorithm itself.

Checking for "signs of life"

Demonstrate energy-efficient training of Karpathy's [nanoGPT]. To be convincing, the result must improve energy usage without sacrificing other important metrics, such as wall-clock time and accuracy.

This can be viewed as a final exam: it must be passed in order to proceed to "the next class".
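The pass criterion amounts to a Pareto-improvement check: energy strictly improves, nothing else gets worse. A minimal sketch, with illustrative metric names (`energy_j`, `wall_clock_s`, `val_loss` are assumptions, not fixed by the source):

```python
def is_convincing(baseline, candidate):
    """Check that a candidate run improves energy without sacrificing
    the other metrics (a Pareto improvement over the baseline).

    Each run is a dict with keys 'energy_j', 'wall_clock_s', 'val_loss';
    lower is better for all three.
    """
    no_worse = all(candidate[k] <= baseline[k]
                   for k in ('energy_j', 'wall_clock_s', 'val_loss'))
    strictly_better = candidate['energy_j'] < baseline['energy_j']
    return no_worse and strictly_better
```

A run that trades accuracy or wall-clock time for energy would fail this check, which is the point: the exam forbids passing by surrogate-metric gaming.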

Method

Get from A to B in the following diagram

[embedded image]

This diagram is a cube with three orthogonal directions:

  1. Orange: Improve the process (optimizing given metric for a given problem)

  2. Green: Make the metric more realistic.

  3. Blue: Make the problem more realistic.

To keep complexity manageable, take small steps along one axis at a time.
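The one-axis-at-a-time rule can be made concrete by treating the cube position as a 3-tuple and only ever incrementing a single coordinate. A small sketch (field names are illustrative):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CubePosition:
    """Position in the 3-axis cube.

    process: how refined the optimization process is (orange axis)
    metric:  how realistic the metric is (green axis)
    problem: how realistic the problem is (blue axis)
    """
    process: int = 0
    metric: int = 0
    problem: int = 0

def step(pos, axis):
    """Take one small step along a single axis, keeping the others fixed."""
    return replace(pos, **{axis: getattr(pos, axis) + 1})
```

Getting from A to B is then a sequence of such single-axis steps, and each intermediate position is a checkpoint where progress can be verified before moving on.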

Examples

  1. The Process

Yaroslav: [Yaroslav Sutro technical sprint #1 02mar26], [Yaroslav Verification Sprint #1]

Yad: [Setup Sutro Yaro with Claude Code | Sparse Parity] [survey] [github]

Michael: [sutro_challenge_3_sparce parity results.docx]

  2. The Metric [Yaroslav Knowledge sprint #1]

  3. The Problem [sutro group challenge #1: sparse parity]