
Cross-references

Source: Google Doc · Meeting #8 summary
Related: Sprint 1 · The Bigger Picture

Take-aways

The cache lemma means an LRU cache is within a 2x factor of the optimal cache. Because the optimal caching strategy is hard to compute, just assume an LRU cache for everything. Continuous LRU caches and Ding's data movement complexity are promising ([paper])

ARD is likely a sufficiently good metric: it's impossible to drive ARD to zero without improving the energy efficiency of the algorithm. Data Movement Cost is likely a better metric, by virtue of being established in the literature and corresponding directly to the energy cost of a specific kind of physical 2D memory layout.

Notes

Motivation: We have used average reuse distance before, but I had some doubts about how interesting it is. I want to avoid metrics that, when optimized, could produce an algorithm with absolutely no relevance either to the physical world or to the scientific literature. I went on a fishing expedition to remind myself how energy works.

Now we have a hierarchy of caches, but there is a great simplification known as the square root rule. Basically, the per-access cost of a cache, both in terms of latency and in terms of energy, can be approximated as the square root of the size of the cache. The reason for this can be traced down to basic physics: if you imagine a cache of size k laid out on a two-dimensional grid, the length of the wire needed to reach a cell in it grows like the square root of k.
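A quick sketch of the square-root rule (the constant factor is made up; only the scaling matters):

```python
import math

def access_cost(cache_size_bytes: float) -> float:
    """Toy model of per-access cost (latency or energy) for a cache of
    a given size laid out on a 2D grid: wire length scales with the
    side of the square, i.e. sqrt(size). Constant factor is arbitrary."""
    return math.sqrt(cache_size_bytes)

# Doubling capacity raises per-access cost by only sqrt(2) ~ 1.41x:
small, big = access_cost(32 * 1024), access_cost(64 * 1024)
print(big / small)
```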

The second great simplification is that the least recently used (LRU) cache is within a factor of 2 of the optimal cache. Imagine you're using some piece of data and then you hit a capacity conflict; you might evict a piece of data that is about to be used again. An optimal cache can see the future and would just say, "Hey, don't evict it, because we'll use it again." (More precisely, LRU with a cache of size M incurs at most twice the misses of the optimal clairvoyant policy with a cache of size M/2.) Because of this simplification we can just assume an LRU cache.
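The comparison is easy to see in a toy simulation: LRU versus Belady's clairvoyant optimal policy on the same trace (a minimal reference implementation, not tuned for speed):

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count misses for an LRU cache with `capacity` lines."""
    cache, misses = OrderedDict(), 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)          # refresh recency on a hit
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[x] = None
    return misses

def belady_misses(trace, capacity):
    """Count misses for Belady's optimal policy: on a capacity
    conflict, evict the line whose next use is farthest in the future."""
    cache, misses = set(), 0
    for i, x in enumerate(trace):
        if x in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(y):
                for j in range(i + 1, len(trace)):
                    if trace[j] == y:
                        return j
                return float('inf')       # never used again: best victim
            cache.remove(max(cache, key=next_use))
        cache.add(x)
    return misses

trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3] * 3
print(lru_misses(trace, 3), belady_misses(trace, 3))
```

The optimal count is never higher than the LRU count; the lemma bounds how much lower it can be.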

With these two simplifications we can use a much simpler heuristic. It is interesting because:

  1. Conceptually, an algorithm that optimizes this could be built on chip. We just use wires laid out as concentric rings and circles, and their length is the energy cost that we will have.

  2. It connects to the existing literature of Ding.

-----

Goal is to gather all the relevant background to double-check the right proxy metric. I used Average Reuse Distance; check whether it's a good metric. [notability]

The ultimate metric is energy efficiency on a common GPU like the H100. However, iterating on an actual GPU is too slow, so we need to find a simpler proxy: understand the cache issue better and iterate on various proxy metrics.

Research GPU availabilities and costs: [https://chatgpt.com/share/69ac8b16-3e54-8011-8b4c-9b76a357da9e]

  • morning: iterate on understanding of proxies in WebGPU mode ([gemini])

  • Summarize main concepts in \"energy proxy metrics\" [notability]

- Kinetic locality: reflects the energy needed to keep memory resident between instructions. Probably negligible, so ignore it for now.

- Ultimate metric appears to be the reuse distance histogram, but that's not a single number.

- Average Reuse Distance remains important.

  • look into working set size and ARD in WebGPU mode ([gemini])
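For reference, both the reuse-distance histogram and ARD fall out of one pass over an access trace (a minimal sketch using the instructions-elapsed definition of reuse distance):

```python
from collections import Counter

def reuse_distances(trace):
    """Reuse distance of each access: number of accesses elapsed since
    the previous access to the same address (first touches are skipped).
    This is the 'instructions elapsed' variant, not stack distance."""
    last_seen, dists = {}, []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            dists.append(i - last_seen[addr])
        last_seen[addr] = i
    return dists

trace = ['a', 'b', 'a', 'c', 'b', 'a']
d = reuse_distances(trace)
print(Counter(d))        # the reuse-distance histogram
print(sum(d) / len(d))   # Average Reuse Distance (ARD), a single number
```

The histogram carries strictly more information than ARD, which is exactly why ARD may be too lossy as a proxy.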

Videos:

[Herb Sutter @ NWCPP: Machine Architecture: Things Your Programming Language Neve] ([slides])

[code::dive conference 2014 - Scott Meyers: Cpu Caches and Why You Care] ([slides])

[14. Caching and Cache-Efficient Algorithms]

Herb Sutter

  • Herb Sutter video: ground truth of memory wall

slides: [https://nwcpp.org/talks/2007/Machine_Architecture_-_NWCPP.pdf]

video: [Herb Sutter @ NWCPP: Machine Architecture: Things Your Programming Language Neve]

[notebook]

(99% of software complexity goes into hiding latency, due to the Memory Wall. Little's Law.)


[embedded image]

[https://youtu.be/xDKnMXtZKq8?si=atySjQ79DdWimyGn]

Scott Meyers: CPU Caches

[code::dive conference 2014 - Scott Meyers: Cpu Caches and Why You Care]

[embedded image]

Danfu TogetherAI kernels course

danfu tutorials - [together-kernels](https://drive.google.com/drive/folders/1tnqtk6J9xNEUJ9AR6wwn4Gx1_IBxSMqC)

12:34 -- Plan now is to look at the MIT course and think about the metric again.

6.172, Performance Engineering of Software Systems

[notebook] [course]

lecture 14

[https://www.youtube.com/watch?v=xDKnMXtZKq8&t=329s]

[https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/329bfc6e1808c375afa517feb3c4c273_MIT6_172F18_lec14.pdf]

The LRU lemma means we can assume the LRU model.

[embedded image]

lecture 15

[https://www.youtube.com/watch?v=xwE568oVQ1Y&t=96s]

[https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/cef17369f91d3140409f2be4ad9246a4_MIT6_172F18_lec15.pdf]

Log

12:58 -- How about checking the energy costs in Bill Dally's work [notebook]

13:04 -- Bill Dally energy research [report]

13:04 -- MIT course talks about cache lines, but modern GPUs use thread memory coalescing. (pnpp [notebook])

TL;DR: the LRU replacement policy is within 2x of optimal

[embedded image]

Stephen Jones: How GPU computing works

[Notebook]

[https://www.youtube.com/watch?v=3l10o0DYJXg]

Calculates how many threads are needed to keep the cores utilized, based on latency and bandwidth numbers.

5x oversubscription for A100
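The same Little's Law arithmetic can be redone in a few lines. The numbers below are illustrative ballpark figures for an A100-class GPU, not exact specs:

```python
# Little's Law: concurrency (bytes in flight) = bandwidth * latency.
# Illustrative numbers only (not exact A100 specs):
bandwidth = 1.555e12   # HBM bandwidth, bytes/s
latency = 400e-9       # memory latency, s

bytes_in_flight = bandwidth * latency   # data that must be in flight
# If each thread keeps one 4-byte load in flight, the thread count
# needed to saturate the memory system is:
threads_needed = bytes_in_flight / 4
print(int(bytes_in_flight), int(threads_needed))
```

The point of the talk is that this number vastly exceeds the core count, hence the oversubscription.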

James Demmel, Communication-avoiding algorithms

[notebook]

[https://simons.berkeley.edu/sites/default/files/docs/827/demmelslides.pdf]

[James Demmel: Communication-Avoiding Algorithms for Linear Algebra, Machine Learning and Beyond]

[Communication Avoiding Algorithms for Linear Algebra and Beyond]

[embedded image]

Research communication-avoiding algorithms for learning

[https://chatgpt.com/c/69acbb1e-4e74-8322-a0fb-349e5961cf40]

Ranking heuristics for LRU cache reuse

16:22 Ask about other heuristics [gemini] [notebook]

Must rely on bytes touched (stack distance) rather than instructions elapsed (reuse distance).

  • Cache Complexity (requires cache-size B)

[embedded image]

  • Working Set Size (requires execution window T)

  • Reuse Distance Profile (a histogram of reuse distances)

  • Logarithmic Area Under the Curve (Log-AUC)

  • The Harmonic Mean of Stack Distance
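The distinction between the two distance notions is easiest to see in code (a simple O(n²) reference version; production implementations use a tree):

```python
def stack_distances(trace):
    """Stack distance of each access: number of *distinct* addresses
    touched since the previous access to the same address. Unlike reuse
    distance (which counts elapsed accesses), this equals the minimum
    LRU capacity at which the access would hit. Cold misses -> inf."""
    dists = []
    stack = []  # LRU stack: most recently used address at the end
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            dists.append(len(stack) - 1 - i)  # distinct addrs above it
            stack.pop(i)
        else:
            dists.append(float('inf'))        # cold (compulsory) miss
        stack.append(addr)
    return dists

# 'b' reused immediately: stack distance 0, even though its
# reuse distance (accesses elapsed) would be 1.
print(stack_distances(['a', 'b', 'b', 'a']))
```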

Follow-ups to Demmel

[https://chatgpt.com/c/69acc320-d2d8-8322-be45-2b5d6dc71997], ([shared]) [gemini] ([shared])

  • Average Reuse Distance (count instructions between accesses)

  • Stack Distance

[embedded image]

Smooth hardware-agnostic heuristic -- [https://chatgpt.com/c/69accbbc-c260-8321-9078-a8d31d0da609]

Energy roofline (arch line) -- [https://perso.ens-lyon.fr/christophe.alias/evalM2/choi2013-archline-ipdps.pdf]

Follow-up paper - [https://arxiv.org/abs/2509.20189]

Analyze the paper and consider relevant heuristics -- [https://chatgpt.com/c/69acce0b-97ac-8324-b82e-31af148de335]

Sunday 9am, energy costs: 32-byte minimum, but a 2 KB request is needed to amortize the row activation energy (Bill Dally) [notebook]

Ding: Data movement complexity

DMC4ML: Data Movement Complexity for Machine Learning

[https://notebooklm.google.com/notebook/78b74c92-0a68-43c8-8f75-e1a480d75983]

implementing continuous LRU

[gemini]

[deepthink] -- python implementation ([shared])

[chatgpt]
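Putting the two simplifications together, a toy data-movement-cost estimate in the spirit of Ding's DMC. The cost function here is an assumption (sqrt of stack distance per access, following the 2D square-root rule above); the paper's exact definition may differ:

```python
import math

def data_movement_cost(trace):
    """Toy data-movement cost: charge each access sqrt(d + 1), where d
    is its stack distance, reflecting the 2D-layout square-root rule.
    Cold misses are charged as if touching the whole footprint so far.
    This is a sketch, not the paper's exact cost function."""
    stack, total = [], 0.0
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            d = len(stack) - 1 - i   # distinct addresses since last use
            stack.pop(i)
        else:
            d = len(stack)           # cold miss: whole footprint so far
        total += math.sqrt(d + 1)
        stack.append(addr)
    return total

# Good locality (immediate reuse) is cheaper than poor locality:
print(data_movement_cost(['a', 'a', 'b', 'b']),
      data_movement_cost(['a', 'b', 'a', 'b']))
```

A continuous LRU would replace the discrete stack with a smooth model, but the cost shape stays the same.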