
Cross-references

Source: Google Doc · Meeting #8 summary
Related: Sprint 1 · The Bigger Picture

Take-aways

The cache lemma means an LRU cache is within a 2x factor of the optimal cache. Because the optimal caching strategy is hard to compute, just assume an LRU cache for everything. Continuous LRU caches and Ding's data movement complexity are promising ([paper])

ARD is likely a sufficiently good metric: it's impossible to drive ARD to zero without improving the energy efficiency of the algorithm. Data Movement Cost is likely a better metric, by virtue of being established in the literature and corresponding directly to the energy cost of a specific kind of physical 2D memory layout.

Notes

Motivation: We have used average reuse distance before, but I had some doubts about how interesting it is. I want to avoid metrics that, when optimized, could produce an algorithm with absolutely no relevance either to the physical world or to the scientific literature. I went on a fishing expedition to remind myself how energy works.

Now we have a hierarchy of caches, but there is a great simplification known as the square root rule. Basically, the per-access cost of a cache, both in terms of latency and in terms of energy, can be approximated as the square root of the size of the cache. The reason for this can be traced down to basic physics: if you imagine a cache of size k laid out on a two-dimensional grid, the length of the wire needed to reach a cell in it grows like the square root of k.
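A quick sketch of the square-root rule (the constant factor is made up; only the scaling matters):

```python
import math

def access_cost(cache_size_bytes: float) -> float:
    """Toy model of per-access cost (latency or energy) for a cache of
    a given size laid out on a 2D grid: wire length scales with the
    side of the square, i.e. sqrt(size). Constant factor is arbitrary."""
    return math.sqrt(cache_size_bytes)

# Doubling capacity raises per-access cost by only sqrt(2) ~ 1.41x:
small, big = access_cost(32 * 1024), access_cost(64 * 1024)
print(big / small)
```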

The second great simplification is that the least recently used (LRU) cache is within a factor of 2 of the optimal cache. Imagine you're using some piece of data and then you hit a capacity conflict; you might evict a piece of data that is about to be used again. An optimal cache can see the future and would just say, "Hey, don't evict it, because we'll use it again." (More precisely, LRU with a cache of size M incurs at most twice the misses of the optimal clairvoyant policy with a cache of size M/2.) Because of this simplification we can just assume an LRU cache.
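The comparison is easy to see in a toy simulation: LRU versus Belady's clairvoyant optimal policy on the same trace (a minimal reference implementation, not tuned for speed):

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count misses for an LRU cache with `capacity` lines."""
    cache, misses = OrderedDict(), 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)          # refresh recency on a hit
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[x] = None
    return misses

def belady_misses(trace, capacity):
    """Count misses for Belady's optimal policy: on a capacity
    conflict, evict the line whose next use is farthest in the future."""
    cache, misses = set(), 0
    for i, x in enumerate(trace):
        if x in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(y):
                for j in range(i + 1, len(trace)):
                    if trace[j] == y:
                        return j
                return float('inf')       # never used again: best victim
            cache.remove(max(cache, key=next_use))
        cache.add(x)
    return misses

trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3] * 3
print(lru_misses(trace, 3), belady_misses(trace, 3))
```

The optimal count is never higher than the LRU count; the lemma bounds how much lower it can be.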

With these two simplifications we can use a much simpler heuristic. It is interesting because:

  1. Conceptually, an algorithm that optimizes this could be built on chip. We just use wires laid out as concentric rings and circles, and their length is the energy cost that we will have.

  2. It connects to the existing literature of Ding.

-----

Goal is to gather all the relevant background to double-check the right proxy metric. I used Average Reuse Distance; check whether it's a good metric. [notability]

The ultimate metric is energy efficiency on a common GPU like the H100. However, iterating on an actual GPU is too slow, so we need to find a simpler proxy: understand the cache issue better and iterate on various proxy metrics.

Research GPU availabilities and costs: [https://chatgpt.com/share/69ac8b16-3e54-8011-8b4c-9b76a357da9e]

  • morning: iterate on understanding of proxies in WebGPU mode ([gemini])

  • Summarize main concepts in \"energy proxy metrics\" [notability]

- Kinetic locality: reflects the energy needed to keep memory resident between instructions. Probably negligible, so ignore it for now.

- Ultimate metric appears to be the reuse distance histogram, but that's not a single number.

- Average Reuse Distance remains important.

  • look into working set size and ARD in WebGPU mode ([gemini])
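For reference, both the reuse-distance histogram and ARD fall out of one pass over an access trace (a minimal sketch using the instructions-elapsed definition of reuse distance):

```python
from collections import Counter

def reuse_distances(trace):
    """Reuse distance of each access: number of accesses elapsed since
    the previous access to the same address (first touches are skipped).
    This is the 'instructions elapsed' variant, not stack distance."""
    last_seen, dists = {}, []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            dists.append(i - last_seen[addr])
        last_seen[addr] = i
    return dists

trace = ['a', 'b', 'a', 'c', 'b', 'a']
d = reuse_distances(trace)
print(Counter(d))        # the reuse-distance histogram
print(sum(d) / len(d))   # Average Reuse Distance (ARD), a single number
```

The histogram carries strictly more information than ARD, which is exactly why ARD may be too lossy as a proxy.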

Videos:

[Herb Sutter @ NWCPP: Machine Architecture: Things Your Programming Language Neve] ([slides])

[code::dive conference 2014 - Scott Meyers: Cpu Caches and Why You Care] ([slides])

[14. Caching and Cache-Efficient Algorithms]

Herb Sutter

  • Herb Sutter video: ground truth of memory wall

slides: [https://nwcpp.org/talks/2007/Machine_Architecture_-_NWCPP.pdf]

video: [Herb Sutter @ NWCPP: Machine Architecture: Things Your Programming Language Neve]

[notebook]

(99% of software complexity goes into hiding latency, due to the Memory Wall. Little's Law.)


[embedded image]

[https://youtu.be/xDKnMXtZKq8?si=atySjQ79DdWimyGn]

Scott Meyers: CPU Caches

[code::dive conference 2014 - Scott Meyers: Cpu Caches and Why You Care]

[embedded image]

Danfu TogetherAI kernels course

danfu tutorials - [together-kernels](https://drive.google.com/drive/folders/1tnqtk6J9xNEUJ9AR6wwn4Gx1_IBxSMqC)

12:34 -- Plan now is to look at the MIT course and think about the metric again.

6.172, Performance Engineering of Software Systems

[notebook] [course]

lecture 14

[https://www.youtube.com/watch?v=xDKnMXtZKq8&t=329s]

[https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/329bfc6e1808c375afa517feb3c4c273_MIT6_172F18_lec14.pdf]

The LRU lemma means we can assume the LRU model.

[embedded image]

lecture 15

[https://www.youtube.com/watch?v=xwE568oVQ1Y&t=96s]

[https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/cef17369f91d3140409f2be4ad9246a4_MIT6_172F18_lec15.pdf]

Log

12:58 -- How about checking the energy costs in Bill Dally's work [notebook]

13:04 -- Bill Dally energy research [report]

13:04 -- MIT course talks about cache lines, but modern GPUs use thread memory coalescing. (pnpp [notebook])

TL;DR: the LRU replacement policy is within 2x of optimal

[embedded image]

Stephen Jones: How GPU computing works

[Notebook]

[https://www.youtube.com/watch?v=3l10o0DYJXg]

Calculates how many threads are needed to keep the cores utilized, based on latency and bandwidth numbers.

5x oversubscription for A100
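The same Little's Law arithmetic can be redone in a few lines. The numbers below are illustrative ballpark figures for an A100-class GPU, not exact specs:

```python
# Little's Law: concurrency (bytes in flight) = bandwidth * latency.
# Illustrative numbers only (not exact A100 specs):
bandwidth = 1.555e12   # HBM bandwidth, bytes/s
latency = 400e-9       # memory latency, s

bytes_in_flight = bandwidth * latency   # data that must be in flight
# If each thread keeps one 4-byte load in flight, the thread count
# needed to saturate the memory system is:
threads_needed = bytes_in_flight / 4
print(int(bytes_in_flight), int(threads_needed))
```

The point of the talk is that this number vastly exceeds the core count, hence the oversubscription.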

James Demmel, Communication-avoiding algorithms

[notebook]

[https://simons.berkeley.edu/sites/default/files/docs/827/demmelslides.pdf]

[James Demmel: Communication-Avoiding Algorithms for Linear Algebra, Machine Learning and Beyond]

[Communication Avoiding Algorithms for Linear Algebra and Beyond]

[embedded image]

Research communication-avoiding algorithms for learning

[https://chatgpt.com/c/69acbb1e-4e74-8322-a0fb-349e5961cf40]

Ranking heuristics for LRU cache reuse

16:22 Ask about other heuristics [gemini] [notebook]

Must rely on bytes touched (stack distance) rather than instructions elapsed (reuse distance).

  • Cache Complexity (requires cache-size B)

[embedded image]

  • Working Set Size (requires execution window T)

  • Reuse Distance Profile (a histogram of reuse distances)

  • Logarithmic Area Under the Curve (Log-AUC)

  • The Harmonic Mean of Stack Distance
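The distinction between the two distance notions is easiest to see in code (a simple O(n²) reference version; production implementations use a tree):

```python
def stack_distances(trace):
    """Stack distance of each access: number of *distinct* addresses
    touched since the previous access to the same address. Unlike reuse
    distance (which counts elapsed accesses), this equals the minimum
    LRU capacity at which the access would hit. Cold misses -> inf."""
    dists = []
    stack = []  # LRU stack: most recently used address at the end
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            dists.append(len(stack) - 1 - i)  # distinct addrs above it
            stack.pop(i)
        else:
            dists.append(float('inf'))        # cold (compulsory) miss
        stack.append(addr)
    return dists

# 'b' reused immediately: stack distance 0, even though its
# reuse distance (accesses elapsed) would be 1.
print(stack_distances(['a', 'b', 'b', 'a']))
```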

Follow-ups to Demmel

[https://chatgpt.com/c/69acc320-d2d8-8322-be45-2b5d6dc71997], ([shared]) [gemini] ([shared])

  • Average Reuse Distance (count instructions between accesses)

  • Stack Distance

[embedded image]

Smooth hardware-agnostic heuristic -- [https://chatgpt.com/c/69accbbc-c260-8321-9078-a8d31d0da609]

Energy roofline (arch line) -- [https://perso.ens-lyon.fr/christophe.alias/evalM2/choi2013-archline-ipdps.pdf]

Follow-up paper - [https://arxiv.org/abs/2509.20189]

Analyze the paper and consider relevant heuristics -- [https://chatgpt.com/c/69acce0b-97ac-8324-b82e-31af148de335]

Sunday 9am, energy costs: 32-byte minimum, but a 2 KB request is needed to amortize the row activation energy (Bill Dally) [notebook]

Ding: Data movement complexity

DMC4ML: Data Movement Complexity for Machine Learning

[https://notebooklm.google.com/notebook/78b74c92-0a68-43c8-8f75-e1a480d75983]

implementing continuous LRU

[gemini]

[deepthink] -- python implementation ([shared])

[chatgpt]
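Putting the two simplifications together, a toy data-movement-cost estimate in the spirit of Ding's DMC. The cost function here is an assumption (sqrt of stack distance per access, following the 2D square-root rule above); the paper's exact definition may differ:

```python
import math

def data_movement_cost(trace):
    """Toy data-movement cost: charge each access sqrt(d + 1), where d
    is its stack distance, reflecting the 2D-layout square-root rule.
    Cold misses are charged as if touching the whole footprint so far.
    This is a sketch, not the paper's exact cost function."""
    stack, total = [], 0.0
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            d = len(stack) - 1 - i   # distinct addresses since last use
            stack.pop(i)
        else:
            d = len(stack)           # cold miss: whole footprint so far
        total += math.sqrt(d + 1)
        stack.append(addr)
    return total

# Good locality (immediate reuse) is cheaper than poor locality:
print(data_movement_cost(['a', 'a', 'b', 'b']),
      data_movement_cost(['a', 'b', 'a', 'b']))
```

A continuous LRU would replace the discrete stack with a smooth model, but the cost shape stays the same.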