
Task 11: DeepSeek Engram Offload — ByteDMD Verification

Priority: MEDIUM Status: OPEN Agent: unassigned Source: Issue #77 (observation by Andy via Qwen, experiment plan by Yad)

Context

The DeepSeek Engram paper ("Conditional Memory via Scalable Lookup") claims that 100B parameters can be offloaded to CPU/SSD with <3% inference overhead. From a ByteDMD perspective this claim is suspicious:

  • SSD→GPU bandwidth is roughly 60-190× lower than GPU HBM bandwidth (~16 GB/s vs ~1-3 TB/s)
  • Under ByteDMD, SSD reads live at stack depth ~millions vs ~thousands for HBM
  • Per-byte cost ratio: ceil(sqrt(1e6)) / ceil(sqrt(1e3)) = 1000/32 ≈ 31× (sanity-checked in the sketch after this list)
  • If the "3%" is wall-time (async prefetching hiding latency) rather than energy, ByteDMD should expose that
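That 31× figure is cheap to verify. A minimal sketch, assuming ByteDMD charges ceil(sqrt(stack depth)) per byte read (the sqrt(depth) proxy named under "Out of scope"):

    import math

    def per_byte_cost(depth: int) -> int:
        """Assumed ByteDMD per-byte read cost: ceil(sqrt(stack depth))."""
        return math.ceil(math.sqrt(depth))

    hbm_depth = 1_000        # resident weights: depth ~thousands
    ssd_depth = 1_000_000    # offloaded bank: depth ~millions

    ratio = per_byte_cost(ssd_depth) / per_byte_cost(hbm_depth)
    print(f"HBM {per_byte_cost(hbm_depth)}, SSD {per_byte_cost(ssd_depth)}, "
          f"ratio ≈ {ratio:.2f}x")   # HBM 32, SSD 1000, ratio ≈ 31.25x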

Relevance to SutroYaro: this is exactly the "wall-time ≠ energy" confusion that motivates the ByteDMD metric. Either outcome of this task produces valuable output:

  • Claim validated → heuristic genuinely reduces deep reads; document the pattern as a positive case study
  • Claim is wall-time gaming → document as the canonical case study for why SutroYaro uses ByteDMD and not throughput

Paper: https://deepseek.ai/blog/deepseek-engram-v4-architecture

Tasks

Phase 1 — Paper extraction (prerequisite)

  • Read https://deepseek.ai/blog/deepseek-engram-v4-architecture carefully
  • Extract whether "overhead" is wall-time, FLOPs, or energy
  • Extract concrete m:M ratio (resident : offloaded set sizes)
  • Extract lookup frequency p (fraction of tokens that hit the offloaded set)
  • Document the prefetch heuristic (predicted-hot entries? random? LRU? model-guided?)
  • Write to docs/findings/engram-offload-paper-read.md

Without these specifics, Phase 2 models a generic offload pattern, not Engram specifically.
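Why m:M and p matter: together with the tier depths they fix the expected per-byte cost of a forward pass. A minimal sketch under the same sqrt-depth proxy; the depths and p values below are placeholders until Phase 1 extracts the real numbers:

    import math

    def expected_cost_per_byte(p: float, resident_depth: int, offload_depth: int) -> float:
        """Expected ByteDMD cost per byte when a fraction p of lookups hits
        the offloaded tier (sqrt-depth proxy, not a measurement)."""
        resident = math.ceil(math.sqrt(resident_depth))
        offloaded = math.ceil(math.sqrt(offload_depth))
        return (1 - p) * resident + p * offloaded

    baseline = expected_cost_per_byte(0.0, 1_000, 1_000_000)   # all-resident
    for p in (0.01, 0.05, 0.20):                               # placeholder values
        c = expected_cost_per_byte(p, 1_000, 1_000_000)
        print(f"p={p:.2f}: cost/byte ≈ {c:.1f} ({c / baseline:.1f}x baseline)")

Even modest p moves the result across the decision bands defined below (p=0.01 gives ~1.3×, p=0.20 gives ~7× under these placeholder depths), which is why the extracted p is a prerequisite.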

Phase 2 — ByteDMD microbenchmark

  • Create experiments/engram-offload/model.py — a toy attention block (d_model=64, seq_len=128) with a KV memory bank split into resident and offloaded tiers. Pure Python, so ByteDMD traces every read. (Sketches of both files follow this list.)
  • Create experiments/engram-offload/run.py with three variants of identical compute:
      ◦ resident: all weights at small stack depth (baseline)
      ◦ offload_naive: offloaded bank at depth ~1e6; every lookup pays ceil(sqrt(1e6)) = 1000 per byte
      ◦ offload_prefetch: the paper's heuristic (promote predicted-hot entries to the top each step)
  • Run bytedmd(forward, (tokens, kv_bank)) for each variant. Record cost, cost per token, and the trace distribution
  • Wall-time control: time the same three variants on CPU with time.perf_counter() to reproduce the "<3%" wall-time claim locally
  • Write findings to docs/findings/engram-offload-bytedmd.md following findings/_template.md
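A starting-point sketch of the two files. Everything here is a proposal: the TieredKVBank layout, the promote() heuristic, and the WINDOW trick for keeping compute identical are assumptions; only the bytedmd(forward, ...) call form comes from this task description.

    # experiments/engram-offload/model.py: proposed sketch, pure Python so
    # ByteDMD can trace every read. TieredKVBank is an assumed modeling of
    # the resident/offloaded split, not the paper's data structure.
    import math
    import random

    D_MODEL, SEQ_LEN = 64, 128
    WINDOW = 32   # keys read per lookup, identical across all three variants

    class TieredKVBank:
        """KV bank split into a small resident tier and a large offloaded tier."""
        def __init__(self, resident_entries, offloaded_entries, seed=0):
            rng = random.Random(seed)
            def vec():
                return [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
            self.resident = [vec() for _ in range(resident_entries)]
            self.offloaded = [vec() for _ in range(offloaded_entries)]

        def promote(self, idx):
            """Prefetch hook: copy a predicted-hot offloaded entry into the
            resident tier (stand-in for the paper's heuristic)."""
            idx %= len(self.offloaded)
            self.resident[idx % len(self.resident)] = self.offloaded[idx]

    def attention(query, keys):
        """One toy softmax-attention read over a window of key vectors."""
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(D_MODEL)
                  for key in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def forward(tokens, kv_bank, variant="resident"):
        """SEQ_LEN lookups; `variant` selects which tier each lookup reads."""
        out = []
        for t, query in enumerate(tokens):
            base = (t * WINDOW) % len(kv_bank.offloaded)
            if variant == "resident":
                keys = kv_bank.resident[:WINDOW]              # shallow reads only
            elif variant == "offload_naive":
                keys = kv_bank.offloaded[base:base + WINDOW]  # deep reads each step
            else:  # "offload_prefetch"
                for i in range(WINDOW):                       # promote predicted-hot
                    kv_bank.promote(base + i)                 # entries to the top...
                keys = kv_bank.resident[:WINDOW]              # ...then read shallow
            out.append(attention(query, keys))
        return out

And run.py, assuming bytedmd returns a report exposing a total cost (the import path and the .cost attribute are guesses; adjust to the real API):

    # experiments/engram-offload/run.py: proposed sketch.
    import random
    import time

    from bytedmd import bytedmd   # assumed import path
    from model import D_MODEL, SEQ_LEN, TieredKVBank, forward

    rng = random.Random(1)
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(SEQ_LEN)]

    for variant in ("resident", "offload_naive", "offload_prefetch"):
        bank = TieredKVBank(resident_entries=32, offloaded_entries=4_096)
        report = bytedmd(forward, (tokens, bank, variant))   # call form from this task
        print(f"{variant}: cost={report.cost} cost/token={report.cost / SEQ_LEN:.1f}")

        # Wall-time control: same compute, wall clock only, for the "<3%" comparison.
        t0 = time.perf_counter()
        forward(tokens, bank, variant)
        print(f"{variant}: wall-time {time.perf_counter() - t0:.4f}s")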

Decision thresholds

  • Go (claim is real): ByteDMD overhead of offload_prefetch vs resident is 2-5×. Heuristic genuinely reduces deep reads.
  • No-go (wall-time gaming): ByteDMD overhead >20× while wall-time overhead stays <3%. Write up as case study.
  • Ambiguous: 5-20×. Re-run with larger M and varying p (sweep sketched below); the bandwidth-hiding advantage should shrink as the working set grows.
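The M/p sweep can be previewed analytically before re-running the benchmark. A sketch reusing the sqrt-depth expectation; treating the offloaded tier's depth as ~M is itself an assumption:

    import math

    def predicted_overhead(M, p, resident_depth=1_000):
        """Modeled overhead vs the all-resident baseline, assuming offloaded
        entries sit at depth ~M (sqrt-depth proxy, not a measurement)."""
        resident = math.ceil(math.sqrt(resident_depth))
        offloaded = math.ceil(math.sqrt(M))
        return ((1 - p) * resident + p * offloaded) / resident

    for M in (4_096, 65_536, 1_048_576):
        for p in (0.01, 0.05, 0.20):
            print(f"M={M:>9,} p={p:.2f}: "
                  f"predicted overhead {predicted_overhead(M, p):.1f}x")

If the measured offload_prefetch overhead tracks this growth in M, the heuristic is hiding latency rather than avoiding deep reads.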

Out of scope

  • Re-training the 100B model. This task tests the metric, not the paper.
  • PCIe energy models. ByteDMD's sqrt(depth) is the proxy.
