# Task 11: DeepSeek Engram Offload — ByteDMD Verification
**Priority:** MEDIUM · **Status:** OPEN · **Agent:** unassigned · **Source:** Issue #77 (observation by Andy via Qwen; experiment plan by Yad)
## Context
The DeepSeek Engram paper ("Conditional Memory via Scalable Lookup") claims that 100B parameters can be offloaded to CPU/SSD with <3% inference overhead. From a ByteDMD perspective this claim is suspicious:
- SSD→GPU bandwidth is roughly 60-190× lower than GPU HBM (~16 GB/s vs ~1-3 TB/s)
- Under ByteDMD, SSD reads live at stack depth ~1e6 vs ~1e3 for HBM
- Per-byte cost ratio: `ceil(sqrt(1e6)) / ceil(sqrt(1e3))` ≈ 31× (worked out in the sketch below)
- If the "<3%" is wall-time (async prefetching hiding latency) rather than energy, ByteDMD should expose that
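A two-line check of that ratio under the `ceil(sqrt(depth))` per-byte proxy (a sketch; the depths are the order-of-magnitude figures from the bullets above, not Engram measurements):

```python
from math import ceil, sqrt

ssd_depth, hbm_depth = 1_000_000, 1_000  # ~1e6 (SSD) vs ~1e3 (HBM)
ratio = ceil(sqrt(ssd_depth)) / ceil(sqrt(hbm_depth))
print(ratio)  # 1000 / 32 = 31.25 -> ≈31× per byte
```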
Relevance to SutroYaro: this is exactly the "wall-time ≠ energy" confusion that motivates the ByteDMD metric. Either outcome of this task produces valuable output:
- Claim validated → heuristic genuinely reduces deep reads; document the pattern as a positive case study
- Claim is wall-time gaming → document as the canonical case study for why SutroYaro uses ByteDMD and not throughput
Paper: https://deepseek.ai/blog/deepseek-engram-v4-architecture
## Tasks

### Phase 1 — Paper extraction (prerequisite)
- Read https://deepseek.ai/blog/deepseek-engram-v4-architecture carefully
- Extract whether "overhead" is wall-time, FLOPs, or energy
- Extract the concrete `m:M` ratio (resident : offloaded set sizes)
- Extract the lookup frequency `p` (fraction of tokens that hit the offloaded set)
- Document the prefetch heuristic (predicted-hot entries? random? LRU? model-guided?)
- Write findings to `docs/findings/engram-offload-paper-read.md`
Without these specifics, Phase 2 models a generic offload pattern, not Engram specifically.
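To keep Phase 2 honest, it may help to record the extracted numbers in one structured place. A minimal sketch (the `EngramParams` name and field names are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class EngramParams:
    m: int                # resident set size (entries)
    M: int                # offloaded set size (entries)
    p: float              # fraction of tokens hitting the offloaded set
    overhead_metric: str  # "wall-time", "flops", or "energy", per the paper
    prefetch: str         # prose description of the prefetch heuristic
```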
### Phase 2 — ByteDMD microbenchmark
- Create `experiments/engram-offload/model.py` — a toy attention block (d_model=64, seq_len=128) with a KV memory bank split into resident and offloaded tiers. Pure Python, so ByteDMD traces all reads. (A starting sketch follows this list.)
- Create `experiments/engram-offload/run.py` with three variants, identical compute:
  - `resident`: all weights at small stack depth (baseline)
  - `offload_naive`: offloaded bank at depth ~1e6, so every lookup pays `ceil(sqrt(1e6)) = 1000` per byte
  - `offload_prefetch`: the paper's heuristic (promote predicted-hot entries to the top each step)
- Run `bytedmd(forward, (tokens, kv_bank))` for each variant. Record total cost, cost per token, and the trace distribution.
- Wall-time control: time the same three variants on CPU with `time.perf_counter()` to reproduce the "<3%" wall-time claim locally.
- Write findings to `docs/findings/engram-offload-bytedmd.md`, following `findings/_template.md`.
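A minimal starting sketch for `model.py`, under stated assumptions: the two-tier dict layout, the tier-routing logic, and every name except the `bytedmd(forward, (tokens, kv_bank))` call shape (which comes from this task text) are ours; how ByteDMD assigns stack depth to each tier needs checking against the repo.

```python
import random
import time

D_MODEL, SEQ_LEN = 64, 128

def make_bank(m=256, M=4096, seed=0):
    """KV bank split into a small resident tier and a large offloaded tier."""
    rng = random.Random(seed)
    def row():
        return [rng.random() for _ in range(D_MODEL)]
    return {"resident": [row() for _ in range(m)],
            "offloaded": [row() for _ in range(M)]}

def prefetch_step(kv_bank, predicted_hot):
    """offload_prefetch variant: promote predicted-hot offloaded entries into
    the resident tier before the step (our reading of the paper's heuristic)."""
    for idx in predicted_hot:
        kv_bank["resident"].append(kv_bank["offloaded"][idx])

def forward(tokens, kv_bank, variant="resident", p=0.1, seed=1):
    """Identical compute across variants; only which tier a read hits differs.
    Under ByteDMD the tier's placement should set the stack depth and hence
    the per-byte cost; this sketch only routes reads between tiers."""
    rng = random.Random(seed)
    out = []
    for t in tokens:
        deep = variant != "resident" and rng.random() < p
        tier = "offloaded" if deep else "resident"
        row = kv_bank[tier][t % len(kv_bank[tier])]
        out.append(sum(x * t for x in row))  # stand-in for the attention math
    return out

if __name__ == "__main__":
    tokens = list(range(SEQ_LEN))
    for variant in ("resident", "offload_naive", "offload_prefetch"):
        kv_bank = make_bank()  # fresh bank so variants do not interfere
        if variant == "offload_prefetch":
            prefetch_step(kv_bank, predicted_hot=range(16))
        # ByteDMD cost (call shape taken from the task text; confirm against
        # the repo): cost = bytedmd(forward, (tokens, kv_bank))
        t0 = time.perf_counter()  # wall-time control for the "<3%" claim
        forward(tokens, kv_bank, variant)
        print(f"{variant}: {time.perf_counter() - t0:.6f}s")
```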
## Decision thresholds
- Go (claim is real): ByteDMD overhead of `offload_prefetch` vs `resident` is 2-5×. The heuristic genuinely reduces deep reads.
- No-go (wall-time gaming): ByteDMD overhead >20× while wall-time overhead stays <3%. Write up as the canonical case study.
- Ambiguous: 5-20×. Re-run with larger `M` and varying `p`; the bandwidth-hiding advantage should shrink as the working set grows. (The rubric is sketched as code below.)
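The thresholds translate directly into a check for `run.py`. A sketch: the cutoffs are the ones stated above, the `classify` function and its signature are ours.

```python
def classify(bytedmd_ratio: float, walltime_ratio: float) -> str:
    """Map measured overheads (offload_prefetch / resident) onto the rubric."""
    if bytedmd_ratio <= 5:
        # ratios below the stated 2× lower bound also land here
        return "go: heuristic genuinely reduces deep reads"
    if bytedmd_ratio > 20 and walltime_ratio < 1.03:  # <3% wall-time overhead
        return "no-go: wall-time gaming"
    return "ambiguous: re-run with larger M and varying p"
```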
## Out of scope
- Re-training the 100B model. We are testing the metric, not the paper.
- PCIe energy models. ByteDMD's `sqrt(depth)` is the proxy.
## References
- Agent prompt: `docs/agent-prompts/engram-offload.md`
- Paper: https://deepseek.ai/blog/deepseek-engram-v4-architecture
- ByteDMD metric: `docs/research/bytedmd.md`
- ByteDMD repo: https://github.com/cybertronai/ByteDMD
- Origin issue: #77
- Pattern precedent: Task 9 (Muon review) — `009-muon-review.md`