DeepSeek Engram Offload — ByteDMD Verification

You are contributing to the Sutro Group, a research lab studying energy-efficient AI training. Your task is to verify whether the DeepSeek Engram paper's "<3% offload overhead" claim survives when priced under ByteDMD (byte-level data movement), rather than wall-time.

This is a two-phase task. Phase 1 is a paper-reading pass; Phase 2 is a microbenchmark. Do Phase 1 first — without the paper's actual m:M ratio and prefetch heuristic, Phase 2 models a strawman.

Project context

Read these files first:

  • DISCOVERIES.md — what's already known
  • LAB.md — lab rules (important: do NOT modify measurement code: tracker.py, cache_tracker.py, data.py, config.py, harness.py, fast.py, src/bytedmd/)
  • docs/research/bytedmd.md — ByteDMD metric spec
  • docs/tasks/011-engram-offload-bytedmd.md — task spec and decision thresholds
  • findings/_template.md — expected report format

Key context:

  • Our metric: ByteDMD — byte-granularity data movement. Reading a value at stack depth d costs ceil(sqrt(d)) per byte. Writes are free. Reference: https://github.com/cybertronai/ByteDMD
  • The central suspicion: offloading to SSD means reads at stack depth ~1e6, which should cost sqrt(1e6) = 1000 per byte — roughly 31× the cost of HBM reads at depth ~1e3. A 3% wall-time overhead claim is hard to reconcile with that unless the heuristic genuinely reduces how many deep reads happen (not just hides their latency via async prefetch).
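
The depth arithmetic from the last bullet, in a few lines of Python. This is a sketch of the stated cost rule only, not the tracer's actual accounting (that lives in src/bytedmd/):

import math

def read_cost_per_byte(depth: int) -> int:
    # ByteDMD read rule as stated above: ceil(sqrt(depth)) per byte; writes are free
    return math.ceil(math.sqrt(depth))

hbm_like = read_cost_per_byte(1_000)        # 32 per byte at depth ~1e3
ssd_like = read_cost_per_byte(1_000_000)    # 1000 per byte at depth ~1e6
print(ssd_like / hbm_like)                  # ~31x: the gap a 3% overhead claim has to explain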

Phase 1 — Paper extraction

Read the paper carefully

Primary source: https://deepseek.ai/blog/deepseek-engram-v4-architecture

Extract the following with quoted evidence from the paper where possible:

  1. Overhead type — does the paper report wall-time, FLOPs, energy, or tokens/sec? The "3%" figure applies to which metric specifically?
  2. m:M ratio — resident set size (in cache/HBM) vs offloaded set size. Can be tokens, KV entries, parameters — whatever the paper offloads. If not stated explicitly, infer from reported configurations.
  3. Lookup frequency p — fraction of forward-pass lookups that hit the offloaded set.
  4. Prefetch heuristic — random? LRU? model-guided prediction of hot entries? Learned routing? Copy-on-access? Describe the mechanism in enough detail to implement a reduced version.
  5. What "offload" physically means — CPU RAM? NVMe SSD? Remote tier? All of the above?

Write docs/findings/engram-offload-paper-read.md

Use this structure:

# DeepSeek Engram — Paper Extraction

## Source
[URL, date accessed, paper version]

## Overhead claim
[Exact quote. What metric does "3%" refer to?]

## Offload configuration
- m (resident): [size, units, quoted source]
- M (offloaded): [size, units, quoted source]
- m:M ratio: [derived]
- p (lookup fraction): [quoted or inferred]
- Physical tier: [CPU RAM / SSD / other]

## Prefetch heuristic
[2-4 paragraphs describing the mechanism, with code-level detail if possible]

## Gaps
[What the paper does NOT say that Phase 2 will have to assume. Be explicit.]

## Implications for Phase 2
[What values to use for m, M, p, and which heuristic to model]

Do NOT proceed to Phase 2 if:

  • The paper doesn't report concrete m:M or p (note gaps explicitly; flag for author follow-up)
  • The "overhead" metric is ambiguous (could be any of wall-time / FLOPs / tokens-per-sec)

In that case, stop at Phase 1 and document the gaps.

Phase 2 — ByteDMD microbenchmark

Reduced model

Create experiments/engram-offload/model.py (a layout sketch follows the list):

  • Toy attention block: d_model=64, seq_len=128, 1 head
  • KV memory bank split into two tiers:
      • resident tier (size m): lives at the top of the ByteDMD stack (depth ~100-1000)
      • offloaded tier (size M): forced to depth ~1e6 via padding/unused writes
  • Pure Python values so ByteDMD's tracer sees every read. No numpy in the inner loop (numpy escapes the tracer; see docs/research/bytedmd.md)
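
A minimal layout sketch, assuming ByteDMD assigns depth by write order, so burying the offloaded tier under free padding writes pushes its reads to ~1e6. Verify that assumption and the padding count against docs/research/bytedmd.md; all names here are placeholders:

D_MODEL, SEQ_LEN = 64, 128

def build_kv_bank(m, M, pad_writes=1_000_000):
    # Offloaded tier written first, then buried under padding writes (writes are free under
    # ByteDMD) so lookups into it land deep; resident tier written last so its reads stay shallow.
    # Whether this actually forces depth ~1e6 depends on how the tracer assigns depth.
    offloaded = {i: [0.01 * ((i + j) % 7) for j in range(D_MODEL)] for i in range(M)}
    padding = [0.0] * pad_writes      # kept alive by returning it, so the offloaded tier stays buried
    resident = {i: [0.01 * ((i + j) % 5) for j in range(D_MODEL)] for i in range(m)}
    return resident, offloaded, padding

def score(query, value):
    # pure-Python dot product so the tracer sees every element read (no numpy here)
    return sum(q * v for q, v in zip(query, value))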

Three variants, same compute

Create experiments/engram-offload/run.py:

  1. resident — all KV entries resident (baseline). Every lookup is shallow.
  2. offload_naive — M offloaded entries at depth 1e6. Every lookup into the offloaded tier pays sqrt(1e6) = 1000 per byte. No prefetching.
  3. offload_prefetch — paper's heuristic from Phase 1. Predicted-hot entries are promoted to the top of the stack each step, reducing how many reads actually pay the deep cost (see the promotion sketch below).

All three process the same tokens × kv_bank input and produce the same output (deterministic forward pass; only the memory layout differs).
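
For offload_prefetch, the per-step promotion might look roughly like this. It is a sketch only: hot_keys stands in for whatever heuristic Phase 1 extracts, and the copy granularity should be checked against how the tracer counts reads:

def promote_hot(resident, offloaded, hot_keys):
    # Copy predicted-hot entries into the resident tier before this step's lookups.
    # The promotion itself reads the offloaded entry once at deep cost; the saving only
    # appears if the entry is then reused several times at shallow depth.
    for k in hot_keys:
        if k in offloaded and k not in resident:
            resident[k] = [x for x in offloaded[k]]   # element-wise so the tracer sees the reads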

Measurement

For each variant:

from bytedmd import bytedmd, traced_eval

cost = bytedmd(forward, (tokens, kv_bank))              # total ByteDMD cost of one forward pass
trace, out = traced_eval(forward, (tokens, kv_bank))    # per-read trace (for the depth histogram) plus the output

Record:

  • Total ByteDMD cost
  • Cost per token
  • Trace depth distribution (histogram of stack depths at each read)
  • Wall-time with time.perf_counter() for the same forward pass

Save to experiments/engram-offload/results.json.
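
A recording sketch that assumes nothing about the trace schema beyond what traced_eval returns; depth_of is a placeholder for however a read's stack depth is pulled out of a trace entry (check docs/research/bytedmd.md for the actual format):

import json, time
from collections import Counter

def record(results, name, cost, trace, n_tokens, wall_time_s, depth_of=lambda e: e[0]):
    # One entry per variant; depth_of is an assumed accessor, not ByteDMD's actual schema.
    results[name] = {
        "bytedmd_cost": cost,
        "cost_per_token": cost / n_tokens,
        "depth_histogram": dict(Counter(depth_of(e) for e in trace)),
        "wall_time_s": wall_time_s,          # measured separately with time.perf_counter()
    }

# once all three variants are in `results`:
# with open("experiments/engram-offload/results.json", "w") as f:
#     json.dump(results, f, indent=2)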

Findings

Create docs/findings/engram-offload-bytedmd.md following findings/_template.md. Must include:

  • Hypothesis (from the paper-read + your prior)
  • Config (m, M, p, heuristic — reference Phase 1 extraction)
  • Results table (all three variants: cost, cost-per-token, wall-time, ratios)
  • Classification (WIN / LOSS / INVALID / INCONCLUSIVE / BASELINE) per the decision thresholds in docs/tasks/011-engram-offload-bytedmd.md
  • Analysis — what worked, what didn't, the surprise
  • Impact on DISCOVERIES.md (if any) — does this add to the ByteDMD-vs-wall-time story?

Decision thresholds

From docs/tasks/011-engram-offload-bytedmd.md:

| ByteDMD overhead (prefetch vs resident) | Classification | Action |
| --- | --- | --- |
| 2-5× | WIN for DeepSeek — claim is real | Positive case study; their heuristic genuinely works |
| 5-20× | INCONCLUSIVE | Re-run with larger M and varying p; bandwidth-hiding advantage should shrink as working set grows |
| >20× | LOSS — wall-time gaming | Canonical case study for why SutroYaro uses ByteDMD |
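
A small helper for the findings write-up, mirroring the thresholds above (ratios below 2× are not covered by the table, so they are left unclassified here rather than guessed at):

def classify(overhead_ratio):
    # Thresholds from docs/tasks/011-engram-offload-bytedmd.md (prefetch vs resident)
    if 2 <= overhead_ratio <= 5:
        return "WIN"            # claim is real
    if 5 < overhead_ratio <= 20:
        return "INCONCLUSIVE"   # re-run with larger M and varying p
    if overhead_ratio > 20:
        return "LOSS"           # wall-time gaming
    return None                 # below 2x: not covered by the task's threshold table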

Rules

  • DO NOT modify measurement code: src/bytedmd/, src/harness.py, tracker.py, cache_tracker.py, data.py, config.py, fast.py
  • DO NOT modify DISCOVERIES.md, LAB.md, CLAUDE.md, CODEX.md, or any other project configuration. Updates to DISCOVERIES.md are added separately via PR after findings review.
  • Pure Python in the inner loop. Numpy escapes ByteDMD; see docs/research/bytedmd.md.
  • All findings go under docs/findings/ (the mkdocs-published location). Do NOT create files in the root-level findings/ directory.
  • Phase 1 first. Do not start Phase 2 until the paper extraction is written.

Out of scope

  • Re-training or re-implementing the full DeepSeek model. We're testing the metric on a reduced analogue.
  • Modeling PCIe/NVMe energy directly. ByteDMD's sqrt(depth) is the proxy.
  • Modifying the ByteDMD tracer to "handle" offloading more cleverly. The metric is the spec; we're measuring against it, not tuning it.