DeepSeek Engram: Conditional Memory via Scalable Lookup: A New Sparsity Axis for LLMs
Author: MillionScope
Uploaded: 2026-01-13
Abstract
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native knowledge-lookup mechanism and must simulate retrieval through inefficient computation. We introduce conditional memory as a complementary sparsity axis through Engram, a modernized n-gram embedding module.
We identify a U-shaped scaling law governing the trade-off between neural computation (MoE) and static memory (Engram). Our 27B-parameter Engram model surpasses iso-parameter and iso-FLOPs MoE baselines, with unexpected gains in reasoning (BBH, ARC-Challenge) and code/math (HumanEval, MATH) beyond knowledge tasks (MMLU, CMMLU).
Engram relieves early layers from static reconstruction, effectively deepening the network. By delegating local dependencies to lookups, it frees attention for global context, boosting long-context retrieval. Deterministic addressing enables prefetching from host memory with negligible overhead.
1. Introduction
Language modeling involves two distinct tasks:
1. Compositional reasoning: requires deep, dynamic computation
2. Knowledge retrieval: relies on local, static patterns (entities, formulas)
Transformers simulate retrieval through expensive computation—reconstructing static lookup tables at runtime. We propose Engram, a conditional memory module combining classic n-gram structure with modern adaptations: tokenizer compression, multi-head hashing, contextualized gating, and multi-branch integration.
2. Architecture
Sparse Retrieval: Multi-head hashing maps n-gram contexts to entries in embedding tables. A vocabulary projection reduces the effective vocabulary size by ~23% by collapsing semantically equivalent tokens.
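The mechanics can be illustrated with a minimal PyTorch sketch. The head count, table size, hash function, and the `vocab_proj` mapping below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashedNgramLookup(nn.Module):
    """Minimal sketch of multi-head hashed n-gram retrieval (sizes are illustrative)."""
    def __init__(self, vocab_proj, num_heads=4, table_size=2**20, dim=256, ngram=2):
        super().__init__()
        self.ngram, self.num_heads, self.table_size = ngram, num_heads, table_size
        # One embedding table; each head addresses its own slice of slots.
        self.table = nn.Embedding(num_heads * table_size, dim)
        # Vocabulary projection: collapse semantically equivalent tokens before hashing.
        self.register_buffer("vocab_proj", vocab_proj)               # (vocab_size,) long
        # Per-head random multipliers for a simple multiplicative hash (assumption).
        self.register_buffer("hash_mult", torch.randint(1, 2**31 - 1, (num_heads, ngram)))

    def forward(self, token_ids):                                    # (batch, seq)
        ids = self.vocab_proj[token_ids]                             # compressed-vocab ids
        padded = F.pad(ids, (self.ngram - 1, 0))                     # left-pad for windows
        grams = padded.unfold(-1, self.ngram, 1)                     # (batch, seq, ngram)
        # Hash each n-gram once per head into that head's slice of the table.
        keys = (grams.unsqueeze(-2) * self.hash_mult).sum(-1) % self.table_size
        offsets = torch.arange(self.num_heads, device=keys.device) * self.table_size
        return self.table(keys + offsets)                            # (batch, seq, heads, dim)
```

For a 2-gram head, the entry retrieved at position t depends only on tokens t-1 and t, which is what makes the addressing deterministic and therefore prefetchable.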
Context-aware Gating: An attention-inspired mechanism uses the current hidden state as a query against the retrieved memory (keys/values), producing scalar gates that weight each contribution.
Multi-branch Integration: A parameter-sharing strategy: a single embedding table is shared across all branches, while gating matrices are branch-specific.
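A rough sketch of how the gating and branch sharing could fit together. The exact projection shapes and gate form are assumptions; only the structure (queries from hidden states against retrieved memory, scalar gates, a shared table with branch-specific gating weights) follows the description above:

```python
import torch
import torch.nn as nn

class EngramGate(nn.Module):
    """Sketch: attention-style scalar gating over retrieved memory. The embedding
    table itself is shared; only these gating projections differ per branch."""
    def __init__(self, hidden_dim=1024, mem_dim=256, num_branches=3):
        super().__init__()
        self.query_proj = nn.ModuleList(
            nn.Linear(hidden_dim, mem_dim, bias=False) for _ in range(num_branches)
        )
        self.out_proj = nn.Linear(mem_dim, hidden_dim, bias=False)
        self.scale = mem_dim ** -0.5

    def forward(self, hidden, mem, branch):
        # hidden: (batch, seq, hidden_dim); mem: (batch, seq, heads, mem_dim)
        q = self.query_proj[branch](hidden)                          # branch-specific query
        # One scalar gate per retrieved entry: relevance of that memory to this position.
        gate = torch.sigmoid((q.unsqueeze(-2) * mem).sum(-1) * self.scale)
        pooled = (gate.unsqueeze(-1) * mem).sum(-2)                  # gated sum over heads
        return hidden + self.out_proj(pooled)                        # residual injection
```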
System Efficiency: Deterministic addressing enables asynchronous prefetching from host DRAM while the GPU computes the preceding layers. The Zipfian distribution of n-grams allows effective multi-level caching.
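A minimal sketch of how such prefetching might look in PyTorch, assuming the table lives in pinned host DRAM and rows are gathered into a pinned staging buffer and uploaded on a side stream (sizes, names, and buffer management are simplifications, not the paper's implementation):

```python
import torch

table_cpu = torch.empty(1_000_000, 256, pin_memory=True)   # host-DRAM table (toy size)
staging = torch.empty(16_384, 256, pin_memory=True)          # pinned staging buffer
prefetch_stream = torch.cuda.Stream()

def prefetch(keys_cpu):
    """Start an asynchronous upload of the rows addressed by `keys_cpu` (LongTensor)."""
    n = keys_cpu.numel()
    torch.index_select(table_cpu, 0, keys_cpu, out=staging[:n])   # gather on the host
    with torch.cuda.stream(prefetch_stream):
        return staging[:n].to("cuda", non_blocking=True)           # overlaps GPU compute

def consume(rows_gpu):
    # The compute stream only waits at the point of use, hiding the transfer latency.
    torch.cuda.current_stream().wait_stream(prefetch_stream)
    return rows_gpu
```

Because n-gram frequencies are Zipfian, the hottest rows can additionally be cached on the GPU, so most lookups never touch the host path at all.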
3. Scaling Laws
Sparsity Allocation Problem: How should a fixed sparse-parameter budget be distributed between MoE experts and Engram memory?
Key Finding: There is a U-shaped relationship between validation loss and the allocation ratio α. Pure MoE (α = 1) is suboptimal; allocating roughly 20-25% of the budget to Engram (α ≈ 0.75) yields the best performance.
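In symbols (notation ours, for illustration): with a fixed sparse-parameter budget $P$, set $P_{\mathrm{MoE}} = \alpha P$ and $P_{\mathrm{Engram}} = (1-\alpha)P$; the validation loss $\mathcal{L}(\alpha)$ is then U-shaped in $\alpha$, with its minimum near $\alpha \approx 0.75$ rather than at the pure-MoE endpoint $\alpha = 1$.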
Infinite Memory Regime: When the memory budget is relaxed, scaling the number of memory slots improves validation loss following a power law, confirming conditional memory as a scalable sparsity axis.
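A generic form of such a law (ours, for illustration, with fitted constants $L_\infty$, $c$, $\gamma > 0$) is
$$\mathcal{L}(M) \approx L_\infty + c\,M^{-\gamma},$$
where $M$ is the number of memory slots, so additional memory keeps reducing loss at a predictable rate.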
4. Large-Scale Pre-training (262B tokens)
MoE-27B: 26.7B params (72 experts)
Engram-27B: 26.7B params (55 experts + 5.7B memory)
Engram-40B: 18.5B memory parameters
Result: Engram-27B consistently outperforms the iso-parameter MoE-27B, with substantial gains in reasoning and code/math, not just knowledge retrieval.
5. Long Context (32k via YaRN)
Offloading local dependencies to lookups preserves attention capacity for global context.
6. Analysis
Effective Depth: LogitLens shows that Engram achieves faster prediction convergence in early layers. CKA alignment reveals that Engram's layer 5 representations match the MoE baseline's layer 12, confirming increased effective depth.
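For reference, linear CKA, the similarity measure behind the layer-alignment claim, can be computed in a few lines of NumPy (a sketch; variable names are ours):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_tokens, dim)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2      # cross-covariance energy
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# A value near 1 for linear_cka(engram_layer5_acts, moe_layer12_acts) would support
# the claim that Engram's layer 5 plays the role of the baseline's layer 12.
```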
Sensitivity Analysis (suppression probe sketched below):
Factual knowledge: 71% degradation when Engram is suppressed
Reading comprehension: only 7% degradation
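One way to implement such a suppression probe (a sketch; `eval_fn` and the list of Engram modules are hypothetical handles, not the paper's code) is to zero the module's output with forward hooks and re-run the benchmark:

```python
import torch

@torch.no_grad()
def relative_degradation(model, eval_fn, engram_modules):
    """Accuracy drop when Engram outputs are zeroed; eval_fn(model) returns accuracy."""
    baseline = eval_fn(model)
    hooks = [m.register_forward_hook(lambda mod, inp, out: torch.zeros_like(out))
             for m in engram_modules]
    suppressed = eval_fn(model)
    for h in hooks:
        h.remove()
    return (baseline - suppressed) / baseline   # e.g. ~0.71 on factual-knowledge suites
```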
System Efficiency: Offloading a 100B-parameter Engram memory incurs less than a 1% throughput penalty via communication-computation overlap.
Gating Visualization: Strong activation on multi-token entities ("Alexander the Great") and formulaic phrases ("By the way").
7. Conclusion
We introduce conditional memory via Engram, establishing U-shaped scaling laws for sparsity allocation. The hybrid MoE+Engram model strictly outperforms pure MoE. Engram-27B achieves superior efficiency by relieving the backbone of static reconstruction and freeing attention for global reasoning. With near-zero-overhead offloading, conditional memory is an indispensable primitive for next-generation sparse models.