DeepSeek Engram: Conditional Memory via Scalable Lookup: A New Sparsity Axis for LLMs
Author: MillionScope
Uploaded: 2026-01-13
Abstract
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native knowledge-lookup mechanism and must simulate retrieval through inefficient computation. We introduce conditional memory as a complementary sparsity axis through Engram, a modernized n-gram embedding module.
We identify a U-shaped scaling law governing the trade-off between neural computation (MoE) and static memory (Engram). Our 27B-parameter Engram model surpasses iso-parameter and iso-FLOPs MoE baselines, with unexpected gains in reasoning (BBH, ARC-Challenge) and code/math (HumanEval, MATH) beyond knowledge tasks (MMLU, CMMLU).
Engram relieves early layers from static reconstruction, effectively deepening the network. By delegating local dependencies to lookups, it frees attention for global context, boosting long-context retrieval. Deterministic addressing enables prefetching from host memory with negligible overhead.
1. Introduction
Language modeling involves two distinct tasks:
1. Compositional reasoning: requires deep, dynamic computation
2. Knowledge retrieval: relies on local, static patterns (entities, formulas)
Transformers simulate retrieval through expensive computation—reconstructing static lookup tables at runtime. We propose Engram, a conditional memory module combining classic n-gram structure with modern adaptations: tokenizer compression, multi-head hashing, contextualized gating, and multi-branch integration.
2. Architecture
Sparse Retrieval: Multi-head hashing maps n-gram contexts to entries in embedding tables. A vocabulary projection reduces the effective vocabulary size by ~23% by collapsing semantically equivalent tokens.
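The mechanics can be illustrated with a minimal PyTorch sketch. The head count, table size, hash function, and the `vocab_proj` mapping below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashedNgramLookup(nn.Module):
    """Minimal sketch of multi-head hashed n-gram retrieval (sizes are illustrative)."""
    def __init__(self, vocab_proj, num_heads=4, table_size=2**20, dim=256, ngram=2):
        super().__init__()
        self.ngram, self.num_heads, self.table_size = ngram, num_heads, table_size
        # One embedding table; each head addresses its own slice of slots.
        self.table = nn.Embedding(num_heads * table_size, dim)
        # Vocabulary projection: collapse semantically equivalent tokens before hashing.
        self.register_buffer("vocab_proj", vocab_proj)               # (vocab_size,) long
        # Per-head random multipliers for a simple multiplicative hash (assumption).
        self.register_buffer("hash_mult", torch.randint(1, 2**31 - 1, (num_heads, ngram)))

    def forward(self, token_ids):                                    # (batch, seq)
        ids = self.vocab_proj[token_ids]                             # compressed-vocab ids
        padded = F.pad(ids, (self.ngram - 1, 0))                     # left-pad for windows
        grams = padded.unfold(-1, self.ngram, 1)                     # (batch, seq, ngram)
        # Hash each n-gram once per head into that head's slice of the table.
        keys = (grams.unsqueeze(-2) * self.hash_mult).sum(-1) % self.table_size
        offsets = torch.arange(self.num_heads, device=keys.device) * self.table_size
        return self.table(keys + offsets)                            # (batch, seq, heads, dim)
```

For a 2-gram head, the entry retrieved at position t depends only on tokens t-1 and t, which is what makes the addressing deterministic and therefore prefetchable.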
Context-aware Gating: An attention-inspired mechanism uses the current hidden state as a query against the retrieved memory (keys/values), producing scalar gates that weight each contribution.
Multi-branch Integration: A parameter-sharing strategy: a single embedding table is shared across all branches, while gating matrices are branch-specific.
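A rough sketch of how the gating and branch sharing could fit together. The exact projection shapes and gate form are assumptions; only the structure (queries from hidden states against retrieved memory, scalar gates, a shared table with branch-specific gating weights) follows the description above:

```python
import torch
import torch.nn as nn

class EngramGate(nn.Module):
    """Sketch: attention-style scalar gating over retrieved memory. The embedding
    table itself is shared; only these gating projections differ per branch."""
    def __init__(self, hidden_dim=1024, mem_dim=256, num_branches=3):
        super().__init__()
        self.query_proj = nn.ModuleList(
            nn.Linear(hidden_dim, mem_dim, bias=False) for _ in range(num_branches)
        )
        self.out_proj = nn.Linear(mem_dim, hidden_dim, bias=False)
        self.scale = mem_dim ** -0.5

    def forward(self, hidden, mem, branch):
        # hidden: (batch, seq, hidden_dim); mem: (batch, seq, heads, mem_dim)
        q = self.query_proj[branch](hidden)                          # branch-specific query
        # One scalar gate per retrieved entry: relevance of that memory to this position.
        gate = torch.sigmoid((q.unsqueeze(-2) * mem).sum(-1) * self.scale)
        pooled = (gate.unsqueeze(-1) * mem).sum(-2)                  # gated sum over heads
        return hidden + self.out_proj(pooled)                        # residual injection
```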
System Efficiency: Deterministic addressing enables asynchronous prefetching from host DRAM while the GPU computes the preceding layers. The Zipfian distribution of n-grams allows effective multi-level caching.
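A minimal sketch of how such prefetching might look in PyTorch, assuming the table lives in pinned host DRAM and rows are gathered into a pinned staging buffer and uploaded on a side stream (sizes, names, and buffer management are simplifications, not the paper's implementation):

```python
import torch

table_cpu = torch.empty(1_000_000, 256, pin_memory=True)   # host-DRAM table (toy size)
staging = torch.empty(16_384, 256, pin_memory=True)          # pinned staging buffer
prefetch_stream = torch.cuda.Stream()

def prefetch(keys_cpu):
    """Start an asynchronous upload of the rows addressed by `keys_cpu` (LongTensor)."""
    n = keys_cpu.numel()
    torch.index_select(table_cpu, 0, keys_cpu, out=staging[:n])   # gather on the host
    with torch.cuda.stream(prefetch_stream):
        return staging[:n].to("cuda", non_blocking=True)           # overlaps GPU compute

def consume(rows_gpu):
    # The compute stream only waits at the point of use, hiding the transfer latency.
    torch.cuda.current_stream().wait_stream(prefetch_stream)
    return rows_gpu
```

Because n-gram frequencies are Zipfian, the hottest rows can additionally be cached on the GPU, so most lookups never touch the host path at all.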
3. Scaling Laws
Sparsity Allocation Problem: How should a fixed sparse-parameter budget be distributed between MoE experts and Engram memory?
Key Finding: There is a U-shaped relationship between validation loss and the allocation ratio α. Pure MoE (α = 1) is suboptimal; allocating roughly 20-25% of the budget to Engram (α ≈ 0.75) yields the best performance.
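In symbols (notation ours, for illustration): with a fixed sparse-parameter budget $P$, set $P_{\mathrm{MoE}} = \alpha P$ and $P_{\mathrm{Engram}} = (1-\alpha)P$; the validation loss $\mathcal{L}(\alpha)$ is then U-shaped in $\alpha$, with its minimum near $\alpha \approx 0.75$ rather than at the pure-MoE endpoint $\alpha = 1$.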
Infinite Memory Regime: When the memory budget is relaxed, scaling the number of memory slots improves validation loss following a power law, confirming conditional memory as a scalable sparsity axis.
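A generic form of such a law (ours, for illustration, with fitted constants $L_\infty$, $c$, $\gamma > 0$) is
$$\mathcal{L}(M) \approx L_\infty + c\,M^{-\gamma},$$
where $M$ is the number of memory slots, so additional memory keeps reducing loss at a predictable rate.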
4. Large-Scale Pre-training (262B tokens)
MoE-27B: 26.7B params (72 experts)
Engram-27B: 26.7B params (55 experts + 5.7B memory)
Engram-40B: 18.5B memory parameters
Result: Engram-27B consistently outperforms the iso-parameter MoE-27B, with substantial gains in reasoning and code/math, not just knowledge retrieval.
5. Long Context (32k via YaRN)
Offloading local dependencies to lookups preserves attention capacity for global context.
6. Analysis
Effective Depth: LogitLens shows that Engram achieves faster prediction convergence in early layers. CKA alignment reveals that Engram's layer 5 representations match the MoE baseline's layer 12, confirming increased effective depth.
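For reference, linear CKA, the similarity measure behind the layer-alignment claim, can be computed in a few lines of NumPy (a sketch; variable names are ours):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_tokens, dim)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2      # cross-covariance energy
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# A value near 1 for linear_cka(engram_layer5_acts, moe_layer12_acts) would support
# the claim that Engram's layer 5 plays the role of the baseline's layer 12.
```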
Sensitivity Analysis (suppression probe sketched below):
Factual knowledge: 71% degradation when Engram is suppressed
Reading comprehension: only 7% degradation
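One way to implement such a suppression probe (a sketch; `eval_fn` and the list of Engram modules are hypothetical handles, not the paper's code) is to zero the module's output with forward hooks and re-run the benchmark:

```python
import torch

@torch.no_grad()
def relative_degradation(model, eval_fn, engram_modules):
    """Accuracy drop when Engram outputs are zeroed; eval_fn(model) returns accuracy."""
    baseline = eval_fn(model)
    hooks = [m.register_forward_hook(lambda mod, inp, out: torch.zeros_like(out))
             for m in engram_modules]
    suppressed = eval_fn(model)
    for h in hooks:
        h.remove()
    return (baseline - suppressed) / baseline   # e.g. ~0.71 on factual-knowledge suites
```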
System Efficiency: Offloading a 100B-parameter Engram memory incurs less than a 1% throughput penalty via communication-computation overlap.
Gating Visualization: Strong activation on multi-token entities ("Alexander the Great") and formulaic phrases ("By the way").
7. Conclusion
We introduce conditional memory via Engram, establishing U-shaped scaling laws for sparsity allocation. The hybrid MoE+Engram model strictly outperforms pure MoE. Engram-27B achieves superior efficiency by relieving the backbone of static reconstruction and freeing attention for global reasoning. With near-zero-overhead offloading, conditional memory is an indispensable primitive for next-generation sparse models.