DeepSeek V4

DeepSeek Engram Memory System: The Revolutionary Technology Breaking Through Infinite Context

An in-depth analysis of DeepSeek V4's core innovation — the Engram conditional memory system. Learn how it breaks through the O(n²) bottleneck of traditional attention mechanisms, achieves O(1) complexity memory retrieval, and enables unlimited context windows.

Tech Analysis
DeepSeek AI Team | 2026-03-09 | 10 min read
#deepseek #engram #memory #context-window #v4

Throughout the evolution of large language models, context window length has been the core bottleneck constraining model capabilities. From GPT-3's 2K tokens, to Claude's 200K tokens, to Gemini's 1M tokens, the industry has been pursuing longer contexts through "brute-force expansion." However, DeepSeek V4's Engram memory system proposes a fundamentally different approach: instead of making the window bigger, teach the model to "remember."

The Fundamental Dilemma of Traditional Context Windows

O(n²) Attention Complexity: The Insurmountable Computational Wall

The standard Transformer's self-attention mechanism has a complexity of O(n²), where n is the sequence length. This means:

| Context Length | Attention Computation | Memory Usage (FP16) | Inference Latency |
|---|---|---|---|
| 4K tokens | 16M operations | ~0.5 GB | ~50ms |
| 32K tokens | 1B operations | ~8 GB | ~400ms |
| 128K tokens | 16B operations | ~128 GB | ~6s |
| 1M tokens | 1T operations | ~8 TB | ~6min |

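When context expands from 4K to 1M tokens, computation increases approximately 62,500x. Optimizations such as FlashAttention and Ring Attention help in practice: FlashAttention avoids materializing the full attention matrix and cuts memory traffic, and Ring Attention spreads the sequence across devices. Neither changes the quadratic growth in the number of operations.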
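The 62,500x figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes, as the table does, that cost is simply n² pairwise scores and that token counts are decimal (4K = 4,000); the function name is illustrative.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key score computations in full self-attention."""
    return n_tokens * n_tokens

lengths = {"4K": 4_000, "32K": 32_000, "128K": 128_000, "1M": 1_000_000}
base = attention_pairs(lengths["4K"])

for label, n in lengths.items():
    pairs = attention_pairs(n)
    print(f"{label:>4}: {pairs:.2e} pairs, {pairs // base:,}x the 4K cost")

# Growth factor from the shortest to the longest context in the table
growth_4k_to_1m = attention_pairs(lengths["1M"]) // base
```

Because the cost depends on n², a 250x longer input costs 250² = 62,500 times as much.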

Limitations of Existing Solutions

Sliding Window Attention

# Sliding window illustration (window size w)
# Each token only attends to w tokens before and after
Attention range: [i-w, i+w]
Complexity: O(n·w)  # Linear, but loses long-range dependencies

Sliding windows reduce complexity to linear, but at the cost of completely losing the ability to capture long-range information. For long document tasks requiring cross-chapter reasoning, this is a fatal flaw.
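The trade-off can be made concrete with a small sketch. It assumes a symmetric window of w tokens on each side, as in the illustration above; the function names and the values of n and w are chosen only for the example.

```python
def sliding_window_neighbors(i: int, n: int, w: int) -> range:
    """Positions token i may attend to: [i - w, i + w], clipped to the sequence."""
    return range(max(0, i - w), min(n, i + w + 1))

def count_attended_pairs(n: int, w: int) -> int:
    """Total number of (query, key) pairs under the sliding window."""
    return sum(len(sliding_window_neighbors(i, n, w)) for i in range(n))

n, w = 10_000, 128
windowed = count_attended_pairs(n, w)   # grows like n * (2w + 1)
full = n * n                            # grows like n^2

# The last token cannot attend to the first one: the long-range link is gone.
can_see_start = 0 in sliding_window_neighbors(n - 1, n, w)
```

Here the windowed pair count is about 2.5 million against 100 million for full attention, but `can_see_start` is false.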

Sparse Attention

Sparse attention reduces computation through predefined sparse patterns (e.g., local + global), but suffers from two problems:

  1. Sparse patterns are static and cannot dynamically adjust based on content
  2. Critical information may fall exactly in the positions that are sparsified away
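Both problems show up in a toy version of a fixed local + global pattern. The sketch below is illustrative only: the pattern, the stride, and the needle position are assumptions made for the example, not a description of any particular sparse-attention model.

```python
def static_sparse_pattern(i: int, n: int, local: int, global_stride: int) -> set[int]:
    """Fixed pattern: tokens within `local` of i, plus every global_stride-th token."""
    nearby = set(range(max(0, i - local), min(n, i + local + 1)))
    global_tokens = set(range(0, n, global_stride))
    return nearby | global_tokens

n = 4_096
query_pos = 4_000
needle_pos = 1_234  # position of a critical fact, chosen for illustration

visible = static_sparse_pattern(query_pos, n, local=64, global_stride=256)

# The pattern never looks at the content, so it cannot know the needle matters.
needle_visible = needle_pos in visible
```

The query attends to only 145 of 4,096 positions, and position 1,234 is not among them.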

Retrieval-Augmented Generation (RAG)

RAG splits long documents into chunks and retrieves relevant segments through vector search. However, RAG is essentially a "bolt-on" system:

  • Retrieval quality depends on the embedding model, leading to semantic loss
  • Cannot handle tasks requiring holistic understanding (e.g., thematic analysis of an entire book)
  • Chunk boundary splitting may break contextual coherence
  • Adds system complexity and latency
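The chunk-boundary problem is easy to reproduce. The following sketch uses a fixed-size word chunker and a word-overlap retriever as stand-ins for a real embedding pipeline; both are simplifications assumed for illustration.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve(chunks: list[str], query: str) -> str:
    """Return the chunk sharing the most words with the query (toy retriever)."""
    q = set(query.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

document = (
    "The supplier must deliver within thirty days . "
    "This obligation is waived if the buyer cancels the order ."
)
chunks = chunk_words(document, chunk_size=8)
best = retrieve(chunks, "when must the supplier deliver")
```

The retriever returns the first chunk, which states the obligation, but the exception ("waived if the buyer cancels") sits in a different chunk and never reaches the generator.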

Engram Memory System: From "Seeing" to "Remembering"

Core Philosophy

Engram (memory trace) takes its name from the neuroscience concept: the physical or chemical changes in the brain that store memories. The DeepSeek team brought this concept into large language models, designing a mechanism that combines conditional memory writing with O(1) retrieval.

Unlike traditional attention that "re-reads the entire text every time," Engram's core logic is:

Read once, remember key information, and retrieve directly from memory for subsequent reasoning instead of re-traversing the original text.

Architecture Design

The Engram system consists of three core modules:

┌─────────────────────────────────────────────────┐
│           Engram Memory System Architecture      │
├─────────────────────────────────────────────────┤
│                                                  │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │  Memory   │───→│  Memory  │───→│  Memory  │   │
│  │ Encoder   │    │  Store   │    │ Retriever│   │
│  └──────────┘    └──────────┘    └──────────┘   │
│       ↑               ↑               │         │
│       │      Conditional Write         │         │
│       │        (Gating Fn)            ↓         │
│  ┌──────────┐                   ┌──────────┐    │
│  │  Input   │                   │  Decoder  │    │
│  │ Token    │                   │  Output   │    │
│  │ Stream   │                   │          │    │
│  └──────────┘                   └──────────┘    │
│                                                  │
└─────────────────────────────────────────────────┘

1. Memory Encoder

Compresses input token sequences into fixed-dimensional memory vectors. The key is that the encoder doesn't treat all tokens equally — it uses an importance scoring function to determine which information is worth remembering:

$$ \text{importance}(x_i) = \sigma(W_g \cdot h_i + b_g) $$

Where $h_i$ is the hidden state of the $i$-th token, $W_g$ and $b_g$ are learnable parameters, and $\sigma$ is the sigmoid function. Only tokens with importance scores exceeding threshold $\tau$ are written to memory.
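A minimal sketch of the scoring rule, using plain Python lists in place of tensors. The weights, bias, and threshold below are hypothetical values picked for the example; in a trained model $W_g$ and $b_g$ would be learned.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def importance(h: list[float], w_g: list[float], b_g: float) -> float:
    """sigma(W_g . h_i + b_g) for a single hidden state h_i."""
    return sigmoid(sum(w * x for w, x in zip(w_g, h)) + b_g)

# Hypothetical parameters and hidden states, for illustration only
w_g, b_g, tau = [1.0, -2.0, 0.5], 0.0, 0.5
hidden_states = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

scores = [importance(h, w_g, b_g) for h in hidden_states]

# Only tokens whose score strictly exceeds tau are written to memory
written = [i for i, s in enumerate(scores) if s > tau]
```

The three scores are about 0.88, 0.12, and exactly 0.5, so only the first token passes the strict threshold.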

2. Conditional Memory Write

This is Engram's most critical innovation. Traditional KV Cache indiscriminately caches key-value pairs for all tokens, causing linear memory growth. Engram's conditional write mechanism achieves selective storage:

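```python
# Conditional write pseudocode
def conditional_write(hidden_states, memory_store):
    # Compute an importance score for each token
    importance = sigmoid(gate_proj(hidden_states))

    # Only tokens whose score exceeds the threshold are written to memory
    mask = importance > threshold  # typical threshold: 0.5

    # Compress the representations of the important tokens and write them
    compressed = compress(hidden_states[mask])
    memory_store.write(compressed, importance[mask])

    # When memory reaches capacity, evict the least important entries
    if memory_store.size > max_capacity:
        memory_store.evict_least_important()
```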

The effect of this mechanism: when processing a 1 million token document, only 30-50K key memory units may actually be written to the memory store, a compression ratio of roughly 20-33x.
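The pseudocode above can be turned into a small runnable sketch. It is a simplification under stated assumptions: the `MemoryStore` class is hypothetical, strings stand in for compressed vectors, the compression step is omitted, and eviction uses a min-heap keyed on importance.

```python
import heapq

class MemoryStore:
    """Bounded store that keeps the highest-importance entries (hypothetical design)."""

    def __init__(self, max_capacity: int):
        self.max_capacity = max_capacity
        self._heap = []    # min-heap of (importance, insertion_order, payload)
        self._counter = 0  # tie-breaker so payloads are never compared

    def write(self, payload: str, importance: float) -> None:
        heapq.heappush(self._heap, (importance, self._counter, payload))
        self._counter += 1
        if len(self._heap) > self.max_capacity:
            heapq.heappop(self._heap)  # evict the least important entry

    def contents(self) -> list[str]:
        """Payloads ordered from most to least important."""
        return [p for _, _, p in sorted(self._heap, reverse=True)]

def conditional_write(tokens, scores, store, threshold=0.5):
    """Write only the tokens whose importance exceeds the threshold."""
    for token, score in zip(tokens, scores):
        if score > threshold:
            store.write(token, score)

store = MemoryStore(max_capacity=2)
conditional_write(["the", "deadline", "is", "Friday", "noon"],
                  [0.1, 0.9, 0.2, 0.95, 0.7], store)
kept = store.contents()
```

Three tokens pass the gate, but capacity is two, so the lowest-scoring of them ("noon") is evicted and memory size stays bounded regardless of input length.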

3. O(1) Memory Retrieval

Standard attention compares each query against every cached token (O(n) per query), while Engram employs hash-based approximate nearest neighbor retrieval, giving O(1) expected query cost as long as hash buckets stay bounded in size:

$$ \text{retrieved} = \text{LSH}(q, \mathcal{M}) $$

Where $q$ is the query vector, $\mathcal{M}$ is the memory store, and LSH (Locality-Sensitive Hashing) ensures semantically similar memories can be retrieved in constant time.
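One common LSH family for cosine similarity is random-hyperplane hashing, sketched below. Whether Engram uses this particular family is not stated in the source, so treat the class and its parameters as assumptions made for illustration.

```python
import random
from collections import defaultdict

class HyperplaneLSH:
    """Sign-of-random-projection hashing for cosine similarity."""

    def __init__(self, dim: int, n_bits: int, seed: int = 0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        self.buckets = defaultdict(list)

    def _key(self, vec):
        # One bit per hyperplane: which side of the plane the vector falls on
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in self.planes)

    def add(self, vec, payload):
        self.buckets[self._key(vec)].append(payload)

    def query(self, vec):
        # One hash computation plus one dictionary lookup; no scan over all entries
        return self.buckets.get(self._key(vec), [])

index = HyperplaneLSH(dim=4, n_bits=8)
index.add([1.0, 0.0, 0.0, 0.0], "memory A")
index.add([-1.0, 0.0, 0.0, 0.0], "memory B")

hits = index.query([2.0, 0.0, 0.0, 0.0])  # same direction as A, different length
```

The query cost depends on the number of hash bits and the size of one bucket, not on the total number of stored memories. The price is approximation: a near neighbor that lands in a different bucket is missed, which is why practical systems usually query several hash tables.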

Complexity Comparison

| Dimension | Standard Attention | Engram Memory System |
|---|---|---|
| Computational Complexity | O(n²) | O(n) encoding + O(1) retrieval |
| Memory Complexity | O(n²) | O(k), k = memory capacity |
| Long-range Dependencies | Decay with distance | No distance limitation |
| Information Retention | Complete but redundant | Selective compression |
| Dynamic Adaptation | Static window | Content-driven dynamic memory |

Engram vs RAG: Fundamental Differences

Many people compare Engram with RAG, but they have fundamental architectural differences:

| Comparison | RAG | Engram |
|---|---|---|
| Integration | External bolt-on | Internal, embedded in model |
| Retrieval Unit | Text chunks (coarse-grained) | Semantic memory vectors (fine-grained) |
| Training | Retriever and generator trained separately | End-to-end joint training |
| Information Transform | Text → embedding → text | Hidden state → memory → hidden state |
| Global Understanding | Not supported (retrieves local fragments only) | Supported (memory encodes global structure) |
| Latency | High (requires external retrieval call) | Low (memory retrieval within model forward pass) |
| Information Loss | High (chunk boundary cuts, embedding compression) | Low (conditional compression preserves key semantics) |

The most critical distinction: RAG is "retrieve then read," while Engram is "read then remember." RAG depends on finding the right chunk at inference time, but Engram has already encoded key information into retrievable memory during initial processing.

Achieving Infinite Context: 1M Native + Engram Extension

DeepSeek V4 employs a dual-layer context strategy:

Layer 1: 1M Token Native Context

A 1M token native context window achieved through optimized MLA (Multi-head Latent Attention) + FlashAttention-3 + Ring Attention. This layer provides precise full-attention coverage, suitable for tasks requiring exact token-level attention.

Layer 2: Engram Infinite Extension

For scenarios exceeding 1M tokens (such as processing entire code repositories, multiple books, or long-term conversation histories), the Engram memory system automatically takes over:

Processing Flow:
┌──────────┐     ┌──────────┐     ┌──────────┐
│ First 1M  │───→│  Native  │───→│ Precise  │
│  tokens   │     │Attention │     │Processing│
│           │     │ (Full)   │     │          │
└──────────┘     └──────────┘     └──────────┘
                                        │
                                  Write to Engram
                                        ↓
┌──────────┐     ┌──────────┐     ┌──────────┐
│ Subsequent│───→│  Native  │───→│ Precise  │
│  tokens   │     │Attention │     │Processing│
│           │     │+ Engram  │     │+ Memory  │
└──────────┘     └──────────┘     └──────────┘

This design enables DeepSeek V4 to theoretically process any length of input without experiencing catastrophic performance degradation.
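The hand-off between the two layers can be pictured as a bounded window that spills into memory. The sketch below is an assumption about how such a flow could work, not the documented mechanism: a deque stands in for the native attention window, a list stands in for the Engram store, and `is_important` is a placeholder for the gating function.

```python
from collections import deque

def stream_with_memory(tokens, window_size, is_important):
    """Keep the most recent tokens in a window; offer older ones to memory."""
    window = deque()  # stands in for the native attention window
    memory = []       # stands in for the Engram store
    for tok in tokens:
        window.append(tok)
        if len(window) > window_size:
            evicted = window.popleft()
            if is_important(evicted):
                memory.append(evicted)  # survives only if the gate accepts it
    return list(window), memory

# Tiny example: 10 tokens, a window of 4, every third token counted as important
window, memory = stream_with_memory(range(10), window_size=4,
                                    is_important=lambda t: t % 3 == 0)
```

The window never exceeds its fixed size, and only gated tokens are kept beyond it, which is what lets the input length grow without the attention cost growing with it.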

Comparison with Mainstream Long-Context Models

| Feature | Gemini 2.5 Pro | Claude Opus 4.6 | GPT-5.4 | DeepSeek V4 |
|---|---|---|---|---|
| Native Context | 1M tokens | 200K tokens | 256K tokens | 1M tokens |
| Context Extension | None | None | None | Engram infinite extension |
| Effective Utilization | ~60% (long-range decay) | ~85% | ~75% | ~95% (memory-assisted) |
| 1M Token Inference Latency | ~45s | Not supported | Not supported | ~12s |
| 2M Token Processing | Not supported | Not supported | Not supported | Supported (Engram) |
| Memory Usage (1M) | ~120 GB | N/A | N/A | ~35 GB |
| Long Document Summarization | Excellent | Excellent | Good | Excellent |
| Cross-document Reasoning | Limited | Limited | Limited | Strong (memory association) |

Key Advantage Analysis

1. Effective Utilization Rate

The "Needle-in-a-Haystack" test for traditional long-context models shows that when context exceeds a certain length, the model's retrieval accuracy for information at middle positions drops significantly (the "Lost in the Middle" problem). Engram fundamentally solves this problem by extracting key information into independent memory storage.

2. Memory Efficiency

Engram's conditional compression enables processing 1M tokens with only ~35GB of memory, far below the 128GB+ required for full attention. This means:

  • A single H100 (80GB) can handle 1M token context
  • An A100 (40GB) can handle approximately 500K tokens
  • Long-context inference on consumer GPUs may become feasible in the future

3. Inference Speed

Thanks to O(1) memory retrieval, DeepSeek V4's Time-To-First-Token (TTFT) in long-context scenarios is approximately 73% lower than with a full-attention implementation.

Impact on Long Document Processing and Multi-turn Conversations

Long Document Processing

Engram brings a qualitative leap to long document processing:

Traditional approach: Split document → Process separately → Merge results (severe information loss)

Engram approach: Stream-read document → Write to memory in real-time → Maintain global understanding

In practical testing, DeepSeek V4's performance on long document tasks:

| Task | Traditional RAG | Full Attention (128K limit) | Engram (1M+) |
|---|---|---|---|
| 200-page Contract Review | 72.3% | 85.1% | 93.7% |
| Entire Book Summarization | 68.5% | N/A (exceeds length) | 91.2% |
| Cross-document Citation | 61.2% | N/A | 88.6% |
| Code Repository Understanding | 55.8% | N/A | 86.4% |

Multi-turn Conversations

In multi-turn conversation scenarios, Engram's advantages become even more pronounced:

  • Unlimited conversation history: No longer need to truncate old conversations or use summary compression
  • Precise recall: Can accurately recall details mentioned hundreds of turns ago
  • Personality consistency: Maintains character setting consistency in long conversations through memory
  • Task continuity: Maintains complete project context in scenarios like extended collaborative programming

Significance for AI Agents and Workflows

The Engram memory system brings revolutionary possibilities for AI Agent scenarios:

1. Persistent Agent Memory

Traditional Agents lose all context after each session ends. Engram enables Agents to:

  • Maintain user preferences and interaction history across sessions
  • Accumulate project knowledge, becoming more "knowledgeable" about users over time
  • Learn from historical mistakes, avoiding repetition of similar issues

2. Complex Workflow Processing

For workflows requiring processing of large document volumes (such as legal document review, code auditing, academic literature reviews), Engram can:

  • Understand an entire document collection in a single processing pass
  • Maintain consistency and correlation across documents
  • Support incremental updates without reprocessing everything

3. Multi-Agent Collaboration

In multi-Agent systems, Engram can serve as a shared memory layer:

  • Agent A's findings can be written to shared memory
  • Agent B can directly retrieve A's findings from shared memory
  • Dramatically reduces inter-Agent communication overhead
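The shared-memory pattern in the list above can be sketched as a store that agents write to and read from. The `SharedMemory` interface, the topic keys, and the agent names are hypothetical; a real system would store and retrieve memory vectors rather than strings.

```python
class SharedMemory:
    """Toy shared store keyed by topic (hypothetical interface)."""

    def __init__(self):
        self._entries = {}

    def write(self, agent: str, topic: str, finding: str) -> None:
        """Record a finding together with the agent that produced it."""
        self._entries.setdefault(topic, []).append((agent, finding))

    def retrieve(self, topic: str) -> list:
        """Return a copy of all findings recorded for a topic."""
        return list(self._entries.get(topic, []))

shared = SharedMemory()

# Agent A records a finding; Agent B reads it without messaging A directly
shared.write("agent_a", "auth-module", "token expiry is not checked")
findings_for_b = shared.retrieve("auth-module")
```

Agent B obtains the finding with a single lookup, and neither agent needs to know the other exists.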

Performance Data and Benchmarks

RULER Long-Context Evaluation

RULER is a widely used long-context evaluation benchmark:

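| Test Length | GPT-5.4 | Claude Opus 4.6 | Gemini 2.5 Pro | DeepSeek V4 |
|---|---|---|---|---|
| 32K | 94.2 | 96.1 | 95.3 | 96.8 |
| 128K | 88.7 | 93.5 | 91.2 | 95.1 |
| 256K | 82.1 | 89.3 | 87.6 | 94.3 |
| 512K | N/A | N/A | 83.1 | 93.7 |
| 1M | N/A | N/A | 78.5 | 92.4 |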

DeepSeek V4 achieves the best scores across all test lengths, with particularly notable advantages at 512K and 1M lengths — precisely the range where the Engram memory system comes into play.

Needle-in-a-Haystack Test

In the standard Needle-in-a-Haystack test, DeepSeek V4 demonstrates near-perfect information retrieval capabilities:

  • 1M token context, single needle retrieval: 99.2% accuracy
  • 1M token context, multi-needle retrieval (10 needles): 97.8% accuracy
  • 2M token context (Engram mode), single needle retrieval: 96.5% accuracy
  • 5M token context (Engram mode), single needle retrieval: 93.1% accuracy
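For readers unfamiliar with the test, a needle-in-a-haystack harness places one fact at varying depths in filler text and checks whether it is recovered. The sketch below is a toy version: the filler, the needle, and the `tail_only_retriever` stand-in model are all invented for illustration and are unrelated to the figures above.

```python
def build_haystack(n_filler: int, needle: str, depth: float) -> list[str]:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end)."""
    filler = [f"filler sentence {i}" for i in range(n_filler)]
    position = int(depth * n_filler)
    return filler[:position] + [needle] + filler[position:]

def accuracy(retriever, needle: str, depths, n_filler: int = 1_000) -> float:
    """Fraction of depths at which the retriever returns the needle."""
    hits = sum(retriever(build_haystack(n_filler, needle, d)) == needle for d in depths)
    return hits / len(depths)

def tail_only_retriever(haystack):
    """Stand-in model that only 'sees' the last 100 sentences."""
    matches = [s for s in haystack[-100:] if s.startswith("The code is")]
    return matches[0] if matches else None

score = accuracy(tail_only_retriever, "The code is 7421",
                 depths=[0.0, 0.25, 0.5, 0.75, 1.0])
```

A model with limited effective context finds the needle only when it sits near the end, which is the failure mode these accuracy figures are meant to rule out.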

Inference Performance

| Metric | Full Attention (No Engram) | Engram Mode | Improvement |
|---|---|---|---|
| TTFT (1M tokens) | ~45s | ~12s | 73%↓ |
| Throughput (tokens/s) | 32 | 128 | 4x↑ |
| Memory Usage (1M) | ~128 GB | ~35 GB | 72%↓ |
| End-to-end Latency (1M summary) | ~180s | ~55s | 69%↓ |
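The Improvement column follows directly from the other two columns. A short check, using the table's approximate values as inputs:

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100.0

ttft = reduction_pct(45, 12)        # seconds
memory = reduction_pct(128, 35)     # GB
latency = reduction_pct(180, 55)    # seconds
throughput_gain = 128 / 32          # tokens/s ratio

print(f"TTFT -{ttft:.1f}%, memory -{memory:.1f}%, "
      f"latency -{latency:.1f}%, throughput {throughput_gain:.0f}x")
```

The reductions come out at about 73.3%, 72.7%, and 69.4%, so the table's 72% for memory is the truncated rather than the rounded value.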

Technical Outlook

The Engram memory system represents an important paradigm shift in LLM context management: from "brute-force window expansion" to "intelligent memory management." This approach is highly similar to how the human brain works — we don't re-read every book we've ever read when thinking; instead, we retrieve relevant information from memory.

Future development directions may include:

  1. Hierarchical Memory: A three-tier system of short-term memory (working memory) + long-term memory (Engram) + permanent memory (fine-tuning)
  2. Memory Distillation: Distilling memories accumulated across multiple conversations into more compact knowledge representations
  3. Selective Forgetting: Implementing human-like forgetting mechanisms that automatically phase out outdated or irrelevant memories
  4. Cross-modal Memory: Unifying information from different modalities (text, images, code) into the memory system

While DeepSeek V4's Engram system is still a first-generation implementation, it has already demonstrated enormous potential. As the technology iterates, there is reason to believe that "infinite context" will move from concept to reality.


This article is based on the DeepSeek V4 technical report, FlashMLA codebase analysis, and publicly available benchmark data. Some technical details may be adjusted with the official model release.
