How DeepSeek V4 Delivers Low-Cost Million-Token Context: CSA + HCA Hybrid Attention Explained

On April 24, 2026, DeepSeek officially released and open-sourced (MIT license) DeepSeek V4 in two variants: V4-Pro (1.6 trillion total parameters / 49B active parameters), aimed at high-end reasoning and agentic coding, and the faster, cheaper V4-Flash (284B total parameters / 13B active parameters). Both default to a 1 million (1M) token context window, with a maximum output of roughly 384K tokens.

What turns million-token context from an expensive lab feature into a cheap, everyday capability is not the rumored "infinite memory system" (nicknamed Engram in pre-release chatter, and never more than a rumored name). It is the hybrid attention architecture V4 actually ships: CSA (Compressed Sparse Attention) + HCA (Heavily Compressed Attention). This article focuses on that mechanism: how it cuts compute and memory at 1M context, and what that means for long documents, entire codebases, and long-conversation memory.

The Fundamental Dilemma of Long Context

O(n²) Attention Complexity: The Insurmountable Computational Wall

The standard Transformer's self-attention mechanism has a complexity of O(n²), where n is the sequence length. This means:

Context Length	Attention Computation	KV Cache Memory (FP16)	Inference Latency
4K tokens	16M operations	~0.5 GB	~50ms
32K tokens	1B operations	~4 GB	~400ms
128K tokens	16B operations	~16 GB	~6s
1M tokens	1T operations	~125 GB	~6min

When context expands from 4K to 1M, attention computation increases roughly 62,500x (250²), while KV Cache memory grows linearly, by about 250x. The figures above are illustrative orders of magnitude; exact values depend on model size and hardware. Optimizations like FlashAttention and Ring Attention reduce memory traffic or spread the work across devices, but they do not change the fundamental quadratic growth in compute, which is why million-token context has long been "a game only those who can afford the compute get to play."
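The difference between quadratic compute and linear cache growth can be checked with a few lines of arithmetic. This is a minimal sketch; the layer count, KV head count, and head dimension are hypothetical values chosen for illustration, not the shape of any DeepSeek model.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention over n tokens."""
    return n_tokens * n_tokens


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer (FP16 = 2 bytes)."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, used only to make the scaling visible.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for n in (4_000, 32_000, 128_000, 1_000_000):
    pairs = attention_pair_count(n)
    gib = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{n:>9,} tokens: {pairs:.2e} pairs, {gib:8.2f} GiB KV cache")

compute_growth = attention_pair_count(1_000_000) // attention_pair_count(4_000)
cache_growth = (kv_cache_bytes(1_000_000, LAYERS, KV_HEADS, HEAD_DIM)
                // kv_cache_bytes(4_000, LAYERS, KV_HEADS, HEAD_DIM))
print(compute_growth, cache_growth)  # 62500 250
```

Compute grows with the square of the length ratio, while the cache grows only with the ratio itself.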

Limitations of Existing Solutions

Sliding Window Attention

# Sliding window illustration (window size w)
# Each token only attends to w tokens before and after
Attention range: [i-w, i+w]
Complexity: O(n·w)  # Linear, but loses long-range dependencies

Sliding windows reduce complexity to linear, but a token can no longer attend directly to anything outside its window. For long-document tasks that require cross-chapter reasoning, this is a serious limitation.
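The trade-off can be made concrete with a small sketch of the window rule above. The function names and the sizes used here are illustrative assumptions, not part of any library.

```python
def sliding_window_pairs(n_tokens: int, w: int) -> int:
    """Count (query, key) pairs when token i attends to positions [i-w, i+w]."""
    total = 0
    for i in range(n_tokens):
        lo = max(0, i - w)
        hi = min(n_tokens - 1, i + w)
        total += hi - lo + 1
    return total


def can_attend(i: int, j: int, w: int) -> bool:
    """True if position j falls inside token i's window."""
    return abs(i - j) <= w


n, w = 10_000, 256
print(sliding_window_pairs(n, w) < n * n)  # True: about n * (2w + 1) pairs, not n * n
print(can_attend(9_000, 8_900, w))         # True: a nearby token is visible
print(can_attend(9_000, 100, w))           # False: an early token is never scored
```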

Static Sparse Attention

Traditional sparse attention reduces computation through predefined sparse patterns (e.g., local + global), but suffers from two problems:

Sparse patterns are static and cannot dynamically adjust based on content
Critical information may fall exactly in the positions that are sparsified away
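The second problem is easy to reproduce. The sketch below assumes a causal local + global pattern in which every 512th position is global; the stride, window size, and the position of the "critical" token are all hypothetical.

```python
def static_pattern(i: int, local_w: int, global_stride: int) -> set[int]:
    """Key positions token i may attend to under a fixed local + global pattern."""
    local = set(range(max(0, i - local_w), i + 1))   # recent neighbours
    global_ = set(range(0, i + 1, global_stride))    # fixed global positions
    return local | global_


visible = static_pattern(i=50_000, local_w=128, global_stride=512)
critical_position = 12_345            # hypothetical location of a key fact
print(critical_position in visible)   # False: neither local nor a multiple of 512
print(12_288 in visible)              # True: 12,288 = 24 * 512 is a global position
```

The pattern is fixed before the model sees any content, so whether the key fact is visible depends only on where it happens to sit.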

Retrieval-Augmented Generation (RAG)

RAG splits long documents into chunks and retrieves relevant segments through vector search. However, RAG is essentially a "bolt-on" system:

Retrieval quality depends on the embedding model, leading to semantic loss
Cannot handle tasks requiring holistic understanding (e.g., thematic analysis of an entire book)
Chunk boundary splitting may break contextual coherence
Adds system complexity and latency
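The chunk-boundary problem in particular can be shown in a few lines. This sketch assumes naive fixed-size character chunking (real pipelines usually chunk by tokens, often with overlap), and the sentence is an invented example.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


doc = "The indemnity cap in Section 4 does not apply to breaches of Section 9."
chunks = chunk_text(doc, 40)
for c in chunks:
    print(repr(c))

# A retriever matching "Section 9" returns only the second chunk,
# which no longer contains the negation "does not".
hit = [c for c in chunks if "Section 9" in c][0]
print("does not" in hit)  # False
```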

V4's Answer: CSA + HCA Hybrid Attention

DeepSeek V4 did not adopt any single one of the approaches above. Instead, on top of an MoE (Mixture-of-Experts) backbone, it builds a hybrid attention architecture that combines two complementary attention mechanisms — preserving long-range information while dramatically cutting compute and memory.

CSA: Compressed Sparse Attention

CSA addresses the question of "which tokens are worth spending full-precision compute to attend to." It dynamically partitions the sequence into compressed blocks, performs content-driven sparse selection over block-level representations, and only expands fine-grained attention over the regions that are genuinely relevant.

Unlike static sparse attention, CSA's sparsity is a content-driven dynamic selection rather than a fixed pattern. This means critical information is no longer discarded just because it happens to land in the "blind spot" of a fixed sparse pattern.
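The general idea of content-driven block selection can be sketched as follows. This is not DeepSeek's implementation: it assumes fixed-size blocks, mean-pooled block summaries, and a simple top-k rule, all of which are illustrative simplifications, and every function name is hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(query, keys, values, block_size, top_k):
    # 1. Partition positions into blocks and summarize each block by its mean key.
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]
    summaries = [mean_vector([keys[i] for i in b]) for b in blocks]
    # 2. Score the summaries against the query and keep the top_k blocks.
    ranked = sorted(range(len(blocks)),
                    key=lambda b: dot(query, summaries[b]), reverse=True)
    chosen = sorted(ranked[:top_k])
    # 3. Run token-level attention only inside the chosen blocks.
    idx = [i for b in chosen for i in blocks[b]]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) / scale for i in idx])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx))
           for d in range(len(values[0]))]
    return out, chosen


keys = [[0.0, 1.0] for _ in range(8)]
keys[4] = keys[5] = [5.0, 0.0]            # the only block aligned with the query
values = [[float(i)] for i in range(8)]
out, chosen = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1)
print(chosen, out)                         # [2] [4.5]
```

Which block gets fine-grained attention is decided by the query and the key content, not by a position rule fixed in advance.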

HCA: Heavily Compressed Attention

HCA addresses the question of "how to keep the KV Cache from exploding with sequence length." It heavily compresses the key-value representations, retaining only a compact compressed state in memory, thereby driving down the KV Cache memory footprint of ultra-long context.
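As a rough illustration of what "storing less" means, the sketch below mean-pools every group of consecutive key/value vectors into a single cached entry. The pooling method and the ratio of 10 are assumptions made for this example; the source does not describe how V4 actually compresses its KV representations.

```python
def compress_kv(keys, values, ratio):
    """Mean-pool every `ratio` consecutive key/value vectors into one cached entry."""
    def pool(rows):
        dim = len(rows[0])
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    compressed_keys, compressed_values = [], []
    for start in range(0, len(keys), ratio):
        compressed_keys.append(pool(keys[start:start + ratio]))
        compressed_values.append(pool(values[start:start + ratio]))
    return compressed_keys, compressed_values


keys = [[float(i), 1.0] for i in range(1000)]
values = [[float(i)] for i in range(1000)]
ck, cv = compress_kv(keys, values, ratio=10)
print(len(keys), "->", len(ck))  # 1000 -> 100 cached entries
print(cv[0])                     # [4.5], the mean of values 0..9
```

The cache shrinks by the pooling ratio, at the cost of token-level detail inside each group.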

How They Work Together

CSA "computes less" (reducing per-token attention computation) while HCA "stores less" (reducing KV Cache memory). Working together, they let V4 hit the officially published efficiency figures at 1M context:

Metric	Relative to baseline (V3.2)	Meaning
Per-token compute	~27%	Less than a third of the compute to process the same context length
KV Cache memory	~10%	About one-tenth the memory for the same context length

In other words, when processing 1 million tokens, V4's per-token compute is about 27% of V3.2's and its KV Cache memory is about 10% of V3.2's. This is not the brute force of "making the window bigger"; it is structural savings from rethinking the attention mechanism itself.
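Turned into reduction factors, the two published ratios look like this. The baseline KV Cache size is a hypothetical number used only to show the arithmetic.

```python
COMPUTE_RATIO = 0.27   # V4 per-token compute relative to V3.2 (published figure)
KV_CACHE_RATIO = 0.10  # V4 KV Cache memory relative to V3.2 (published figure)


def reduction_factor(ratio: float) -> float:
    """How many times smaller a cost is, given the cost as a fraction of baseline."""
    return 1.0 / ratio


def scaled(baseline: float, ratio: float) -> float:
    """Apply a relative ratio to a baseline quantity."""
    return baseline * ratio


print(f"compute: {reduction_factor(COMPUTE_RATIO):.1f}x lower")    # 3.7x lower
print(f"KV cache: {reduction_factor(KV_CACHE_RATIO):.1f}x lower")  # 10.0x lower

baseline_kv_gb = 80.0  # hypothetical V3.2 KV Cache size for some long prompt
print(f"{scaled(baseline_kv_gb, KV_CACHE_RATIO):.1f} GB")          # 8.0 GB
```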

A note on naming: before launch, the community used names like "Engram memory system" and "DSA" to speculate about V4's long-context mechanism, but the official release on April 24 ships CSA + HCA hybrid attention. This article follows the official release facts.

Comparison with Traditional KV Cache Approaches

Dimension	Standard Full Attention + Full KV Cache	V4: CSA + HCA
Attention compute complexity	O(n²)	Near-linear (sparse block selection)
KV Cache memory	O(n), grows linearly with a large constant	Heavily compressed, ~10% of baseline
Sparsity pattern	None / static	Content-driven dynamic sparsity
Long-range dependencies	Complete but expensive	Preserves key long-range information
Million-token usability	Compute/memory cost prohibitive	Cost-friendly structure, affordable pricing

The most critical difference: traditional approaches either "see everything but cost too much" or "sacrifice long-range information to save money." CSA + HCA strikes an engineering balance between the two extremes — preserving the key associations spanning a million tokens while pushing compute and memory down to commercially viable levels.

Real Pricing: Making Million-Token Context Cheap

The ultimate meaning of efficiency shows up in price. V4's long-term API pricing (after a 75% cut) is as follows:

Variant	Input (per 1M tokens)	Output (per 1M tokens)
V4-Pro	$0.435	$0.87
V4-Flash	$0.14	$0.28

Compared to closed-source frontier models (GPT-5.4, Claude 4.6, Gemini 3.1 Pro), V4 is typically about 5–30x cheaper for comparable long-context capability. This means stuffing an entire book, an entire codebase, or hundreds of conversation turns into the context window is no longer a budget luxury — it's a routine, everyday operation.
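To see what these prices mean for a single large request, here is a small cost calculator using the table above. It is a sketch that ignores any cache-hit discounts or other billing details, which this article does not cover; the request sizes are arbitrary examples.

```python
# USD per 1M tokens, taken from the pricing table above.
PRICES_USD_PER_M = {
    "deepseek-v4-pro": {"input": 0.435, "output": 0.87},
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-million-token prices."""
    p = PRICES_USD_PER_M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


for model in PRICES_USD_PER_M:
    cost = request_cost(model, input_tokens=800_000, output_tokens=20_000)
    print(f"{model}: ${cost:.4f}")
# deepseek-v4-pro: $0.3654
# deepseek-v4-flash: $0.1176
```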

The legacy deepseek-chat and deepseek-reasoner models will be retired on July 24, 2026; migration to deepseek-v4-pro / deepseek-v4-flash is recommended. Access is available via chat.deepseek.com (Expert Mode / Instant Mode), the official API, and Atlas Cloud.

What This Means in Practice

Long Document Processing

Thanks to CSA's dynamic sparsity and HCA's memory compression, V4 can take in a document of hundreds of thousands of tokens in a single pass, with no chunking and no bolt-on retrieval:

Traditional approach: Split document → Process separately → Merge results (severe information loss)

V4 approach: Read the whole document at once → Hybrid attention (CSA + HCA) covers key associations → Maintain global understanding

For tasks like 200-page contract review, whole-book summarization, and cross-document citation checking, "seeing the entire text" is itself a guarantee of quality — and CSA + HCA make "seeing the entire text" cheap.
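Before sending a whole document in one request, it helps to check that it fits. The sketch below makes two assumptions that are not confirmed by the release information: that roughly 4 characters make one token, and that input and reserved output share the same 1M-token window.

```python
CONTEXT_WINDOW = 1_000_000  # tokens; default window stated for both V4 variants


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; 4 chars/token is an assumed heuristic."""
    return int(len(text) / chars_per_token)


def fits_in_one_pass(doc_tokens: int, prompt_tokens: int, reserved_output: int) -> bool:
    """Assumes input and output share one window (an assumption, not confirmed)."""
    return doc_tokens + prompt_tokens + reserved_output <= CONTEXT_WINDOW


document = "x" * 2_400_000                 # stand-in for a very long document
doc_tokens = estimate_tokens(document)     # 600,000 under the 4 chars/token guess
print(fits_in_one_pass(doc_tokens, prompt_tokens=2_000, reserved_output=16_000))  # True
print(fits_in_one_pass(1_200_000, prompt_tokens=2_000, reserved_output=16_000))   # False
```

For real use, the model's own tokenizer should replace the character heuristic.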

An Entire Codebase

Loading a medium-to-large code repository (hundreds of thousands to over a million tokens) into the context window at once lets the model perform cross-file refactoring, bug localization, and agentic coding within the complete project context. The combination of long context, efficiency, and strong coding ability is one of the foundations behind V4-Pro's 80.6% on SWE-bench Verified (the highest among open-source models, tied with Gemini 3.1 Pro).

Long-Conversation Memory

In multi-turn conversation and long-term collaboration scenarios, the 1M-token context paired with low pricing lets the model:

Retain the complete conversation history rather than truncating or summarizing it
Accurately recall details mentioned hundreds of turns ago
Maintain complete project context throughout extended collaborative programming

It's worth emphasizing: this is a capability of the long context window itself, made affordable by CSA + HCA — not some separate "persistent memory database."

V4 Key Benchmark Results

V4-Pro's published results on mainstream benchmarks:

Benchmark	DeepSeek V4-Pro
SWE-bench Verified	80.6% (highest open-source, tied with Gemini 3.1 Pro)
LiveCodeBench Pass@1	93.5
Codeforces Rating	3206
MMLU-Pro	87.5%
GPQA Diamond	90.1%
GSM8K	92.6%
Terminal-Bench 2.0	67.9%

These results are consistent with the "million-token context + extreme efficiency" story: long context is not an isolated selling point but the infrastructure underpinning V4's agentic coding and complex reasoning capabilities.

Technical Outlook

CSA + HCA represents a pragmatic direction for LLM long-context management: rather than piling on compute to "make the window bigger," it structurally redesigns the attention mechanism so that long context becomes affordable in both compute and memory. When the marginal cost of million-token context is low enough, "just put all the relevant information in" becomes the default, rather than an engineering trade-off you have to weigh repeatedly.

As the architecture continues to iterate, there's room for the cost of long context to fall further — and V4's CSA + HCA has already turned "low-cost million-token context" from a concept into a reality you can use, and afford, today.

This article is based on the V4 information officially released by DeepSeek on 2026-04-24 (architecture, context, pricing, benchmarks). Some third-party benchmark data may change as evaluations are updated.
