DeepSeek V4 Architecture Deep Dive: MoE + CSA/HCA Hybrid Attention and Million-Token Context (Released)

DeepSeek V4 was officially released on April 24, 2026, fully open-sourced under the MIT license, with weights published on Hugging Face. Before launch, the model appeared under the engineering codename "MODEL1" in open repositories such as FlashMLA, sparking widespread speculation about its architecture. Now that the official details are out, this article drops the pre-release guesswork and analyzes V4's real architecture based on the actual release: how MoE (Mixture-of-Experts) + hybrid attention (CSA + HCA) delivers extreme efficiency at million-token context.

Two Versions: Pro and Flash

V4 launched in two clearly positioned versions:

Version	Total Params	Active Params	Positioning
DeepSeek-V4-Pro	1.6 trillion (1.6T)	49B	High-end reasoning and agentic coding
DeepSeek-V4-Flash	284B	13B	Faster, lower-cost scenarios

Both versions use the MoE architecture. The core idea is that the model holds a very large total parameter count (to store knowledge) but activates only a small subset of experts per token during inference (to save compute). Pro has 1.6T total parameters but activates just 49B per token (about 3%); Flash has 284B total with 13B active (about 4.6%). This is what lets DeepSeek keep strong capability while keeping inference cost low.
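
To make the total-versus-active distinction concrete, the short sketch below computes the share of parameters that participate in each token's forward pass, using only the figures from the table above. The dictionary layout and function name are illustrative choices, not part of any DeepSeek tooling.

```python
# Parameter counts in billions, as stated in the version table above.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "DeepSeek-V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters that take part in one token's forward pass."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

Compute per token scales with the active count, while memory for weights still scales with the total count, which is why the two numbers are reported separately.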

Both versions default to a 1-million-token (1M) context window, with a maximum output of about 384K tokens.

Core Architecture: Hybrid Attention (CSA + HCA)

V4's most important architectural innovation is not some "memory system" from pre-release rumors, but a hybrid attention architecture that combines two compressed attention mechanisms:

CSA (Compressed Sparse Attention): in long sequences, it performs fine-grained attention only over the truly relevant portions, using sparsification to dramatically reduce the number of token pairs that participate in computation.
HCA (Heavily Compressed Attention): it heavily compresses the key-value representations of attention, keeping distant context reachable at far lower memory and compute cost.

The engineering goal of combining the two is clear: turn million-token context from "theoretically possible" into "cost-effective in practice."
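
The two ideas can be illustrated with a toy sketch. This is not DeepSeek's CSA or HCA implementation, which this article describes only at a high level; it is a generic example assuming top-k key selection as a stand-in for sparsity and block-mean pooling as a stand-in for key-value compression. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Plain softmax attention of one query over the given keys and values."""
    weights = softmax([dot(query, k) / math.sqrt(len(query)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def top_k_sparse_attend(query, keys, values, k):
    """Toy sparsity: attend only to the k keys with the highest scores."""
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    keep = sorted(ranked[:k])
    output = attend(query, [keys[i] for i in keep], [values[i] for i in keep])
    return output, keep


def compress_blocks(vectors, block):
    """Toy compression: replace each block of vectors with its mean."""
    compressed = []
    for start in range(0, len(vectors), block):
        chunk = vectors[start:start + block]
        compressed.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return compressed


keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 0.8],
        [-1.0, 0.0], [0.1, -1.0], [0.8, 0.2], [0.0, 0.5]]
values = [[float(i)] for i in range(len(keys))]
query = [1.0, 0.0]

_, kept = top_k_sparse_attend(query, keys, values, k=3)
print("keys kept by sparse selection:", kept)
print("cache entries after compression:", len(compress_blocks(keys, block=4)))
```

In this toy setting the sparse path scores every key but runs softmax and value mixing over only 3 of 8 entries, and the compressed path stores 2 entries instead of 8. Real systems differ in how they choose what to keep and how they compress it.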

Efficiency Gains (Official Data)

In the most resource-hungry 1M-context scenario, V4's hybrid attention delivers two decisive gains:

Per-token compute ≈ 27% of V3.2: for the same length, the computation required for inference drops sharply.
KV Cache memory ≈ 10% of V3.2: the biggest memory bottleneck for long context is the KV Cache, and V4 cuts it to roughly one-tenth.

This means the same GPU (or cluster) can run full million-token context at a fraction of the previous generation's cost, instead of being crushed by memory and compute costs as sequences grow. This is the fundamental reason V4 can make ultra-long context a default capability and push API prices extremely low.
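
A rough back-of-the-envelope calculation shows what these ratios imply. The sketch below uses the two ratios quoted above; the baseline cache size and memory budget are hypothetical numbers chosen only to make the arithmetic visible, not measurements of V3.2.

```python
# Ratios quoted above for the 1M-token setting (V4 relative to V3.2).
COMPUTE_RATIO = 0.27
KV_CACHE_RATIO = 0.10


def v4_kv_cache_gb(baseline_gb: float) -> float:
    """KV cache size for V4, given the V3.2 size for the same context length."""
    return baseline_gb * KV_CACHE_RATIO


def sessions_per_budget(memory_budget_gb: float, kv_per_session_gb: float) -> int:
    """How many long-context sessions fit into a fixed KV cache memory budget."""
    return int(memory_budget_gb // kv_per_session_gb)


def compute_speedup() -> float:
    """Per-token compute reduction expressed as a multiplier."""
    return 1.0 / COMPUTE_RATIO


baseline_gb = 100.0  # hypothetical V3.2 KV cache per 1M-token session
budget_gb = 400.0    # hypothetical memory reserved for KV cache

print("V3.2 sessions:", sessions_per_budget(budget_gb, baseline_gb))
print("V4 sessions:  ", sessions_per_budget(budget_gb, v4_kv_cache_gb(baseline_gb)))
print(f"per-token compute multiplier: {compute_speedup():.1f}x less")
```

Whatever the real baseline is, a one-tenth KV cache means roughly ten times as many concurrent long-context sessions in the same cache memory, which is where the serving-cost reduction comes from.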

The Still-Real Technical Foundation: FP8 and MoE Routing

Beyond hybrid attention, V4 continues and strengthens two of DeepSeek's long-standing engineering advantages, both of which remain real after launch:

FP8 Mixed Precision

V4 makes extensive use of the FP8 low-precision numeric format in both training and inference. Compared with FP16/bfloat16, FP8 halves the bytes stored per value (1 byte instead of 2), which reduces memory footprint and bandwidth pressure, and with carefully designed scaling strategies it improves throughput while preserving model quality. DeepSeek has refined this capability since the V3 series; on V4 it compounds with hybrid attention to lower per-token cost.
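
The role of scaling can be sketched in a few lines. The example below simulates per-tensor scaling into the FP8 E4M3 range in pure Python. It is a simplified illustration, assuming a single scale per tensor and ignoring subnormals and the minimum exponent; it does not reproduce DeepSeek's actual scaling strategy, which this article does not detail.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3_grid(x: float) -> float:
    """Round to 4 significant bits (1 implicit + 3 mantissa bits), then clamp."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)   # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    rounded = round(mantissa * 16) / 16  # keep 4 significant bits
    value = math.ldexp(rounded, exponent)
    return max(-E4M3_MAX, min(E4M3_MAX, value))


def quantize(values):
    """Per-tensor scaling: map the largest magnitude onto E4M3_MAX, then round."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3_grid(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Undo the scaling to recover approximate original values."""
    return [q / scale for q in quantized]


weights = [0.5, -1.0, 0.013, 2.0]
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)
for original, approx in zip(weights, restored):
    print(f"{original:>8.4f} -> {approx:>8.4f}")
```

Without the scale factor, small values such as 0.013 would fall far below the range FP8 represents well; scaling first uses the format's limited dynamic range where the data actually lives.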

MoE Expert Routing

MoE efficiency depends on routing quality, meaning how accurately each token is assigned to the most suitable experts. V4 continues to optimize routing strategy and load balancing so that the 1.6T (Pro) / 284B (Flash) parameters are scheduled efficiently and stably, avoiding the compute waste that comes from skewed expert loads.
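
As an illustration of what routing and load skew mean, here is a minimal sketch of generic top-k expert routing with a simple imbalance measure. The expert count, the value of k, and the router scores are made up for the example; the article does not state V4's expert count or routing algorithm.

```python
from collections import Counter


def route_top_k(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)), key=lambda e: router_scores[e], reverse=True)
    return ranked[:k]


def expert_load(batch_scores, k):
    """Count how many token assignments each expert receives in a batch."""
    load = Counter()
    for scores in batch_scores:
        load.update(route_top_k(scores, k))
    return load


def imbalance(load, num_experts):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(load.values()) / num_experts
    return max(load.values()) / mean


# Hypothetical router scores: 4 tokens, 4 experts, 2 experts chosen per token.
batch = [
    [0.9, 0.1, 0.5, 0.2],
    [0.8, 0.3, 0.1, 0.6],
    [0.7, 0.6, 0.2, 0.1],
    [0.1, 0.2, 0.9, 0.8],
]
load = expert_load(batch, k=2)
print("assignments per expert:", dict(load))
print("imbalance factor:", imbalance(load, num_experts=4))
```

Here expert 0 receives three of the eight assignments while expert 1 receives one. In a distributed deployment the busiest expert sets the pace, so load-balancing techniques aim to push this factor toward 1.0.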

Note: before launch, the community circulated claims about features such as an "Engram memory system," "DeepSeek Sparse Attention (DSA) as a standalone selling point," and "System 2 pause-and-think." None of these is an officially confirmed V4 architecture feature. What V4 actually uses to achieve low-cost ultra-long context is the CSA + HCA hybrid attention described above.

Comparison with V3 / V3.2

Feature	DeepSeek-V3.2	DeepSeek-V4 (Released)	Change
Architecture	MoE + MLA	MoE + Hybrid Attention (CSA+HCA)	Attention mechanism upgraded
Versions	Single flagship	Pro (1.6T/49B) + Flash (284B/13B)	Two-tier lineup
Context	Shorter	1M tokens (default)	Long context becomes default
Per-token compute (1M)	Baseline	≈ 27% of V3.2	Sharply lower
KV Cache memory (1M)	Baseline	≈ 10% of V3.2	Sharply lower
Numeric precision	FP8 etc.	FP8 (continued, strengthened)	Continued optimization
License	Open source	MIT (open source)	Open source

V4 doesn't simply "crank up the context"; it rewrites the cost structure of long context at the attention-mechanism level, turning million tokens from an expensive experimental capability into a routinely usable default.

Real Benchmark Results

V4-Pro's measured benchmarks after release (not "expected/target"):

Benchmark	Score	Notes
SWE-bench Verified	80.6%	Highest among open models, tied with Gemini 3.1 Pro
LiveCodeBench Pass@1	93.5	Real coding ability
Codeforces rating	3206	Competitive programming
MMLU-Pro	87.5%	General knowledge reasoning
GPQA Diamond	90.1%	Graduate-level science
GSM8K	92.6%	Math word problems
Terminal-Bench 2.0	67.9%	Terminal/agentic tasks

The 80.6% on SWE-bench Verified matters most: the benchmark measures whether a model can actually fix issues in real code repositories, and V4 scores highest among open models, tied with the closed frontier model Gemini 3.1 Pro. This matches V4's focus on agentic coding plus million-token context: load an entire codebase at once, then understand and modify it across files.

API Pricing

After a ~75% reduction, V4's pricing sits at a long-term low:

Version	Input (per 1M tokens)	Output (per 1M tokens)
V4-Pro	$0.435	$0.87
V4-Flash	$0.14	$0.28

Compared with closed frontier models, V4 typically costs roughly 5–30x less while holding the same tier of capability, fundamentally changing the cost structure for large-scale, long-context, agentic-coding workloads.
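
To see what these prices mean for a single long-context request, the sketch below computes the cost from the table above. The token counts in the example are hypothetical, and the calculation assumes flat per-token pricing with no caching discounts or other adjustments.

```python
# USD per 1M tokens, taken from the pricing table above.
PRICES = {
    "V4-Pro": {"input": 0.435, "output": 0.87},
    "V4-Flash": {"input": 0.14, "output": 0.28},
}


def request_cost_usd(version: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request under flat per-token pricing (an assumption)."""
    price = PRICES[version]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical request: an 800K-token codebase in, a 20K-token answer out.
for version in PRICES:
    cost = request_cost_usd(version, input_tokens=800_000, output_tokens=20_000)
    print(f"{version}: ${cost:.4f}")
```

At these rates, even a request that fills most of the context window costs well under one dollar on either version.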

How to Use

V4 is available now through the following channels:

chat.deepseek.com: offers Expert Mode and Instant Mode.
Official API: use the model name deepseek-v4-pro. Note that the legacy models deepseek-chat and deepseek-reasoner will be retired on July 24, 2026, so migrate before that date.
Atlas Cloud: provides V4 access as well.

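API call example (a minimal sketch using the OpenAI-compatible Python SDK; the DEEPSEEK_API_KEY variable name and the placeholder text are assumptions):

import os
from openai import OpenAI

# DeepSeek's API is OpenAI-compatible, so the standard client works with a custom base URL.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")
whole_repo_as_text = "..."  # placeholder: concatenated source files, up to ~1M tokens

# Call V4-Pro and load an entire codebase into the 1M-token context at once
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a senior engineer doing cross-file refactoring."},
        {"role": "user", "content": whole_repo_as_text},
    ],
)
print(response.choices[0].message.content)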
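
One way to build the whole_repo_as_text value used in the call is sketched below: concatenate source files with a path header and do a rough size check before sending. The 4-characters-per-token ratio is a crude heuristic assumed for illustration, real tokenizers vary, and the helper names and the file header format are hypothetical.

```python
from pathlib import Path

CONTEXT_LIMIT_TOKENS = 1_000_000  # context window stated in this article
CHARS_PER_TOKEN = 4               # rough heuristic (assumption); real tokenizers vary


def repo_to_text(root: str, suffixes=(".py", ".md")) -> str:
    """Concatenate matching files under root, each preceded by a path header."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            body = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"### FILE: {rel}\n{body}")
    return "\n\n".join(parts)


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_context(text: str, reserve_for_output: int = 0) -> bool:
    """Check the estimate against the context limit, leaving room for the reply."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_LIMIT_TOKENS


whole_repo_as_text = repo_to_text(".")
print("estimated tokens:", estimate_tokens(whole_repo_as_text))
print("fits with 50K reserved for output:", fits_in_context(whole_repo_as_text, 50_000))
```

For an exact count, the estimate would need to be replaced with the model's own tokenizer; the heuristic is only good enough to catch a repository that is far too large.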

Conclusion

DeepSeek V4 defines the next-generation open-source flagship in a pragmatic rather than flashy way:

MoE two-version lineup: Pro (1.6T/49B) for high-end reasoning and agentic coding, Flash (284B/13B) for high-speed, low-cost use.
CSA + HCA hybrid attention: cuts per-token compute at million-token context to ~27% of V3.2 and KV Cache memory to ~10%, turning ultra-long context from an expensive experiment into a daily default.
Strong coding ability: SWE-bench Verified 80.6%, highest among open models, tied with Gemini 3.1 Pro.
Fully open source (MIT) + very low pricing: Pro $0.435/$0.87, Flash $0.14/$0.28 (per 1M tokens).

The era of the "MODEL1" codename is over. As a formally released, immediately usable open model, V4 delivers "low-cost ultra-long context + agentic coding" directly into developers' hands.

Sources

The information in this article is drawn from DeepSeek's official release (2026-04-24) and related public sources:

DeepSeek official website
DeepSeek open-source weights on Hugging Face
chat.deepseek.com / official API docs / Atlas Cloud

Disclaimer: Model architecture and pricing are subject to DeepSeek's official release; some third-party benchmark figures may change as evaluations are updated.

Last updated: April 25, 2026
