DeepSeek Local Deployment Complete Guide: From Beginner to Expert

A comprehensive guide on deploying DeepSeek models locally, covering Ollama, vLLM, and Docker methods, hardware requirements, quantization options, performance benchmarks, and troubleshooting.

DeepSeek AI Team | 2026-03-08 | 12 min read
#deepseek #local-deployment #ollama #docker #tutorial

With the open-source release of DeepSeek models, more developers and enterprises are looking to run these powerful AI models in their local environments. This guide walks you through three mainstream local deployment methods from scratch, helping you choose the best approach for your use case.

Why Deploy Locally?

Before committing to local deployment, let's understand its core advantages:

Data Privacy & Security

Local deployment keeps all your data (prompts, conversations, business documents) on your own hardware. For industries that handle sensitive information, such as finance, healthcare, and legal, this simplifies compliance: there is no data in transit to leak and no third-party data processing agreement to depend on.

Ultra-Low Latency

Local inference eliminates the network round trip. API calls typically add 200-500ms of network overhead before the first token arrives, whereas with a local model the time to first token depends only on your hardware (roughly 50-500ms in the benchmarks later in this guide). For real-time applications such as code completion and conversational assistants, the difference is noticeable.

Long-Term Cost Advantage

While the initial hardware investment is substantial, local deployment costs far less than API calls for high-frequency usage scenarios over time. Here's a comparison for 1 million tokens per day:

| Solution | Monthly Cost | Annual Cost |
|---|---|---|
| DeepSeek API Calls | ~$300 | ~$3,600 |
| Local (RTX 4090) | ~$15 (electricity) | ~$180 + one-time hardware |
| Local (Mac Studio M4 Ultra) | ~$8 (electricity) | ~$96 + one-time hardware |

Offline Availability

Local deployment lets you use AI capabilities without an internet connection — on airplanes, in remote areas, or within air-gapped networks.


Hardware Requirements

Different model sizes have different hardware demands. Here are detailed recommended configurations:

NVIDIA GPUs

NVIDIA GPUs offer the most mature local deployment ecosystem with excellent CUDA support and compatibility.

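| Model | Min VRAM | Recommended VRAM | Recommended GPU |
|---|---|---|---|
| DeepSeek-R1-1.5B (4-bit) | 2GB | 4GB | RTX 3060 |
| DeepSeek-R1-7B (4-bit) | 6GB | 8GB | RTX 4060 |
| DeepSeek-R1-8B (4-bit) | 6GB | 8GB | RTX 4070 |
| DeepSeek-R1-14B (4-bit) | 10GB | 12GB | RTX 4070 Ti |
| DeepSeek-R1-32B (4-bit) | 20GB | 24GB | RTX 4090 |
| DeepSeek-R1-70B (4-bit) | 40GB | 48GB | A6000 / 2x RTX 4090 |
| DeepSeek-V3 (4-bit) | ~350GB | ~400GB+ | 8x A100 80GB |

DeepSeek-V3 has 671B parameters, so its 4-bit weights alone occupy about 335GB (671B x 0.5 bytes).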
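
The VRAM figures in the table follow from simple arithmetic: weights take roughly (parameters x bits per weight / 8) bytes, plus headroom for the KV cache and runtime buffers. The sketch below shows that calculation. The function names and the 20% overhead factor are assumptions chosen for illustration, not values from any tool, and real usage varies with context length and runtime.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_fraction: float = 0.2) -> float:
    """Weights plus an assumed fractional overhead for KV cache and buffers."""
    return estimate_weight_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)


# Rough estimates for the 4-bit R1 distillations listed above
for name, params in [("7B", 7), ("14B", 14), ("32B", 32), ("70B", 70)]:
    weights = estimate_weight_gb(params, 4)
    total = estimate_vram_gb(params, 4)
    print(f"R1-{name}: weights ~{weights:.1f} GB, with overhead ~{total:.1f} GB")
```

The estimates land slightly below the table's minimums, which is expected: practical 4-bit formats such as Q4_K_M store a bit more than 4 bits per weight.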

AMD GPUs

AMD GPUs support large model inference through ROCm, with compatibility continuously improving.

| Recommended GPU | VRAM | Suitable Models |
|---|---|---|
| RX 7900 XTX | 24GB | 7B-14B |
| MI250X | 128GB | 70B |
| MI300X | 192GB | 70B; V3 only across multiple cards |

Apple Silicon

Apple Silicon's unified memory architecture offers a unique advantage for LLM inference — it can use system memory (up to 512GB) to load models.

| Chip | Unified Memory | Suitable Models | Expected Speed |
|---|---|---|---|
| M2/M3 Pro | 18-36GB | 7B-14B | 10-20 tokens/s |
| M2/M3 Max | 32-96GB | 14B-32B | 15-25 tokens/s |
| M4 Pro | 24-48GB | 14B-32B | 20-35 tokens/s |
| M4 Max | 36-128GB | 32B-70B | 25-40 tokens/s |
| M4 Ultra | 192-512GB | 70B-V3 Full | 30-50 tokens/s |

RAM Requirements

Even with GPU inference, sufficient system RAM is important for model loading and context management:

  • 7B models: Minimum 16GB, recommended 32GB
  • 14B-32B models: Minimum 32GB, recommended 64GB
  • 70B models: Minimum 64GB, recommended 128GB

Method 1: Ollama Deployment (Simplest)

Ollama is currently the simplest tool for local LLM deployment, offering one-click installation and single-command model execution.

Installing Ollama

macOS:

# Install with Homebrew
brew install ollama

Linux:

# One-line install script
curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com/download and run it.

Download and Run DeepSeek Models

# Run DeepSeek-R1 7B (recommended starting point)
ollama run deepseek-r1:7b

# Run DeepSeek-R1 14B
ollama run deepseek-r1:14b

# Run DeepSeek-R1 32B (needs 24GB+ of VRAM)
ollama run deepseek-r1:32b

# Run DeepSeek-R1 70B (needs 48GB+ of VRAM or a large-memory Mac)
ollama run deepseek-r1:70b

Using the Ollama API

Ollama serves an HTTP API at localhost:11434 by default. Besides its native endpoints, it exposes an OpenAI-compatible endpoint under /v1, so the openai Python client works against it:

import openai

# Create a client that points at the local Ollama server
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",  # Local Ollama address
    api_key="ollama",  # The client requires a value; Ollama ignores it
)

# Send a chat request
response = client.chat.completions.create(
    model="deepseek-r1:7b",  # Model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the Transformer architecture."},
    ],
    temperature=0.7,  # Controls output randomness
    max_tokens=2048,  # Maximum output length
)

# Print the reply
print(response.choices[0].message.content)
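
If you would rather not install the openai package, the same request can go to Ollama's native /api/chat endpoint using only the standard library. This is a minimal sketch: the helper names build_chat_request and chat are made up for this example, and it assumes a non-streaming request to a server on the default port.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local address


def build_chat_request(model, user_prompt, system_prompt=None, temperature=0.7):
    """Build (but do not send) a POST request for Ollama's native chat endpoint."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    body = {
        "model": model,
        "messages": messages,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"temperature": temperature},
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, user_prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, user_prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as response:
        payload = json.load(response)
    return payload["message"]["content"]
```

With the server running, print(chat("deepseek-r1:7b", "Explain the Transformer architecture.")) prints the reply.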

Common Ollama Commands

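# List downloaded models
ollama list

# Show details for a model
ollama show deepseek-r1:7b

# Delete a model to free disk space
ollama rm deepseek-r1:7b

# Start the Ollama server (runs in the foreground)
ollama serve

# Copy a model under a new name
ollama cp deepseek-r1:7b my-deepseek

# Show currently loaded models
ollama ps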

Custom Modelfile

You can customize model behavior with a Modelfile:

# Save this as a file named Modelfile
FROM deepseek-r1:7b

# Set the system prompt
SYSTEM """You are a professional programming assistant skilled in Python and JavaScript."""

# Adjust model parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

Then build and run the custom model:

# Create a custom model from the Modelfile
ollama create my-coding-assistant -f Modelfile

# Run the custom model
ollama run my-coding-assistant
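
When you maintain several assistants, it can help to generate Modelfiles from a template rather than edit them by hand. The sketch below renders the same three instructions used above (FROM, SYSTEM, PARAMETER). The function render_modelfile is a hypothetical helper written for this example, not part of Ollama.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render a Modelfile with FROM, SYSTEM, and PARAMETER instructions."""
    lines = [
        f"FROM {base_model}",
        f'SYSTEM """{system_prompt}"""',
    ]
    # One PARAMETER line per setting, in insertion order
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "deepseek-r1:7b",
    "You are a professional programming assistant skilled in Python and JavaScript.",
    {"temperature": 0.3, "top_p": 0.9, "num_ctx": 8192},
)
print(modelfile_text)
```

Write the returned text to a file named Modelfile and pass it to ollama create as shown above.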

Method 2: vLLM Deployment (High-Performance Inference)

vLLM is a high-performance LLM inference and serving framework that achieves efficient memory management through PagedAttention technology. It's particularly suitable for production environments and high-concurrency scenarios.

Installing vLLM

# Create a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM (requires an NVIDIA GPU and CUDA 12.1+)
pip install vllm

Starting the vLLM Inference Server

# Start the OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --dtype auto \
  --trust-remote-code

Advanced vLLM Configuration

# Multi-GPU tensor parallelism (for large models)
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

# Use a quantized model to reduce VRAM requirements.
# --quantization awq expects a checkpoint that was already quantized with AWQ;
# the path below is a placeholder for such a checkpoint.
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/DeepSeek-R1-Distill-Qwen-7B-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000

Calling the vLLM API

import openai

# Connect to the local vLLM server
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # Local vLLM address
    api_key="not-needed",  # No API key is needed for a local deployment
)

# Streaming example
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "user", "content": "Write a quicksort algorithm in Python"},
    ],
    stream=True,  # Enable streaming output
    temperature=0.3,
)

# Print each chunk as it arrives
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

vLLM vs Ollama Comparison

| Feature | Ollama | vLLM |
|---|---|---|
| Installation Difficulty | Very Easy | Medium |
| Performance | Good | Excellent (+20-50%) |
| Concurrency Support | Basic | Excellent (Production-grade) |
| Memory Efficiency | Average | Excellent (PagedAttention) |
| Apple Silicon | Full Support | Not Supported |
| Best For | Personal use, development | Production, high concurrency |

Method 3: Docker Deployment

Docker deployment provides excellent environment isolation and portability, making it ideal for team collaboration and production deployments.

Using the Ollama Docker Image

# Pull the official Ollama Docker image
docker pull ollama/ollama

# Run in CPU mode
docker run -d \
  --name ollama \
  -v ollama_data:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# Or run in NVIDIA GPU mode (requires nvidia-container-toolkit).
# Both containers publish port 11434, so start only one of them.
docker run -d \
  --name ollama-gpu \
  --gpus all \
  -v ollama_data:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# Download and run a model inside the container
docker exec -it ollama-gpu ollama run deepseek-r1:7b

Docker Compose Orchestration

Create a docker-compose.yml file:

version: '3.8'

services:
  # Ollama inference service
  ollama:
    image: ollama/ollama:latest
    container_name: deepseek-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama  # Persist model data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all  # Use all available GPUs
              capabilities: [gpu]
    restart: unless-stopped

  # Open WebUI: browser-based chat interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: deepseek-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434  # Connect to the Ollama service
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:  # Model storage volume
  webui_data:   # WebUI data volume

Then manage the stack with:

# Start all services
docker compose up -d

# Check service status
docker compose ps

# Follow the Ollama logs
docker compose logs -f ollama

# Stop the services
docker compose down

vLLM Docker Deployment

# Use the official vLLM Docker image.
# --ipc=host gives the container access to host shared memory, which vLLM uses between processes.
docker run -d \
  --name vllm-deepseek \
  --gpus all \
  --ipc=host \
  -v huggingface_cache:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
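
Containers started with -d return immediately, while the server inside may still be loading the model. A small readiness check avoids sending requests too early. The sketch below polls a URL until it answers with HTTP 200. The helper names are made up for this example, and the two URLs in the usage note assume the default ports used in this guide.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, timeout_s: float = 60.0,
                     interval_s: float = 1.0, probe=http_ok) -> bool:
    """Poll the URL until the probe succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False
```

For the containers above, wait_until_ready("http://localhost:11434/api/tags") checks Ollama and wait_until_ready("http://localhost:8000/v1/models") checks vLLM. Large models can take several minutes to load, so raise timeout_s accordingly.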

Choosing Quantization Versions

Quantization is a key technique for reducing model size and memory requirements. Different quantization levels offer different trade-offs between quality and resource consumption.

Quantization Precision Comparison

| Precision | Model Size (7B) | VRAM Usage | Quality Loss | Inference Speed (vs FP16) | Best For |
|---|---|---|---|---|---|
| FP16 (Original) | ~14GB | ~16GB | None | Baseline | Quality-first, ample VRAM |
| 8-bit (INT8) | ~7GB | ~9GB | Minimal | +10-20% | Balanced choice |
| 4-bit (Q4_K_M) | ~4GB | ~6GB | Small | +30-50% | Recommended for limited VRAM |
| 4-bit (Q4_0) | ~3.8GB | ~5.5GB | Small | +40-60% | Extreme VRAM constraints |
| 3-bit | ~2.8GB | ~4.5GB | Noticeable | +50-70% | Not recommended |
| 2-bit | ~2GB | ~3.5GB | Severe | +60-80% | Testing only |

How to Choose?

Recommended Strategy:

  1. VRAM >= 1.2x model FP16 size — Use FP16 for best quality
  2. VRAM tight but > INT8 model size — Use 8-bit quantization
  3. Limited VRAM — Use 4-bit quantization (Q4_K_M), the best value choice
  4. Extreme scenarios — Use Q4_0, accept slight quality degradation
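
The four rules above can be written as a small function. This is a sketch of the strategy as stated, not a tool recommendation. The model sizes are passed in explicitly (the defaults are the 7B sizes from the comparison table), and the cut-off between rules 3 and 4 is an assumption: Q4_K_M is chosen whenever VRAM exceeds its file size.

```python
def choose_quantization(vram_gb: float, fp16_gb: float = 14.0,
                        int8_gb: float = 7.0, q4_k_m_gb: float = 4.0) -> str:
    """Pick a quantization level from available VRAM, following rules 1-4."""
    if vram_gb >= 1.2 * fp16_gb:   # rule 1: room for FP16 plus headroom
        return "FP16"
    if vram_gb > int8_gb:          # rule 2: tight, but the INT8 model fits
        return "INT8"
    if vram_gb > q4_k_m_gb:        # rule 3: limited VRAM
        return "Q4_K_M"
    return "Q4_0"                  # rule 4: extreme constraints


for vram in (24, 12, 6, 4):
    print(f"{vram} GB VRAM -> {choose_quantization(vram)}")
```

Because the rules compare VRAM with model size rather than with the "VRAM Usage" column, a card that only just clears a threshold may still run out of memory at long context lengths; treat the result as a starting point.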

Quantization in Ollama

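# Ollama's default tags use Q4_K_M quantization, which suits most use cases
ollama run deepseek-r1:7b

# Select a specific quantization by tag. Full tag names include the distillation base;
# check the tags page at ollama.com/library/deepseek-r1 for the current list.
ollama run deepseek-r1:7b-qwen-distill-q8_0    # 8-bit quantization
ollama run deepseek-r1:7b-qwen-distill-q4_K_M  # 4-bit quantization (default)
ollama run deepseek-r1:7b-qwen-distill-fp16    # FP16, original precision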

Performance Benchmarks

Below are real-world performance measurements for running DeepSeek models on different hardware configurations (tokens/s, generation speed):

DeepSeek-R1-7B (4-bit Quantization)

| Hardware | First Token Latency | Generation Speed | Notes |
|---|---|---|---|
| RTX 3060 12GB | ~150ms | 35-45 tokens/s | Entry-level GPU |
| RTX 4060 8GB | ~120ms | 45-55 tokens/s | Best value |
| RTX 4070 Ti 12GB | ~80ms | 60-75 tokens/s | Recommended |
| RTX 4090 24GB | ~50ms | 90-110 tokens/s | Top performance |
| M3 Pro 18GB | ~200ms | 18-25 tokens/s | MacBook Pro |
| M4 Pro 24GB | ~150ms | 28-35 tokens/s | Latest Mac |
| M4 Max 48GB | ~100ms | 35-45 tokens/s | High-end Mac |

DeepSeek-R1-32B (4-bit Quantization)

| Hardware | First Token Latency | Generation Speed | Notes |
|---|---|---|---|
| RTX 4090 24GB | ~200ms | 25-35 tokens/s | Just fits |
| A6000 48GB | ~150ms | 35-45 tokens/s | Professional GPU |
| 2x RTX 4090 | ~180ms | 40-55 tokens/s | Dual GPU parallel |
| M4 Max 64GB | ~300ms | 18-25 tokens/s | Unified memory advantage |
| M4 Ultra 192GB | ~200ms | 30-40 tokens/s | Most powerful Mac |

DeepSeek-R1-70B (4-bit Quantization)

| Hardware | First Token Latency | Generation Speed | Notes |
|---|---|---|---|
| 2x RTX 4090 48GB | ~500ms | 12-18 tokens/s | Barely fits |
| A100 80GB | ~300ms | 25-35 tokens/s | Data center grade |
| 2x A100 80GB | ~200ms | 40-55 tokens/s | High concurrency recommended |
| M4 Ultra 192GB | ~400ms | 15-22 tokens/s | One Mac running 70B |
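
To produce comparable numbers on your own hardware, you can compute generation speed from the timing fields that Ollama includes in a non-streaming /api/generate or /api/chat response (eval_count, plus eval_duration, prompt_eval_duration, and load_duration in nanoseconds). The sketch below only does the arithmetic. The sample dictionary is made-up data for illustration, not a measurement.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed: generated tokens divided by generation time."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def summarize_timings(response: dict) -> dict:
    """Extract speed and timing figures from an Ollama response dictionary."""
    return {
        "generation_tokens_per_s": tokens_per_second(
            response["eval_count"], response["eval_duration"]
        ),
        "prompt_eval_ms": response["prompt_eval_duration"] / 1e6,
        "load_ms": response["load_duration"] / 1e6,
    }


# Illustrative values only (not a real measurement)
sample_response = {
    "eval_count": 90,
    "eval_duration": 2_000_000_000,       # 2 s spent generating
    "prompt_eval_duration": 150_000_000,  # 150 ms spent on the prompt
    "load_duration": 1_200_000_000,       # 1.2 s spent loading the model
}
print(summarize_timings(sample_response))
```

Prompt evaluation time is close to, but not identical to, first-token latency as a client sees it, since the latter also includes request handling.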

Performance on Apple Silicon (M4 Ultra)

The Apple M4 Ultra is currently one of the most powerful local inference platforms available to individual users. With 192GB of unified memory, it can run 70B-class models and, with heavily quantized weights, even attempt to load DeepSeek-V3.

Unique Advantages of M4 Ultra

  1. Unified Memory Architecture: CPU and GPU share memory with no data copying, enabling highly efficient model loading
  2. Massive Memory Bandwidth: M4 Ultra delivers up to 819.2 GB/s memory bandwidth, significantly boosting inference speed
  3. Exceptional Power Efficiency: Total system power consumption of only 60-150W, far lower than NVIDIA GPU solutions
  4. Silent Operation: Mac Studio runs nearly silent, perfect for office and home environments
  5. Works Out of the Box: Ollama natively supports Metal with no CUDA configuration needed
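
Memory bandwidth matters because generating each token of a dense model requires reading essentially all of the weights once. That gives a rough upper bound on single-stream speed: bandwidth divided by model size in memory. The sketch below applies this rule of thumb to the 819.2 GB/s figure above. The function name is made up for this example, and the bound ignores compute limits, KV cache reads, and runtime overhead.

```python
def bandwidth_bound_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s when every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


M4_ULTRA_BANDWIDTH_GB_S = 819.2  # figure quoted in the list above

# Approximate in-memory sizes of the Q4_K_M models, in GB
for name, size_gb in [("R1-7B", 5), ("R1-32B", 20), ("R1-70B", 42)]:
    bound = bandwidth_bound_tokens_per_s(M4_ULTRA_BANDWIDTH_GB_S, size_gb)
    print(f"{name}: at most ~{bound:.1f} tokens/s")
```

Large models tend to run close to this bound, while small models fall well short of it because other costs dominate.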

M4 Ultra Benchmark Results

Test environment: Mac Studio M4 Ultra, 192GB unified memory, macOS 15.4

DeepSeek-R1-7B (Q4_K_M):
  ├── Load time: 1.2s
  ├── First token: ~80ms
  ├── Generation speed: 42 tokens/s
  └── Memory usage: ~5GB

DeepSeek-R1-32B (Q4_K_M):
  ├── Load time: 8.5s
  ├── First token: ~200ms
  ├── Generation speed: 32 tokens/s
  └── Memory usage: ~20GB

DeepSeek-R1-70B (Q4_K_M):
  ├── Load time: 25s
  ├── First token: ~400ms
  ├── Generation speed: 18 tokens/s
  └── Memory usage: ~42GB

DeepSeek-V3-671B (Q4_K_M, experimental):
  ├── Load time: ~5min
  ├── First token: ~3s
  ├── Generation speed: 2-4 tokens/s
  └── Memory usage: ~170GB

M4 Ultra Deployment Recommendations

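# Install Ollama (Metal acceleration is supported natively)
brew install ollama

# Run the recommended 32B model (the best balance on M4 Ultra)
ollama run deepseek-r1:32b

# With 192GB of memory, you can try the 70B model
ollama run deepseek-r1:70b

# Allow parallel requests to make full use of the M4 Ultra
OLLAMA_NUM_PARALLEL=4 ollama serve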

Cost Comparison with API Calls

Scenario 1: Individual Developer (~50K tokens/day)

| Solution | Monthly Cost | Annual Cost | Notes |
|---|---|---|---|
| DeepSeek API | ~$22 | ~$264 | Pay-as-you-go, flexible |
| Ollama + RTX 4060 | ~$5 (electricity) | $60 + $300 (hardware) | Year 1: $360, then $60/year |
| Ollama + M4 Pro Mac | ~$3 (electricity) | $36 + $2,399 (hardware) | Cost-effective long-term |

Conclusion: For light individual use, API is more cost-effective.

Scenario 2: Small Team (~500K tokens/day)

| Solution | Monthly Cost | Annual Cost | Notes |
|---|---|---|---|
| DeepSeek API | ~$220 | ~$2,640 | Stable, no maintenance |
| vLLM + RTX 4090 | ~$15 (electricity) | $180 + $1,600 (hardware) | ROI within 1 year |
| vLLM + A6000 | ~$20 (electricity) | $240 + $4,500 (hardware) | Larger models, better concurrency |

Conclusion: For high-frequency use, local deployment recovers hardware costs within 1 year.

Scenario 3: Enterprise (~5M tokens/day)

| Solution | Monthly Cost | Annual Cost | Notes |
|---|---|---|---|
| DeepSeek API | ~$2,200 | ~$26,400 | Possible rate limits |
| vLLM + 4x A100 | ~$200 (electricity) | $2,400 + $60,000 (hardware) | ROI in about 2.5 years ($60,000 / $2,000 saved per month), full control |
| Cloud GPU (on-demand) | ~$3,000 | ~$36,000 | Flexible, no hardware maintenance |

Conclusion: For enterprise-level high-frequency use, self-hosted inference clusters are most cost-effective long-term.
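
The payback periods in these scenarios come from one formula: hardware cost divided by the monthly saving over the API. The sketch below applies it to the figures in the three tables. The function name is made up for this example, and the calculation ignores maintenance, depreciation, and staff time.

```python
def break_even_months(hardware_cost: float, api_monthly: float, local_monthly: float):
    """Months until hardware cost is recovered, or None if local never pays off."""
    monthly_saving = api_monthly - local_monthly
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


scenarios = [
    ("Ollama + RTX 4060 (individual)", 300, 22, 5),
    ("vLLM + RTX 4090 (small team)", 1_600, 220, 15),
    ("vLLM + 4x A100 (enterprise)", 60_000, 2_200, 200),
]
for name, hardware, api, local in scenarios:
    months = break_even_months(hardware, api, local)
    print(f"{name}: break-even after {months:.1f} months")
```

With the table figures this gives about 17.6 months for the RTX 4060, 7.8 months for the RTX 4090, and 30 months for the 4x A100 setup.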

Cost Decision Tree

What is your daily token usage?
├── < 10K tokens → Use API, local deployment not worth it
├── 10K-100K tokens → Depends on privacy needs
│   ├── Need privacy → Local deployment (Ollama + consumer GPU)
│   └── No privacy concerns → API is more convenient
├── 100K-1M tokens → Local deployment starts having cost advantages
│   ├── Individual/Small team → Ollama + RTX 4090
│   └── Need high concurrency → vLLM + professional GPU
└── > 1M tokens → Strongly recommend local deployment
    ├── Medium budget → vLLM + multi-GPU consumer setup
    └── Ample budget → vLLM + A100/H100 cluster
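
The decision tree above can be expressed as a function, for example to embed in a planning script. This is a direct transcription under one assumption: the tree does not say which branch owns the boundary values, so 100K and 1M tokens are assigned to the lower band here. The function and parameter names are made up for this example.

```python
def recommend_deployment(daily_tokens: int, needs_privacy: bool = False,
                         needs_high_concurrency: bool = False,
                         ample_budget: bool = False) -> str:
    """Return the decision tree's recommendation for a given usage profile."""
    if daily_tokens < 10_000:
        return "Use API"
    if daily_tokens <= 100_000:
        if needs_privacy:
            return "Local deployment (Ollama + consumer GPU)"
        return "Use API"
    if daily_tokens <= 1_000_000:
        if needs_high_concurrency:
            return "vLLM + professional GPU"
        return "Ollama + RTX 4090"
    if ample_budget:
        return "vLLM + A100/H100 cluster"
    return "vLLM + multi-GPU consumer setup"


print(recommend_deployment(50_000, needs_privacy=True))
print(recommend_deployment(500_000))
print(recommend_deployment(5_000_000, ample_budget=True))
```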

Troubleshooting Common Issues

Issue 1: Slow Model Downloads

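# OLLAMA_HOST only selects the Ollama server address; it does not set a download mirror.
# If downloads are slow, route the server's pulls through a proxy
# (proxy.example.com is a placeholder):
HTTPS_PROXY=https://proxy.example.com ollama serve

# Or download the model files manually and import them with a Modelfile
ollama create deepseek-r1:7b -f /path/to/Modelfile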

If you have HuggingFace model files, you can also specify a local GGUF file path via a Modelfile.

Issue 2: CUDA Out of Memory (OOM)

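# Lower GPU memory utilization (from 0.9 to 0.8) and reduce the context length
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --gpu-memory-utilization 0.8 \
  --max-model-len 4096

# Or use a more aggressive quantization (Q4_0 instead of Q4_K_M);
# confirm the exact tag name on the model's tags page at ollama.com
ollama run deepseek-r1:7b-q4_0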
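
Reducing the context length helps because the KV cache grows linearly with the number of tokens held in memory. The sketch below estimates that cost per sequence. The architecture numbers (28 layers, 4 key/value heads, head dimension 128, 2 bytes per value) are assumptions for a Qwen-based 7B distillation; read the real values from the model's config.json before relying on the result.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # K and V
    return per_token * num_tokens


# Assumed architecture values for illustration (check config.json)
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128

for context in (4096, 8192, 16384):
    size_mib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context) / 2**20
    print(f"context {context}: ~{size_mib:.0f} MiB of KV cache per sequence")
```

Under these assumptions, halving the context from 8192 to 4096 tokens halves the per-sequence cache from about 448 MiB to about 224 MiB, and the saving multiplies with the number of concurrent sequences.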

Issue 3: Slow Performance on Apple Silicon

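# Metal acceleration is enabled by default in Ollama.
# Print timing statistics (including the eval rate in tokens/s) after each reply
ollama run deepseek-r1:7b --verbose

# Check that the loaded model runs on the GPU (see the PROCESSOR column)
ollama ps

# Close other memory-hungry applications to free unified memory for the model,
# and check memory pressure in Activity Monitor.

# To offload more layers to the GPU, raise the num_gpu model option
# from inside the interactive session:
#   /set parameter num_gpu 999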

Issue 4: Docker Container Cannot Access GPU

# Install the NVIDIA Container Toolkit
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Verify that containers can see the GPU
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If it still fails, register the NVIDIA runtime with the Docker daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Issue 5: Poor Model Output Quality

  • Check quantization precision: If using too-low quantization (e.g., 2-bit), quality degrades noticeably. Use at least Q4_K_M
  • Adjust temperature: Use 0.1-0.3 for code tasks, 0.5-0.7 for conversation, 0.8-1.0 for creative writing
  • Review system prompt: Ensure your system prompt is clear and specific
  • Increase context length: Some tasks require a longer context window
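
The temperature guidance above can be enforced in code so that callers cannot drift outside the suggested range for a task. This is a small sketch. The task names and the clamp_temperature helper are made up for this example, and the ranges are the ones listed above.

```python
# Suggested temperature ranges per task type, taken from the list above
TEMPERATURE_RANGES = {
    "code": (0.1, 0.3),
    "conversation": (0.5, 0.7),
    "creative": (0.8, 1.0),
}


def clamp_temperature(task: str, requested: float) -> float:
    """Clamp a requested temperature into the suggested range for the task."""
    low, high = TEMPERATURE_RANGES[task]
    return min(max(requested, low), high)


print(clamp_temperature("code", 0.9))          # pulled down to the code range
print(clamp_temperature("creative", 0.5))      # pulled up to the creative range
print(clamp_temperature("conversation", 0.6))  # already inside the range
```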

Issue 6: Slow Response Under Multi-User Concurrency

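# Ollama: set the number of parallel requests
OLLAMA_NUM_PARALLEL=4 ollama serve

# vLLM handles concurrent requests efficiently out of the box.
# Adding GPUs raises the capacity further; --max-num-seqs caps concurrent sequences.
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --tensor-parallel-size 2 \
  --max-num-seqs 32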
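
Before and after changing these settings, it helps to measure aggregate throughput under concurrent load. The sketch below is a minimal load-test harness built on a thread pool. It takes any send_request callable that sends one prompt and returns the number of generated tokens, so it works with either server. The names are made up for this example, and fake_request stands in for a real client call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send_request, prompts, max_workers: int = 4) -> dict:
    """Send all prompts with up to max_workers in flight; report throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        token_counts = list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    total_tokens = sum(token_counts)
    return {
        "requests": len(token_counts),
        "total_tokens": total_tokens,
        "elapsed_s": elapsed,
        "tokens_per_s": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


def fake_request(prompt: str) -> int:
    """Stand-in for a real API call: sleeps briefly, returns a token count."""
    time.sleep(0.01)
    return len(prompt.split())


result = run_load_test(fake_request, ["explain quicksort in Python"] * 8, max_workers=4)
print(result)
```

To test a real server, replace fake_request with a function that calls the chat endpoint and returns the completion token count from the response's usage data.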

Summary

| Method | Best For | Difficulty | Performance | Rating |
|---|---|---|---|---|
| Ollama | Individual developers, beginners | Easy | Good | Highly Recommended |
| vLLM | Production, high concurrency | Medium | Excellent | Recommended |
| Docker | Team collaboration, standardized deployment | Easy-Medium | Very Good | Recommended |

Recommended path for beginners: Start with Ollama for quick experimentation, migrate to vLLM when you need better performance, and use Docker for standardized deployments.

Deploying DeepSeek models locally isn't complicated. Choose the right approach, and you can have a powerful AI model running on your device in minutes. Start your local AI journey today!