DeepSeek Local Deployment Complete Guide: From Beginner to Expert
With the open-source release of DeepSeek models, more developers and enterprises are looking to run these powerful AI models in their local environments. This guide walks you through three mainstream local deployment methods from scratch, helping you choose the best approach for your use case.
Why Deploy Locally?
Before committing to local deployment, let's understand its core advantages:
Data Privacy & Security
Local deployment means all your data (prompts, conversations, business documents) never leaves your device. For industries that handle sensitive information, such as finance, healthcare, and legal, this is often the simplest route to compliance: there is no data in transit to leak and no third-party data processing agreement to depend on.
Ultra-Low Latency
Local inference eliminates network round-trip latency. API calls typically incur 200-500ms of network overhead before the first token arrives, while a local model starts responding as soon as it has processed the prompt. For real-time applications like code completion and conversational assistants, this difference is significant.
Long-Term Cost Advantage
While the initial hardware investment is substantial, local deployment costs far less than API calls for high-frequency usage scenarios over time. Here's a comparison for 1 million tokens per day:
| Solution | Monthly Cost | Annual Cost |
|---|---|---|
| DeepSeek API Calls | ~$300 | ~$3,600 |
| Local (RTX 4090) | ~$15 (electricity) | ~$180 + one-time hardware |
| Local (Mac Studio M4 Ultra) | ~$8 (electricity) | ~$96 + one-time hardware |
Offline Availability
Local deployment lets you use AI capabilities without an internet connection — on airplanes, in remote areas, or within air-gapped networks.
Hardware Requirements
Different model sizes have different hardware demands. Here are detailed recommended configurations:
NVIDIA GPUs
NVIDIA GPUs offer the most mature local deployment ecosystem with excellent CUDA support and compatibility.
| Model | Min VRAM | Recommended VRAM | Recommended GPU |
|---|---|---|---|
| DeepSeek-R1-1.5B (4-bit) | 2GB | 4GB | RTX 3060 |
| DeepSeek-R1-7B (4-bit) | 6GB | 8GB | RTX 4060 |
| DeepSeek-R1-8B (4-bit) | 6GB | 8GB | RTX 4070 |
| DeepSeek-R1-14B (4-bit) | 10GB | 12GB | RTX 4070 Ti |
| DeepSeek-R1-32B (4-bit) | 20GB | 24GB | RTX 4090 |
| DeepSeek-R1-70B (4-bit) | 40GB | 48GB | A6000 / 2x RTX 4090 |
| DeepSeek-V3 (671B, 4-bit) | ~350GB | 400GB+ | 8x A100 80GB |
AMD GPUs
AMD GPUs support large model inference through ROCm, with compatibility continuously improving.
| Recommended GPU | VRAM | Suitable Models |
|---|---|---|
| RX 7900 XTX | 24GB | 7B-14B |
| MI250X | 128GB | 70B |
| MI300X | 192GB | 70B on one card; V3 (671B) only across a multi-GPU node |
Apple Silicon
Apple Silicon's unified memory architecture offers a unique advantage for LLM inference — it can use system memory (up to 512GB) to load models.
| Chip | Unified Memory | Suitable Models | Expected Speed |
|---|---|---|---|
| M2/M3 Pro | 18-36GB | 7B-14B | 10-20 tokens/s |
| M2/M3 Max | 32-96GB | 14B-32B | 15-25 tokens/s |
| M4 Pro | 24-48GB | 14B-32B | 20-35 tokens/s |
| M4 Max | 36-128GB | 32B-70B | 25-40 tokens/s |
| M4 Ultra | 192-512GB | 70B-V3 Full | 30-50 tokens/s |
RAM Requirements
Even with GPU inference, sufficient system RAM is important for model loading and context management:
- 7B models: Minimum 16GB, recommended 32GB
- 14B-32B models: Minimum 32GB, recommended 64GB
- 70B models: Minimum 64GB, recommended 128GB
Method 1: Ollama Deployment (Simplest)
Ollama is currently the simplest tool for local LLM deployment, offering one-click installation and single-command model execution.
Installing Ollama
macOS:
```bash
# Install with Homebrew
brew install ollama
```
Linux:
```bash
# One-line install script
curl -fsSL https://ollama.com/install.sh | sh
```
Windows:
Download the installer from ollama.com/download and run it.
Download and Run DeepSeek Models
```bash
# Run DeepSeek-R1 7B (recommended starting point)
ollama run deepseek-r1:7b

# Run DeepSeek-R1 14B
ollama run deepseek-r1:14b

# Run DeepSeek-R1 32B (needs 24GB+ VRAM)
ollama run deepseek-r1:32b

# Run DeepSeek-R1 70B (needs 48GB+ VRAM or a high-memory Mac)
ollama run deepseek-r1:70b
```
Using the Ollama API
Ollama serves its API at localhost:11434 by default and exposes OpenAI-compatible endpoints under /v1:
```python
import openai

# Create a client that points at the local Ollama service
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",  # local Ollama address
    api_key="ollama",  # Ollama does not need a real API key
)

# Send a chat request
response = client.chat.completions.create(
    model="deepseek-r1:7b",  # model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the Transformer architecture."},
    ],
    temperature=0.7,  # controls output randomness
    max_tokens=2048,  # maximum output length
)

# Print the reply
print(response.choices[0].message.content)
```
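Before sending chat requests, it can help to confirm which models the local server has actually downloaded. The sketch below queries Ollama's native `/api/tags` endpoint using only the standard library. The helper names `model_names` and `fetch_model_names` are illustrative, and the default URL assumes the server is running on the standard port.

```python
import json
from urllib.request import urlopen


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_model_names(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has available locally."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Calling `fetch_model_names()` against a running server should return entries such as `deepseek-r1:7b` once that model has been pulled.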
Common Ollama Commands
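```bash
# List downloaded models
ollama list

# Show model details
ollama show deepseek-r1:7b

# Delete a model to free disk space
ollama rm deepseek-r1:7b

# Start the Ollama server (it stays attached to this terminal; append & to background it)
ollama serve

# Copy a model under a new name
ollama cp deepseek-r1:7b my-deepseek

# Show models that are currently loaded
ollama ps
```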
Custom Modelfile
You can customize model behavior with a Modelfile:
```
# Save this as a file named Modelfile
FROM deepseek-r1:7b

# Set the system prompt
SYSTEM """You are a professional programming assistant skilled in Python and JavaScript."""

# Tune model parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
```
```bash
# Build a custom model from the Modelfile
ollama create my-coding-assistant -f Modelfile

# Run the custom model
ollama run my-coding-assistant
```
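If you maintain several assistants that differ only in prompt and parameters, generating the Modelfile text from code may be easier than editing files by hand. This is a minimal sketch that covers only the three directives used above (`FROM`, `SYSTEM`, `PARAMETER`). The function name `render_modelfile` is hypothetical, and it assumes the system prompt contains no triple quotes.

```python
def render_modelfile(base: str, system: str, params: dict[str, object]) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    # One PARAMETER line per setting, in the order given
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "deepseek-r1:7b",
    "You are a professional programming assistant.",
    {"temperature": 0.3, "top_p": 0.9, "num_ctx": 8192},
)
print(text)
```

Write the returned text to a file named Modelfile, then pass it to `ollama create` as usual.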
Method 2: vLLM Deployment (High-Performance Inference)
vLLM is a high-performance LLM inference and serving framework that achieves efficient memory management through PagedAttention technology. It's particularly suitable for production environments and high-concurrency scenarios.
Installing vLLM
```bash
# Create a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM (requires an NVIDIA GPU and CUDA 12.1+)
pip install vllm
```
Starting the vLLM Inference Server
```bash
# Start the OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --dtype auto \
  --trust-remote-code
```
Advanced vLLM Configuration
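```bash
# Multi-GPU tensor parallelism (for larger models)
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

# Use a quantized model to reduce VRAM requirements.
# --quantization awq expects an AWQ-quantized checkpoint; the path below is a
# placeholder for one (local directory or Hub repo), not the unquantized
# deepseek-ai repository.
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/DeepSeek-R1-Distill-Qwen-7B-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
```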
Calling the vLLM API
```python
import openai

# Connect to the local vLLM server
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM address
    api_key="not-needed",  # no API key required for a local deployment
)

# Streaming example
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "user", "content": "Write a quicksort algorithm in Python"}
    ],
    stream=True,  # enable streaming output
    temperature=0.3,
)

# Print each streamed piece as it arrives
for chunk in stream:
    # Skip chunks that carry no choices or no text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
vLLM vs Ollama Comparison
| Feature | Ollama | vLLM |
|---|---|---|
| Installation Difficulty | Very Easy | Medium |
| Performance | Good | Excellent (20-50% higher throughput) |
| Concurrency Support | Basic | Excellent (Production-grade) |
| Memory Efficiency | Average | Excellent (PagedAttention) |
| Apple Silicon | Full Support | Not Supported |
| Best For | Personal use, development | Production, high concurrency |
Method 3: Docker Deployment
Docker deployment provides excellent environment isolation and portability, making it ideal for team collaboration and production deployments.
Using the Ollama Docker Image
```bash
# Pull the official Ollama Docker image
docker pull ollama/ollama

# Option A: run on CPU
docker run -d \
  --name ollama \
  -v ollama_data:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# Option B: run with NVIDIA GPUs (requires nvidia-container-toolkit).
# Both options publish port 11434, so start only one of them.
docker run -d \
  --name ollama-gpu \
  --gpus all \
  -v ollama_data:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# Download and run a model inside the container
docker exec -it ollama-gpu ollama run deepseek-r1:7b
```
Docker Compose Orchestration
Create a docker-compose.yml file:
```yaml
version: '3.8'

services:
  # Ollama inference service
  ollama:
    image: ollama/ollama:latest
    container_name: deepseek-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama  # persist model data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all  # use all available GPUs
              capabilities: [gpu]
    restart: unless-stopped

  # Open WebUI: browser-based chat interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: deepseek-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434  # connect to the Ollama service
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:  # model storage volume
  webui_data:   # WebUI data volume
```
```bash
# Start all services
docker compose up -d

# Check service status
docker compose ps

# Follow the Ollama logs
docker compose logs -f ollama

# Stop the services
docker compose down
```
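`docker compose up -d` returns before the services are necessarily ready to answer requests, so scripts that pull a model or send prompts right away may fail. A small polling helper can bridge that gap. The sketch below is illustrative: `ollama_is_up` and `wait_until_ready` are hypothetical names, and it assumes the Ollama port is published on localhost:11434 as in the compose file above.

```python
import time
from typing import Callable
from urllib.request import urlopen


def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the server's root endpoint answers with HTTP 200."""
    try:
        with urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, refused connections, and timeouts
        return False


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 60.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A deployment script might call `wait_until_ready(ollama_is_up)` and only continue when it returns `True`.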
vLLM Docker Deployment
```bash
# Use the official vLLM Docker image.
# --ipc=host lets the container use the host's shared memory, which vLLM relies on.
docker run -d \
  --name vllm-deepseek \
  --gpus all \
  --ipc=host \
  -v huggingface_cache:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```
Choosing Quantization Versions
Quantization is a key technique for reducing model size and memory requirements. Different quantization levels offer different trade-offs between quality and resource consumption.
Quantization Precision Comparison
| Precision | Model Size (7B) | VRAM Usage | Quality Loss | Inference Speed | Best For |
|---|---|---|---|---|---|
| FP16 (Original) | ~14GB | ~16GB | None | Baseline | Quality-first, ample VRAM |
| 8-bit (INT8) | ~7GB | ~9GB | Minimal | +10-20% | Balanced choice |
| 4-bit (Q4_K_M) | ~4GB | ~6GB | Small | +30-50% | Recommended for limited VRAM |
| 4-bit (Q4_0) | ~3.8GB | ~5.5GB | Small | +40-60% | Extreme VRAM constraints |
| 3-bit | ~2.8GB | ~4.5GB | Noticeable | +50-70% | Not recommended |
| 2-bit | ~2GB | ~3.5GB | Severe | +60-80% | Testing only |
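The "Model Size" column follows from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below reproduces that estimate. The function name is illustrative, and it counts weights only; the KV cache and runtime overhead explain why the VRAM column is higher.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"7B at {label}: ~{weight_size_gb(7, bits):.1f} GB")

# The same arithmetic for a 671B-parameter model at 4 bits per weight
print(f"671B at 4-bit: ~{weight_size_gb(671, 4):.1f} GB")
```

Formats such as Q4_K_M spend slightly more than 4 bits per weight on average, which is why the table lists about 4GB rather than the flat 3.5GB this formula gives.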
How to Choose?
Recommended Strategy:
- VRAM >= 1.2x model FP16 size — Use FP16 for best quality
- VRAM tight but > INT8 model size — Use 8-bit quantization
- Limited VRAM — Use 4-bit quantization (Q4_K_M), the best value choice
- Extreme scenarios — Use Q4_0, accept slight quality degradation
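The strategy above can be sketched as a small selection function. This is one possible reading, not a definitive rule: the list states the 1.2x headroom factor only for FP16, and applying the same factor to the 8-bit and 4-bit levels is an assumption made here. The function name `pick_quantization` is hypothetical.

```python
HEADROOM = 1.2  # factor from the FP16 rule; assumed for the other levels too


def pick_quantization(vram_gb: float, params_billion: float) -> str:
    """Pick the highest-precision format whose weights fit with headroom."""
    fp16_gb = params_billion * 2.0   # 16 bits per weight
    int8_gb = params_billion * 1.0   # 8 bits per weight
    q4_gb = params_billion * 0.5     # about 4 bits per weight
    if vram_gb >= HEADROOM * fp16_gb:
        return "FP16"
    if vram_gb >= HEADROOM * int8_gb:
        return "INT8"
    if vram_gb >= HEADROOM * q4_gb:
        return "Q4_K_M"
    return "Q4_0"  # last resort when even Q4_K_M is tight


for vram in (24, 12, 6, 4):
    print(f"{vram}GB VRAM, 7B model -> {pick_quantization(vram, 7)}")
```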
Quantization in Ollama
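```bash
# Ollama uses Q4_K_M quantization by default, which suits most scenarios
ollama run deepseek-r1:7b

# Select a specific quantization by tag.
# Confirm the exact tag names on the model's Tags page at ollama.com before pulling.
ollama run deepseek-r1:7b-q8_0    # 8-bit quantization
ollama run deepseek-r1:7b-q4_K_M  # 4-bit quantization (default)
ollama run deepseek-r1:7b-fp16    # FP16, original precision
```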
Performance Benchmarks
Below are real-world performance measurements for running DeepSeek models on different hardware configurations (tokens/s, generation speed):
DeepSeek-R1-7B (4-bit Quantization)
| Hardware | First Token Latency | Generation Speed | Notes |
|---|---|---|---|
| RTX 3060 12GB | ~150ms | 35-45 tokens/s | Entry-level GPU |
| RTX 4060 8GB | ~120ms | 45-55 tokens/s | Best value |
| RTX 4070 Ti 12GB | ~80ms | 60-75 tokens/s | Recommended |
| RTX 4090 24GB | ~50ms | 90-110 tokens/s | Top performance |
| M3 Pro 18GB | ~200ms | 18-25 tokens/s | MacBook Pro |
| M4 Pro 24GB | ~150ms | 28-35 tokens/s | Latest Mac |
| M4 Max 48GB | ~100ms | 35-45 tokens/s | High-end Mac |
DeepSeek-R1-32B (4-bit Quantization)
| Hardware | First Token Latency | Generation Speed | Notes |
|---|---|---|---|
| RTX 4090 24GB | ~200ms | 25-35 tokens/s | Just fits |
| A6000 48GB | ~150ms | 35-45 tokens/s | Professional GPU |
| 2x RTX 4090 | ~180ms | 40-55 tokens/s | Dual GPU parallel |
| M4 Max 64GB | ~300ms | 18-25 tokens/s | Unified memory advantage |
| M4 Ultra 192GB | ~200ms | 30-40 tokens/s | Most powerful Mac |
DeepSeek-R1-70B (4-bit Quantization)
| Hardware | First Token Latency | Generation Speed | Notes |
|---|---|---|---|
| 2x RTX 4090 48GB | ~500ms | 12-18 tokens/s | Barely fits |
| A100 80GB | ~300ms | 25-35 tokens/s | Data center grade |
| 2x A100 80GB | ~200ms | 40-55 tokens/s | High concurrency recommended |
| M4 Ultra 192GB | ~400ms | 15-22 tokens/s | One Mac running 70B |
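To check figures like these on your own hardware, you can time a streaming response. The sketch below measures time to the first streamed piece and the rate of the pieces that follow. It counts streamed text pieces, which only approximates tokens, and the function name `measure_stream` is illustrative. The `clock` parameter exists so the function can be tested without real delays.

```python
import time
from typing import Callable, Iterable


def measure_stream(pieces: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time an iterable of streamed text pieces."""
    start = clock()
    first = None
    last = start
    count = 0
    for piece in pieces:
        if not piece:
            continue  # ignore empty pieces
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        return {"first_token_s": None, "pieces_per_s": 0.0, "pieces": 0}
    generation_time = last - first
    # Rate is measured between the first and last piece
    speed = (count - 1) / generation_time if generation_time > 0 else 0.0
    return {"first_token_s": first - start, "pieces_per_s": speed, "pieces": count}
```

With the OpenAI client, one way to feed it is `measure_stream(c.choices[0].delta.content or "" for c in stream if c.choices)`.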
Performance on Apple Silicon (M4 Ultra)
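The Apple M4 Ultra is currently one of the most powerful local inference platforms available to individual users. With 192GB of unified memory, it runs 70B-class models comfortably. The full 671B-parameter DeepSeek-V3 is a different matter: at 4 bits per weight its weights alone come to roughly 335GB, so it only fits in 192GB at a much more aggressive quantization.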
Unique Advantages of M4 Ultra
- Unified Memory Architecture: CPU and GPU share memory with no data copying, enabling highly efficient model loading
- Massive Memory Bandwidth: M4 Ultra delivers up to 819.2 GB/s memory bandwidth, significantly boosting inference speed
- Exceptional Power Efficiency: Total system power consumption of only 60-150W, far lower than NVIDIA GPU solutions
- Silent Operation: Mac Studio runs nearly silent, perfect for office and home environments
- Works Out of the Box: Ollama natively supports Metal with no CUDA configuration needed
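Memory bandwidth matters because generating each token requires reading roughly all of the model's weights once. A common rule of thumb, used here as an assumption rather than a measured law, is that generation speed for a dense model cannot exceed bandwidth divided by model size in memory. The sketch applies it to the 819.2 GB/s figure above and to the memory footprints reported in this guide's M4 Ultra benchmark (about 5GB, 20GB, and 42GB).

```python
def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on generation speed for a bandwidth-bound model."""
    return bandwidth_gb_s / model_gb


for name, size_gb in [("7B", 5), ("32B", 20), ("70B", 42)]:
    bound = max_tokens_per_s(819.2, size_gb)
    print(f"{name} (~{size_gb}GB): at most ~{bound:.0f} tokens/s")
```

Large models tend to land close to this ceiling, while small models fall well short of it because other costs dominate.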
M4 Ultra Benchmark Results
```text
Test environment: Mac Studio M4 Ultra, 192GB unified memory, macOS 15.4

DeepSeek-R1-7B (Q4_K_M):
├── Load time: 1.2s
├── First token: ~80ms
├── Generation speed: 42 tokens/s
└── Memory usage: ~5GB

DeepSeek-R1-32B (Q4_K_M):
├── Load time: 8.5s
├── First token: ~200ms
├── Generation speed: 32 tokens/s
└── Memory usage: ~20GB

DeepSeek-R1-70B (Q4_K_M):
├── Load time: 25s
├── First token: ~400ms
├── Generation speed: 18 tokens/s
└── Memory usage: ~42GB

DeepSeek-V3-671B (experimental, quantized well below 4-bit to fit in memory):
├── Load time: ~5min
├── First token: ~3s
├── Generation speed: 2-4 tokens/s
└── Memory usage: ~170GB
```
M4 Ultra Deployment Recommendations
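```bash
# Install Ollama (Metal acceleration is supported natively)
brew install ollama

# Run the recommended 32B model (the best balance point on M4 Ultra)
ollama run deepseek-r1:32b

# With 192GB of memory, you can try 70B
ollama run deepseek-r1:70b

# Allow parallel requests to make full use of the M4 Ultra
OLLAMA_NUM_PARALLEL=4 ollama serve
```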
Cost Comparison with API Calls
Scenario 1: Individual Developer (~50K tokens/day)
| Solution | Monthly Cost | Annual Cost | Notes |
|---|---|---|---|
| DeepSeek API | ~$22 | ~$264 | Pay-as-you-go, flexible |
| Ollama + RTX 4060 | ~$5 (electricity) | $60 + $300 (hardware) | Year 1: $360, then $60/year |
| Ollama + M4 Pro Mac | ~$3 (electricity) | $36 + $2,399 (hardware) | Break-even takes roughly 10 years on API savings alone |
Conclusion: For light individual use, API is more cost-effective.
Scenario 2: Small Team (~500K tokens/day)
| Solution | Monthly Cost | Annual Cost | Notes |
|---|---|---|---|
| DeepSeek API | ~$220 | ~$2,640 | Stable, no maintenance |
| vLLM + RTX 4090 | ~$15 (electricity) | $180 + $1,600 (hardware) | ROI within 1 year |
| vLLM + A6000 | ~$20 (electricity) | $240 + $4,500 (hardware) | Larger models, better concurrency |
Conclusion: At this volume, the RTX 4090 setup recovers its hardware cost in about 8 months; the A6000 setup takes closer to 2 years.
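The payback period behind these conclusions is the hardware price divided by the monthly saving over the API. The sketch below works through the Scenario 2 figures from the table. The function name `break_even_months` is illustrative, and the calculation ignores maintenance time, depreciation, and resale value.

```python
from typing import Optional


def break_even_months(hardware_usd: float, api_monthly_usd: float,
                      local_monthly_usd: float) -> Optional[float]:
    """Months until hardware cost is offset by savings; None if it never is."""
    saving = api_monthly_usd - local_monthly_usd
    if saving <= 0:
        return None
    return hardware_usd / saving


print(f"RTX 4090: {break_even_months(1600, 220, 15):.1f} months")
print(f"A6000:    {break_even_months(4500, 220, 20):.1f} months")
```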
Scenario 3: Enterprise (~5M tokens/day)
| Solution | Monthly Cost | Annual Cost | Notes |
|---|---|---|---|
| DeepSeek API | ~$2,200 | ~$26,400 | Possible rate limits |
| vLLM + 4x A100 | ~$200 (electricity) | $2,400 + $60,000 (hardware) | ROI in about 2.5 years, full control |
| Cloud GPU (on-demand) | ~$3,000 | ~$36,000 | Flexible, no hardware maintenance |
Conclusion: For enterprise-level high-frequency use, self-hosted inference clusters are most cost-effective long-term.
Cost Decision Tree
```text
What is your daily token usage?
├── < 10K tokens → Use API, local deployment not worth it
├── 10K-100K tokens → Depends on privacy needs
│   ├── Need privacy → Local deployment (Ollama + consumer GPU)
│   └── No privacy concerns → API is more convenient
├── 100K-1M tokens → Local deployment starts having cost advantages
│   ├── Individual/Small team → Ollama + RTX 4090
│   └── Need high concurrency → vLLM + professional GPU
└── > 1M tokens → Strongly recommend local deployment
    ├── Medium budget → vLLM + multi-GPU consumer setup
    └── Ample budget → vLLM + A100/H100 cluster
```
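The same decision tree can be written as a function, which is handy for a sizing script or an internal calculator. This is a direct transcription of the branches above. The function and parameter names are illustrative, and the handling of exact boundary values (for example, exactly 100K tokens) is an assumption, since the tree does not specify it.

```python
def recommend(daily_tokens: int, needs_privacy: bool = False,
              high_concurrency: bool = False, ample_budget: bool = False) -> str:
    """Map daily token usage and constraints to a deployment recommendation."""
    if daily_tokens < 10_000:
        return "API"
    if daily_tokens < 100_000:
        return "Ollama + consumer GPU" if needs_privacy else "API"
    if daily_tokens <= 1_000_000:
        return "vLLM + professional GPU" if high_concurrency else "Ollama + RTX 4090"
    return "vLLM + A100/H100 cluster" if ample_budget else "vLLM + multi-GPU consumer setup"


print(recommend(50_000, needs_privacy=True))
print(recommend(500_000))
print(recommend(5_000_000, ample_budget=True))
```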
Troubleshooting Common Issues
Issue 1: Slow Model Downloads
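```bash
# OLLAMA_HOST sets the address of the Ollama server itself, not a download
# mirror, so changing it will not speed up pulls. If you reach the registry
# through a proxy, set HTTPS_PROXY for the server process instead
# (proxy.example.com is a placeholder).
HTTPS_PROXY=https://proxy.example.com ollama serve

# Or download the model files manually and import them
ollama create deepseek-r1:7b -f /path/to/Modelfile
```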
If you already have GGUF model files (for example, downloaded from Hugging Face), you can point the FROM line of a Modelfile at the local file path.
Issue 2: CUDA Out of Memory (OOM)
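```bash
# Lower GPU memory utilization (from 0.9 to 0.8) and reduce the context length.
# Comments cannot follow a trailing backslash, so they are kept up here.
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --gpu-memory-utilization 0.8 \
  --max-model-len 4096

# Or use a more aggressive quantization (Q4_0 instead of Q4_K_M),
# if that tag is published for the model
ollama run deepseek-r1:7b-q4_0
```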
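Reducing the context length helps because the KV cache for a sequence grows linearly with the number of tokens it may hold. The sketch below estimates that cache size. The layer count, KV head count, and head dimension in the example are assumed values for a Qwen2.5-7B-style architecture; check the model's `config.json` before relying on them. The function name is illustrative.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed architecture values (verify against the model's config.json)
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128

for context in (8192, 4096):
    size_gib = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"context {context}: ~{size_gib:.2f} GiB per sequence at FP16")
```

Halving the context halves the per-sequence cache, and the saving multiplies with the number of sequences served concurrently.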
Issue 3: Slow Performance on Apple Silicon
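```bash
# Metal acceleration is enabled by default in Ollama.
# Print timing statistics to see the actual generation speed
ollama run deepseek-r1:7b --verbose

# Check that the loaded model is running on the GPU (see the PROCESSOR column)
ollama ps

# Close other memory-hungry applications to free unified memory for the model,
# and check memory pressure in Activity Monitor.

# Increase the number of layers Ollama offloads to the GPU
OLLAMA_NUM_GPU=999 ollama run deepseek-r1:7b
```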
Issue 4: Docker Container Cannot Access GPU
```bash
# Install the NVIDIA Container Toolkit
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Verify that the GPU is visible inside a container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If it still fails, configure the Docker runtime and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
Issue 5: Poor Model Output Quality
- Check quantization precision: If using too-low quantization (e.g., 2-bit), quality degrades noticeably. Use at least Q4_K_M
- Adjust temperature: Use 0.1-0.3 for code tasks, 0.5-0.7 for conversation, 0.8-1.0 for creative writing
- Review system prompt: Ensure your system prompt is clear and specific
- Increase context length: Some tasks require a longer context window
Issue 6: Slow Response Under Multi-User Concurrency
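```bash
# Ollama: set the number of parallel requests
OLLAMA_NUM_PARALLEL=4 ollama serve

# vLLM handles concurrency efficiently out of the box.
# Adding GPUs increases the concurrency it can sustain.
# --max-num-seqs sets the maximum number of concurrent sequences.
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --tensor-parallel-size 2 \
  --max-num-seqs 32
```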
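Server-side concurrency only helps if clients actually send requests in parallel. The sketch below fans prompts out over a thread pool and returns replies in the original order. The `ask` callable is a stand-in for whatever function sends one prompt to your server (for example, a wrapper around `client.chat.completions.create`), and the name `ask_many` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def ask_many(ask: Callable[[str], str], prompts: Sequence[str],
             max_workers: int = 4) -> list[str]:
    """Send prompts concurrently and return replies in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))


# Stand-in for a real request function, used here only for demonstration
replies = ask_many(str.upper, ["first prompt", "second prompt"])
print(replies)
```

Matching `max_workers` to the server's parallel limit (4 in the Ollama example above) is a reasonable starting point; more workers than that mostly adds queueing.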
Summary
| Method | Best For | Difficulty | Performance | Rating |
|---|---|---|---|---|
| Ollama | Individual developers, beginners | Easy | Good | Highly Recommended |
| vLLM | Production, high concurrency | Medium | Excellent | Recommended |
| Docker | Team collaboration, standardized deployment | Easy-Medium | Very Good | Recommended |
Recommended path for beginners: Start with Ollama for quick experimentation, migrate to vLLM when you need better performance, and use Docker for standardized deployments.
Deploying DeepSeek models locally isn't complicated. Choose the right approach, and you can have a powerful AI model running on your device in minutes. Start your local AI journey today!