DeepSeek Local Deployment Complete Guide: From Beginner to Expert

With the open-source release of DeepSeek models, more developers and enterprises are looking to run these powerful AI models in their local environments. This guide walks you through three mainstream local deployment methods from scratch, helping you choose the best approach for your use case.

Why Deploy Locally?

Before committing to local deployment, let's understand its core advantages:

Data Privacy & Security

Local deployment means your data (prompts, conversations, business documents) never leaves your device. For industries that handle sensitive information, such as finance, healthcare, and legal, this simplifies compliance: there is no data in transit to leak and no third-party data processing agreement to depend on.

Ultra-Low Latency

Local inference eliminates the network round trip. API calls typically incur 200-500ms of network overhead before the first token arrives; local inference removes that overhead, leaving only the model's own time to first token. For real-time applications like code completion and conversational assistants, this difference is significant.

Long-Term Cost Advantage

While the initial hardware investment is substantial, local deployment costs far less than API calls for high-frequency usage scenarios over time. Here's a comparison for 1 million tokens per day:

Solution	Monthly Cost	Annual Cost
DeepSeek API Calls	~$300	~$3,600
Local (RTX 4090)	~$15 (electricity)	~$180 + one-time hardware
Local (Mac Studio M4 Ultra)	~$8 (electricity)	~$96 + one-time hardware

Offline Availability

Local deployment lets you use AI capabilities without an internet connection — on airplanes, in remote areas, or within air-gapped networks.

Hardware Requirements

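Different model sizes have different hardware demands. The 1.5B-70B entries below are the distilled DeepSeek-R1 variants (built on Qwen and Llama base models); only the full DeepSeek-V3/R1 checkpoint has 671B parameters. Here are detailed recommended configurations: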

NVIDIA GPUs

NVIDIA GPUs offer the most mature local deployment ecosystem with excellent CUDA support and compatibility.

Model	Min VRAM	Recommended VRAM	Recommended GPU
DeepSeek-R1-1.5B (4-bit)	2GB	4GB	RTX 3060
DeepSeek-R1-7B (4-bit)	6GB	8GB	RTX 4060
DeepSeek-R1-8B (4-bit)	6GB	8GB	RTX 4070
DeepSeek-R1-14B (4-bit)	10GB	12GB	RTX 4070 Ti
DeepSeek-R1-32B (4-bit)	20GB	24GB	RTX 4090
DeepSeek-R1-70B (4-bit)	40GB	48GB	A6000 / 2x RTX 4090
DeepSeek-V3 (4-bit)	~340GB	400GB+	8x A100 80GB
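
The VRAM columns follow from simple arithmetic: the weights alone take roughly (parameter count × bits per weight ÷ 8) bytes, and the KV cache and runtime buffers come on top. The sketch below is a back-of-the-envelope estimate only; the function names and the flat 20% overhead factor are assumptions for illustration, not measured values, which is why the table's figures sit somewhat above what it prints.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def estimated_vram_gb(params_billion: float, bits_per_weight: float,
                      overhead: float = 0.2) -> float:
    """Weights plus a flat overhead fraction for KV cache and runtime buffers.

    The 20% default is an illustrative assumption, not a measured value.
    """
    return weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead)


# Print a rough estimate for the distilled model sizes at 4-bit precision.
for size in (7, 14, 32, 70):
    print(f"{size}B @ 4-bit: weights ~{weight_memory_gb(size, 4):.1f} GB, "
          f"with overhead ~{estimated_vram_gb(size, 4):.1f} GB")
```

Longer contexts and concurrent requests grow the KV cache, so treat the result as a floor rather than a guarantee.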

AMD GPUs

AMD GPUs support large model inference through ROCm, with compatibility continuously improving.

Recommended GPU	VRAM	Suitable Models
RX 7900 XTX	24GB	7B-14B
MI250X	128GB	70B
MI300X	192GB	70B (V3 at 4-bit needs several cards)

Apple Silicon

Apple Silicon's unified memory architecture offers a unique advantage for LLM inference — it can use system memory (up to 512GB) to load models.

Chip	Unified Memory	Suitable Models	Expected Speed
M2/M3 Pro	18-36GB	7B-14B	10-20 tokens/s
M2/M3 Max	32-96GB	14B-32B	15-25 tokens/s
M4 Pro	24-48GB	14B-32B	20-35 tokens/s
M4 Max	36-128GB	32B-70B	25-40 tokens/s
M4 Ultra	192-512GB	70B-V3 Full	30-50 tokens/s

RAM Requirements

Even with GPU inference, sufficient system RAM is important for model loading and context management:

7B models: Minimum 16GB, recommended 32GB
14B-32B models: Minimum 32GB, recommended 64GB
70B models: Minimum 64GB, recommended 128GB

Method 1: Ollama Deployment (Simplest)

Ollama is currently the simplest tool for local LLM deployment, offering one-click installation and single-command model execution.

Installing Ollama

macOS:

# Install with Homebrew
brew install ollama

Linux:

# One-line install script
curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com/download and run it.

Download and Run DeepSeek Models

# Run DeepSeek-R1 7B (recommended starting point)
ollama run deepseek-r1:7b

# Run DeepSeek-R1 14B
ollama run deepseek-r1:14b

# Run DeepSeek-R1 32B (needs 24GB+ VRAM)
ollama run deepseek-r1:32b

# Run DeepSeek-R1 70B (needs 48GB+ VRAM or a high-memory Mac)
ollama run deepseek-r1:70b

Using the Ollama API

By default Ollama serves its API at localhost:11434. Alongside its native endpoints it exposes an OpenAI-compatible API under /v1, so the official openai Python client works against it:

import openai

# Create a client that points at the local Ollama server
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's local OpenAI-compatible endpoint
    api_key="ollama"  # Ollama does not need a real API key
)

# Send a chat request
response = client.chat.completions.create(
    model="deepseek-r1:7b",  # Model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the Transformer architecture."}
    ],
    temperature=0.7,  # Controls output randomness
    max_tokens=2048   # Maximum output length in tokens
)

# Print the reply
print(response.choices[0].message.content)

Common Ollama Commands

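# List downloaded models
ollama list

# Show detailed model information
ollama show deepseek-r1:7b

# Delete a model to free disk space
ollama rm deepseek-r1:7b

# Start the Ollama server (runs in the foreground)
ollama serve

# Copy a model under a new name
ollama cp deepseek-r1:7b my-deepseek

# Show models that are currently loaded
ollama ps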
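
The information that `ollama list` prints is also available over HTTP, which is handy for scripts that need to check whether a model has been pulled. The sketch below assumes a server at the default address and uses Ollama's `GET /api/tags` endpoint; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import List


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def fetch_model_names(base_url: str = "http://localhost:11434") -> List[str]:
    """Query a running Ollama server for its locally available models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Usage (requires a running Ollama server):
#     print(fetch_model_names())
```

Keeping the parsing in its own function makes it easy to exercise without a live server.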

Custom Modelfile

You can customize model behavior with a Modelfile:

# Save the following as a file named Modelfile
FROM deepseek-r1:7b

# Set the system prompt
SYSTEM """You are a professional programming assistant skilled in Python and JavaScript."""

# Adjust model parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

# Create a custom model from the Modelfile
ollama create my-coding-assistant -f Modelfile

# Run the custom model
ollama run my-coding-assistant

Method 2: vLLM Deployment (High-Performance Inference)

vLLM is a high-performance LLM inference and serving framework that achieves efficient memory management through PagedAttention technology. It's particularly suitable for production environments and high-concurrency scenarios.
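
To make the idea behind PagedAttention concrete: rather than reserving one contiguous KV-cache region sized for the maximum sequence length, the cache is split into fixed-size blocks that are handed out on demand, much like virtual-memory pages. The toy allocator below illustrates only that bookkeeping. It is not vLLM's implementation, and the class name, block size, and methods are all hypothetical.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))  # physical block ids
        self.block_tables: Dict[str, List[int]] = {}           # seq id -> block ids
        self.lengths: Dict[str, int] = {}                      # seq id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

With 16-token blocks, a 17-token sequence holds two blocks rather than a full maximum-length reservation, and those blocks return to the pool as soon as the sequence finishes. That is what lets many concurrent requests share one GPU's memory.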

Installing vLLM

# Create a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM (requires an NVIDIA GPU and CUDA 12.1+)
pip install vllm

Starting the vLLM Inference Server

# Start the OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --dtype auto \
    --trust-remote-code

Advanced vLLM Configuration

# Multi-GPU tensor parallelism (for larger models)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.95 \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000

# Use a quantized model to reduce VRAM requirements
# --quantization awq expects AWQ-quantized weights; the path below is a placeholder
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/DeepSeek-R1-Distill-Qwen-7B-AWQ \
    --quantization awq \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000

Calling the vLLM API

import openai

# Connect to the local vLLM server
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # Local vLLM address
    api_key="not-needed"  # No API key is needed for a local deployment
)

# Streaming example
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "user", "content": "Write a quicksort algorithm in Python"}
    ],
    stream=True,  # Enable streaming output
    temperature=0.3
)

# Print each chunk as it arrives
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

vLLM vs Ollama Comparison

Feature	Ollama	vLLM
Installation Difficulty	Very Easy	Medium
Performance	Good	Excellent (+20-50%)
Concurrency Support	Basic	Excellent (Production-grade)
Memory Efficiency	Average	Excellent (PagedAttention)
Apple Silicon	Full Support	Not Supported
Best For	Personal use, development	Production, high concurrency

Method 3: Docker Deployment

Docker deployment provides excellent environment isolation and portability, making it ideal for team collaboration and production deployments.

Using the Ollama Docker Image

# Pull the official Ollama Docker image
docker pull ollama/ollama

# Run in CPU mode
docker run -d \
    --name ollama \
    -v ollama_data:/root/.ollama \
    -p 11434:11434 \
    ollama/ollama

# Run in NVIDIA GPU mode (requires nvidia-container-toolkit)
# Both containers publish port 11434, so start only one of them at a time
docker run -d \
    --name ollama-gpu \
    --gpus all \
    -v ollama_data:/root/.ollama \
    -p 11434:11434 \
    ollama/ollama

# Download and run a model inside the container
docker exec -it ollama-gpu ollama run deepseek-r1:7b

Docker Compose Orchestration

Create a docker-compose.yml file:

version: '3.8'

services:
  # Ollama inference service
  ollama:
    image: ollama/ollama:latest
    container_name: deepseek-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama   # Persist model data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all            # Use all available GPUs
              capabilities: [gpu]
    restart: unless-stopped

  # Open WebUI: browser-based chat interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: deepseek-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434  # Connect to the Ollama service
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:    # Model storage volume
  webui_data:     # WebUI data volume

# Start all services
docker compose up -d

# Check service status
docker compose ps

# Follow the logs
docker compose logs -f ollama

# Stop the services
docker compose down

vLLM Docker Deployment

# Use the official vLLM Docker image
docker run -d \
    --name vllm-deepseek \
    --gpus all \
    -v huggingface_cache:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

Choosing Quantization Versions

Quantization is a key technique for reducing model size and memory requirements. Different quantization levels offer different trade-offs between quality and resource consumption.

Quantization Precision Comparison

Precision	Model Size (7B)	VRAM Usage	Quality Loss	Inference Speed	Best For
FP16 (Original)	~14GB	~16GB	None	Baseline	Quality-first, ample VRAM
8-bit (INT8)	~7GB	~9GB	Minimal	+10-20%	Balanced choice
4-bit (Q4_K_M)	~4GB	~6GB	Small	+30-50%	Recommended for limited VRAM
4-bit (Q4_0)	~3.8GB	~5.5GB	Small	+40-60%	Extreme VRAM constraints
3-bit	~2.8GB	~4.5GB	Noticeable	+50-70%	Not recommended
2-bit	~2GB	~3.5GB	Severe	+60-80%	Testing only

How to Choose?

Recommended Strategy:

VRAM >= 1.2x model FP16 size — Use FP16 for best quality
VRAM tight but > INT8 model size — Use 8-bit quantization
Limited VRAM — Use 4-bit quantization (Q4_K_M), the best value choice
Extreme scenarios — Use Q4_0, accept slight quality degradation
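
The strategy above can be sketched as a small helper that picks the highest precision fitting a given VRAM budget. The VRAM figures are the 7B numbers from the comparison table; the function name and the simple "first option that fits" rule are assumptions for illustration.

```python
from typing import Optional

# Approximate VRAM usage for a 7B model, from the precision comparison table,
# ordered from highest to lowest precision.
VRAM_NEEDED_GB_7B = [
    ("FP16", 16.0),
    ("INT8", 9.0),
    ("Q4_K_M", 6.0),
    ("Q4_0", 5.5),
]


def pick_quantization(vram_gb: float) -> Optional[str]:
    """Return the highest-precision option that fits, or None if nothing does."""
    for name, needed_gb in VRAM_NEEDED_GB_7B:
        if vram_gb >= needed_gb:
            return name
    return None


# Example budgets: a 24GB, a 12GB, and an 8GB card.
for budget in (24, 12, 8):
    print(f"{budget}GB -> {pick_quantization(budget)}")
```

For other model sizes, scale the table entries with the parameter count before applying the same rule.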

Quantization in Ollama

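# Ollama's default tag uses Q4_K_M quantization, which suits most scenarios
ollama run deepseek-r1:7b

# Select a specific quantization by tag (check the model's tag list on ollama.com for exact names)
ollama run deepseek-r1:7b-qwen-distill-q8_0     # 8-bit quantization
ollama run deepseek-r1:7b-qwen-distill-q4_K_M   # 4-bit quantization (default)
ollama run deepseek-r1:7b-qwen-distill-fp16     # FP16 original precision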

Performance Benchmarks

Below are real-world performance measurements for running DeepSeek models on different hardware configurations (tokens/s, generation speed):

DeepSeek-R1-7B (4-bit Quantization)

Hardware	First Token Latency	Generation Speed	Notes
RTX 3060 12GB	~150ms	35-45 tokens/s	Entry-level GPU
RTX 4060 8GB	~120ms	45-55 tokens/s	Best value
RTX 4070 Ti 12GB	~80ms	60-75 tokens/s	Recommended
RTX 4090 24GB	~50ms	90-110 tokens/s	Top performance
M3 Pro 18GB	~200ms	18-25 tokens/s	MacBook Pro
M4 Pro 24GB	~150ms	28-35 tokens/s	Latest Mac
M4 Max 48GB	~100ms	35-45 tokens/s	High-end Mac

DeepSeek-R1-32B (4-bit Quantization)

Hardware	First Token Latency	Generation Speed	Notes
RTX 4090 24GB	~200ms	25-35 tokens/s	Just fits
A6000 48GB	~150ms	35-45 tokens/s	Professional GPU
2x RTX 4090	~180ms	40-55 tokens/s	Dual GPU parallel
M4 Max 64GB	~300ms	18-25 tokens/s	Unified memory advantage
M4 Ultra 192GB	~200ms	30-40 tokens/s	Most powerful Mac

DeepSeek-R1-70B (4-bit Quantization)

Hardware	First Token Latency	Generation Speed	Notes
2x RTX 4090 48GB	~500ms	12-18 tokens/s	Barely fits
A100 80GB	~300ms	25-35 tokens/s	Data center grade
2x A100 80GB	~200ms	40-55 tokens/s	High concurrency recommended
M4 Ultra 192GB	~400ms	15-22 tokens/s	One Mac running 70B
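
To reproduce the two metrics in these tables on your own hardware, you can time a streamed response. The sketch below is one possible way to do it, not the procedure behind the numbers above: `measure_stream` is a hypothetical helper that accepts any iterable of text chunks, and it counts chunks, which only approximate tokens.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float]:
    """Return (seconds to first chunk, chunks per second after the first).

    Timing starts when this function is called, so the iterable should
    issue its request lazily (for example, inside a generator).
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in chunks:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    elapsed = last - first
    rate = (count - 1) / elapsed if elapsed > 0 else 0.0
    return first - start, rate
```

To use it with the streaming client shown earlier, wrap the request in a generator that creates the stream and yields each non-empty `chunk.choices[0].delta.content`, then pass that generator to `measure_stream`. The injectable `clock` parameter exists so the helper can be checked with fixed timestamps.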

Performance on Apple Silicon (M4 Ultra)

The Apple M4 Ultra is currently one of the most powerful local inference platforms available to individual users. With 192GB of unified memory, it can run 70B-class models and even attempt loading the full DeepSeek-V3.

Unique Advantages of M4 Ultra

Unified Memory Architecture: CPU and GPU share memory with no data copying, enabling highly efficient model loading
Massive Memory Bandwidth: M4 Ultra delivers up to 819.2 GB/s memory bandwidth, significantly boosting inference speed
Exceptional Power Efficiency: Total system power consumption of only 60-150W, far lower than NVIDIA GPU solutions
Silent Operation: Mac Studio runs nearly silent, perfect for office and home environments
Works Out of the Box: Ollama natively supports Metal with no CUDA configuration needed

M4 Ultra Benchmark Results

Test environment: Mac Studio M4 Ultra, 192GB unified memory, macOS 15.4

DeepSeek-R1-7B (Q4_K_M):
  ├── Load time: 1.2s
  ├── First token: ~80ms
  ├── Generation speed: 42 tokens/s
  └── Memory usage: ~5GB

DeepSeek-R1-32B (Q4_K_M):
  ├── Load time: 8.5s
  ├── First token: ~200ms
  ├── Generation speed: 32 tokens/s
  └── Memory usage: ~20GB

DeepSeek-R1-70B (Q4_K_M):
  ├── Load time: 25s
  ├── First token: ~400ms
  ├── Generation speed: 18 tokens/s
  └── Memory usage: ~42GB

DeepSeek-V3-671B (Q4_K_M, experimental):
  ├── Load time: ~5min
  ├── First token: ~3s
  ├── Generation speed: 2-4 tokens/s
  └── Memory usage: ~170GB

M4 Ultra Deployment Recommendations

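# Install Ollama (Metal acceleration is supported natively)
brew install ollama

# Run the recommended 32B model (the best balance on M4 Ultra)
ollama run deepseek-r1:32b

# With 192GB of memory, you can try 70B
ollama run deepseek-r1:70b

# Set the parallel request count to make full use of the M4 Ultra
OLLAMA_NUM_PARALLEL=4 ollama serve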

Cost Comparison with API Calls

Scenario 1: Individual Developer (~50K tokens/day)

Solution	Monthly Cost	Annual Cost	Notes
DeepSeek API	~$22	~$264	Pay-as-you-go, flexible
Ollama + RTX 4060	~$5 (electricity)	$60 + $300 (hardware)	Year 1: $360, then $60/year
Ollama + M4 Pro Mac	~$3 (electricity)	$36 + $2,399 (hardware)	Cost-effective long-term

Conclusion: For light individual use, API is more cost-effective.

Scenario 2: Small Team (~500K tokens/day)

Solution	Monthly Cost	Annual Cost	Notes
DeepSeek API	~$220	~$2,640	Stable, no maintenance
vLLM + RTX 4090	~$15 (electricity)	$180 + $1,600 (hardware)	ROI within 1 year
vLLM + A6000	~$20 (electricity)	$240 + $4,500 (hardware)	Larger models, better concurrency

Conclusion: For high-frequency use, the RTX 4090 setup recovers its hardware cost in about 8 months; the A6000 setup takes just under 2 years.

Scenario 3: Enterprise (~5M tokens/day)

Solution	Monthly Cost	Annual Cost	Notes
DeepSeek API	~$2,200	~$26,400	Possible rate limits
vLLM + 4x A100	~$200 (electricity)	$2,400 + $60,000 (hardware)	ROI in about 2.5 years, full control
Cloud GPU (on-demand)	~$3,000	~$36,000	Flexible, no hardware maintenance

Conclusion: For enterprise-level high-frequency use, self-hosted inference clusters are most cost-effective long-term.
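
The payback period for each option is the hardware cost divided by the monthly saving over the API. The sketch below runs that arithmetic on the figures from the three scenario tables. It is a simplification that ignores maintenance, depreciation, and staff time, and the function name is hypothetical.

```python
import math


def break_even_months(hardware_cost: float, local_monthly: float,
                      api_monthly: float) -> float:
    """Months until the monthly savings over the API pay off the hardware.

    Returns math.inf when running locally does not save money each month.
    """
    monthly_saving = api_monthly - local_monthly
    if monthly_saving <= 0:
        return math.inf
    return hardware_cost / monthly_saving


# (hardware cost, local monthly cost, API monthly cost) in USD,
# taken from the scenario tables above.
scenarios = {
    "Individual, RTX 4060": (300, 5, 22),
    "Small team, RTX 4090": (1600, 15, 220),
    "Small team, A6000": (4500, 20, 220),
    "Enterprise, 4x A100": (60000, 200, 2200),
}

for name, (hardware, local, api) in scenarios.items():
    print(f"{name}: {break_even_months(hardware, local, api):.1f} months")
```

On these inputs the payback periods come out to roughly 17.6, 7.8, 22.5, and 30.0 months respectively.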

Cost Decision Tree

What is your daily token usage?
├── < 10K tokens → Use API, local deployment not worth it
├── 10K-100K tokens → Depends on privacy needs
│   ├── Need privacy → Local deployment (Ollama + consumer GPU)
│   └── No privacy concerns → API is more convenient
├── 100K-1M tokens → Local deployment starts having cost advantages
│   ├── Individual/Small team → Ollama + RTX 4090
│   └── Need high concurrency → vLLM + professional GPU
└── > 1M tokens → Strongly recommend local deployment
    ├── Medium budget → vLLM + multi-GPU consumer setup
    └── Ample budget → vLLM + A100/H100 cluster
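
The same tree can be written as a function, for example to embed in a sizing script. This is a direct transcription under one assumption: the tree does not say which branch owns the boundary values, so the choices below (exactly 100K and exactly 1M tokens fall in the 100K-1M branch) are arbitrary. The function and parameter names are hypothetical.

```python
def recommend_deployment(daily_tokens: int, needs_privacy: bool = False,
                         high_concurrency: bool = False,
                         ample_budget: bool = False) -> str:
    """Walk the cost decision tree and return its recommendation."""
    if daily_tokens < 10_000:
        return "API"
    if daily_tokens < 100_000:
        return "Ollama + consumer GPU" if needs_privacy else "API"
    if daily_tokens <= 1_000_000:
        return ("vLLM + professional GPU" if high_concurrency
                else "Ollama + RTX 4090")
    return ("vLLM + A100/H100 cluster" if ample_budget
            else "vLLM + multi-GPU consumer setup")


# Example: a small team at 500K tokens/day without heavy concurrency.
print(recommend_deployment(500_000))
```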

Troubleshooting Common Issues

Issue 1: Slow Model Downloads

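# OLLAMA_HOST sets the address of the Ollama server itself, not a download mirror.
# To route model downloads through a proxy, set HTTPS_PROXY for the server process
# (proxy.example.com is a placeholder)
HTTPS_PROXY=https://proxy.example.com ollama serve

# Or download the model files manually and import them
ollama create deepseek-r1:7b -f /path/to/Modelfile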

If you have HuggingFace model files, you can also specify a local GGUF file path via a Modelfile.
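
As a sketch of that import path, the helper below generates Modelfile text whose `FROM` line points at a local GGUF file. The function name is hypothetical and `/path/to/model.gguf` is a placeholder for the file you actually downloaded; `FROM`, `SYSTEM`, and `PARAMETER` are the same Modelfile instructions used in the Custom Modelfile section.

```python
from typing import Dict, Optional


def render_modelfile(gguf_path: str, system_prompt: str = "",
                     parameters: Optional[Dict[str, object]] = None) -> str:
    """Build Modelfile text that imports a local GGUF file."""
    lines = [f"FROM {gguf_path}"]
    if system_prompt:
        lines.append(f'SYSTEM """{system_prompt}"""')
    for key, value in (parameters or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"


# Placeholder path: point this at the GGUF file you downloaded.
text = render_modelfile("/path/to/model.gguf",
                        system_prompt="You are a helpful assistant.",
                        parameters={"temperature": 0.3, "num_ctx": 8192})
print(text)
```

Save the output as a file named Modelfile, then register it with `ollama create <name> -f Modelfile` as shown above.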

Issue 2: CUDA Out of Memory (OOM)

# Lower GPU memory utilization (from 0.9 to 0.8) and reduce the context length
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096

# Or use more aggressive quantization
ollama run deepseek-r1:7b-q4_0  # Q4_0 instead of Q4_K_M, if that tag is published for the model

Issue 3: Slow Performance on Apple Silicon

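# Make sure Metal acceleration is in use (Ollama enables it by default)
# Check whether the GPU is being used correctly
ollama run deepseek-r1:7b --verbose

# Close other memory-hungry applications to free more unified memory for the model
# Check memory pressure in Activity Monitor

# Increase the number of layers Ollama offloads to the GPU
OLLAMA_NUM_GPU=999 ollama run deepseek-r1:7b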

Issue 4: Docker Container Cannot Access GPU

# Install the NVIDIA Container Toolkit
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Verify that the GPU is available inside a container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If problems persist, check the Docker daemon configuration
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Issue 5: Poor Model Output Quality

Check quantization precision: If using too-low quantization (e.g., 2-bit), quality degrades noticeably. Use at least Q4_K_M
Adjust temperature: Use 0.1-0.3 for code tasks, 0.5-0.7 for conversation, 0.8-1.0 for creative writing
Review system prompt: Ensure your system prompt is clear and specific
Increase context length: Some tasks require a longer context window

Issue 6: Slow Response Under Multi-User Concurrency

# Set Ollama's parallel request count
OLLAMA_NUM_PARALLEL=4 ollama serve

# vLLM has efficient concurrency handling built in
# Adding GPUs increases concurrent capacity
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --tensor-parallel-size 2 \
    --max-num-seqs 32  # Maximum number of concurrent sequences
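
To check whether these settings actually help, you can send several requests at once and compare the total time against a run with one worker. The sketch below is a minimal load generator; `run_concurrently` is a hypothetical helper, and `fake_send` is a stand-in that sleeps instead of calling a server so the example runs on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_concurrently(send: Callable[[str], str], prompts: List[str],
                     workers: int = 4) -> Tuple[List[str], float]:
    """Send all prompts through `send` using a thread pool.

    Returns the replies in prompt order and the total wall-clock seconds.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        replies = list(pool.map(send, prompts))
    return replies, time.perf_counter() - start


def fake_send(prompt: str) -> str:
    """Stand-in for a real request; sleeps briefly to mimic server latency."""
    time.sleep(0.05)
    return prompt.upper()


replies, seconds = run_concurrently(fake_send, ["a", "b", "c", "d"], workers=4)
print(replies, f"{seconds:.2f}s")
```

For a real test, replace `fake_send` with a function that calls `client.chat.completions.create(...)` on the OpenAI-compatible client shown earlier and returns `response.choices[0].message.content`, then compare `workers=1` against `workers=4`.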

Summary

Method	Best For	Difficulty	Performance	Rating
Ollama	Individual developers, beginners	Easy	Good	Highly Recommended
vLLM	Production, high concurrency	Medium	Excellent	Recommended
Docker	Team collaboration, standardized deployment	Easy-Medium	Very Good	Recommended

Recommended path for beginners: Start with Ollama for quick experimentation, migrate to vLLM when you need better performance, and use Docker for standardized deployments.

Deploying DeepSeek models locally isn't complicated. Choose the right approach, and you can have a powerful AI model running on your device in minutes. Start your local AI journey today!
