DeepSeek API Best Practices & Performance Optimization Guide

When deploying the DeepSeek API in production, sound architectural design and performance optimization are critical. This guide covers everything from basic API calls to advanced optimization techniques to help you build efficient, stable, and cost-effective AI applications.

1. DeepSeek API Overview

DeepSeek offers multiple models for different use cases, all accessible through an OpenAI-compatible API format:

Model	Use Case	Context Length	Highlights
DeepSeek-V3	General chat, content generation	128K	Cost-effective general-purpose model
DeepSeek-R1	Complex reasoning, math proofs	128K	Deep chain-of-thought reasoning
DeepSeek-Coder	Code generation, code review	128K	Code-specialized model

All models share a unified API endpoint; switching models only requires changing the model parameter.

2. API Basics

2.1 Authentication Configuration

The DeepSeek API uses Bearer token authentication and is compatible with the OpenAI SDK:

from openai import OpenAI

# Initialize the client, pointing it at the DeepSeek API endpoint
client = OpenAI(
    api_key="sk-your-api-key",
    base_url="https://api.deepseek.com"  # DeepSeek API base URL
)

import OpenAI from 'openai';

// Initialize the client and configure the DeepSeek endpoint
const client = new OpenAI({
  apiKey: 'sk-your-api-key',
  baseURL: 'https://api.deepseek.com',  // DeepSeek API base URL
});
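
Hardcoding the key as in the snippets above is convenient for a demo, but a deployed service usually reads it from the environment. A minimal sketch, assuming the key is stored in a variable named `DEEPSEEK_API_KEY`; both that name and the `load_api_key` helper are illustrative and not defined by DeepSeek or the OpenAI SDK:

```python
import os


def load_api_key(env_var: str = "DEEPSEEK_API_KEY") -> str:
    """Return the API key from the environment, failing fast if it is missing.

    The variable name is an assumption for this example; use whatever
    name your deployment defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key
```

The result can then be passed as `api_key=load_api_key()` when constructing the client.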

2.2 Basic Call Example

# Basic chat completion call
response = client.chat.completions.create(
    model="deepseek-chat",        # use the V3 model
    messages=[
        {"role": "system", "content": "You are a professional technical assistant."},
        {"role": "user", "content": "Explain Python's GIL mechanism"}
    ],
    temperature=0.7,               # controls output randomness
    max_tokens=2048,               # maximum number of output tokens
    top_p=0.95                     # nucleus sampling parameter
)

print(response.choices[0].message.content)

Equivalent curl call:

curl https://api.deepseek.com/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key" \
  -d '{
    "model": "deepseek-chat",
    "messages": [
      {"role": "system", "content": "You are a professional technical assistant."},
      {"role": "user", "content": "Explain Python GIL mechanism"}
    ],
    "temperature": 0.7,
    "max_tokens": 2048
  }'

3. Prompt Engineering Best Practices

3.1 System Prompt Design Principles

# Example of a structured system prompt
system_prompt = """You are a professional data analyst. Follow these rules:

## Role Definition
- Focus on data analysis and visualization recommendations
- Use professional yet accessible language

## Output Format
- Use Markdown format
- Include specific code examples
- Present key data in tables

## Constraints
- Do not fabricate data
- Clearly mark uncertain content
- Keep responses under 500 words
"""

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Analyze the trends in this sales data"}
    ]
)

3.2 Few-shot Optimization

# Use few-shot examples to make the output more consistent
messages = [
    {"role": "system", "content": "You are a JSON formatting assistant. Convert natural language to structured data."},
    # Example 1
    {"role": "user", "content": "John, male, 28 years old, New York"},
    {"role": "assistant", "content": '{"name": "John", "gender": "male", "age": 28, "city": "New York"}'},
    # Example 2
    {"role": "user", "content": "Jane, female, 35 years old, London"},
    {"role": "assistant", "content": '{"name": "Jane", "gender": "female", "age": 35, "city": "London"}'},
    # Actual query
    {"role": "user", "content": "Bob, male, 42 years old, Tokyo"}
]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    temperature=0       # a low temperature is recommended for structured output
)
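
Even with few-shot examples and a temperature of 0, the reply is still free text, so it is worth validating before it reaches downstream code. A minimal sketch, assuming the four fields used in the examples above; `parse_person` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration:

```python
import json

# Field names and types taken from the few-shot examples above
REQUIRED_FIELDS = {"name": str, "gender": str, "age": int, "city": str}


def parse_person(raw: str) -> dict:
    """Parse the model's reply and check that the expected fields are present."""
    data = json.loads(raw.strip())  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or invalid field: {field}")
    return data
```

Typical usage would be `parse_person(response.choices[0].message.content)`, with a retry or fallback when a `ValueError` is raised.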

3.3 Chain-of-Thought Prompting

# Use the R1 model for deep reasoning
response = client.chat.completions.create(
    model="deepseek-reasoner",   # R1 reasoning model
    messages=[
        {
            "role": "user",
            "content": """Please analyze the following problem step by step:

A pool has two inlet pipes and one outlet pipe.
Inlet A fills 3 cubic meters per hour, inlet B fills 5 cubic meters per hour.
The outlet drains 2 cubic meters per hour.
The pool capacity is 120 cubic meters.
Starting from empty, how many hours to fill the pool?

Show the complete reasoning process."""
        }
    ]
)

4. Streaming Output Implementation

4.1 Python Streaming

# Streaming output: content arrives chunk by chunk, lowering time to first token
stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "user", "content": "Write a short essay about artificial intelligence"}
    ],
    stream=True          # enable streaming output
)

# Process the response chunk by chunk
full_response = ""
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        content = chunk.choices[0].delta.content
        full_response += content
        print(content, end="", flush=True)  # print as it arrives

4.2 Node.js Streaming

// Handle the streaming response with an async iterator
async function streamChat(prompt) {
  const stream = await client.chat.completions.create({
    model: 'deepseek-chat',
    messages: [{ role: 'user', content: prompt }],
    stream: true,  // enable streaming output
  });

  let fullResponse = '';
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    fullResponse += content;
    process.stdout.write(content);  // write to the console as it arrives
  }
  return fullResponse;
}

// Example call
streamChat('Implement a quicksort algorithm in JavaScript');

4.3 SSE (Server-Sent Events) Web Integration

from flask import Flask, Response
import json

app = Flask(__name__)

@app.route('/api/chat', methods=['POST'])
def chat_stream():
    """SSE streaming endpoint for real-time display in a frontend."""
    def generate():
        stream = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": "Hello"}],
            stream=True
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                # Emit the data in SSE format
                yield f"data: {json.dumps({'content': content})}\n\n"
        yield "data: [DONE]\n\n"  # end-of-stream marker

    return Response(generate(), mimetype='text/event-stream')
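
On the consuming side, the stream produced by this endpoint is a sequence of `data:` lines ending with `[DONE]`. A minimal sketch of a parser for that format, assuming the lines have already been decoded and split by whatever HTTP client is in use; `iter_sse_content` is a hypothetical helper, not part of Flask or the OpenAI SDK:

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the `content` field of each SSE data line until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker emitted by the endpoint
        yield json.loads(payload)["content"]
```

Joining the yielded pieces reconstructs the full reply, while iterating over them one at a time allows incremental display.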

5. Batch Processing Optimization

5.1 Batch Request Handling

import asyncio
from openai import AsyncOpenAI

# Use the async client to issue requests concurrently
async_client = AsyncOpenAI(
    api_key="sk-your-api-key",
    base_url="https://api.deepseek.com"
)

async def process_batch(prompts: list[str], max_concurrent: int = 5):
    """Process multiple requests, using a semaphore to cap concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)  # limit concurrent requests

    async def single_request(prompt):
        async with semaphore:
            response = await async_client.chat.completions.create(
                model="deepseek-chat",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=1024
            )
            return response.choices[0].message.content

    # Run all requests concurrently
    tasks = [single_request(p) for p in prompts]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

# Usage example
prompts = [
    "Summarize the core principles of quantum computing",
    "Explain blockchain consensus mechanisms",
    "Describe the backpropagation algorithm in neural networks",
    "Introduce basic concepts of reinforcement learning",
    "Explain the attention mechanism in Transformer architecture"
]

results = asyncio.run(process_batch(prompts, max_concurrent=3))
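for i, result in enumerate(results):
    print(f"--- Question {i+1} ---")
    if isinstance(result, Exception):
        # gather(return_exceptions=True) returns failures as exception objects
        print(f"Request failed: {result!r}")
    else:
        print(result[:200])  # print the first 200 characters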

5.2 JSONL Batch File Format

import json

def create_batch_file(requests: list[dict], output_path: str):
    """Create a batch file in JSONL format."""
    with open(output_path, 'w', encoding='utf-8') as f:
        for i, req in enumerate(requests):
            batch_item = {
                "custom_id": f"request-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "deepseek-chat",
                    "messages": req["messages"],
                    "max_tokens": req.get("max_tokens", 1024)
                }
            }
            f.write(json.dumps(batch_item, ensure_ascii=False) + "\n")

# Build the list of batch requests
batch_requests = [
    {"messages": [{"role": "user", "content": f"Translate to Chinese: {text}"}]}
    for text in ["Hello World", "Artificial Intelligence", "Deep Learning", "NLP"]
]

create_batch_file(batch_requests, "batch_input.jsonl")

6. Function Calling / Tool Use

6.1 Defining Tool Functions

# Define the tools the model can call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather information for a specified city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g., Beijing, New York"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search product information in the database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search keywords"
                    },
                    "category": {
                        "type": "string",
                        "description": "Product category"
                    },
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum number of results"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

6.2 Complete Tool Use Workflow

import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Mock weather lookup."""
    return {"city": city, "temperature": 22, "unit": unit, "condition": "sunny"}

def search_database(query: str, category: str = None, max_results: int = 5) -> list:
    """Mock database query."""
    return [{"name": f"{query} related product", "price": 99.9, "category": category}]

# Map tool names to their implementations
tool_functions = {
    "get_weather": get_weather,
    "search_database": search_database,
}

def run_with_tools(user_message: str):
    """Full conversation flow with tool calls."""
    messages = [{"role": "user", "content": user_message}]

    # First call: let the model decide whether to use a tool
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        tools=tools,
        tool_choice="auto"   # let the model decide whether to call a tool
    )

    assistant_message = response.choices[0].message

    # Check whether the model requested any tool calls
    if assistant_message.tool_calls:
        messages.append(assistant_message)

        # Execute each tool call
        for tool_call in assistant_message.tool_calls:
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)

            # Call the matching tool function
            result = tool_functions[func_name](**func_args)

            # Append the tool result to the message list
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result, ensure_ascii=False)
            })

        # Second call: the model produces the final answer from the tool results
        final_response = client.chat.completions.create(
            model="deepseek-chat",
            messages=messages,
            tools=tools
        )
        return final_response.choices[0].message.content

    return assistant_message.content

# Usage example
print(run_with_tools("What's the weather in Beijing? Also search for sunscreen products"))
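
The workflow above assumes the model always names a known tool and supplies well-formed arguments. A minimal sketch of a more defensive dispatch step, under the assumption that errors should be reported back to the model as a JSON payload instead of raising; `dispatch_tool_call` is a hypothetical helper, not part of the OpenAI SDK:

```python
import json
from typing import Callable


def dispatch_tool_call(name: str, arguments: str,
                       registry: dict[str, Callable]) -> str:
    """Run one tool call and always return a JSON string for the tool message."""
    func = registry.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments or "{}")
        return json.dumps(func(**kwargs), ensure_ascii=False)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the function signature
        return json.dumps({"error": f"bad arguments for {name}: {exc}"})
```

The returned string can be used directly as the `content` of the `tool` message, so the model sees the error and has a chance to correct its call.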

6.3 Node.js Tool Use Implementation

// Tool definitions
const tools = [
  {
    type: 'function',
    function: {
      name: 'calculate',
      description: 'Perform mathematical calculations',
      parameters: {
        type: 'object',
        properties: {
          expression: { type: 'string', description: 'Mathematical expression' }
        },
        required: ['expression']
      }
    }
  }
];

// Tool implementations
const toolFunctions = {
  calculate: ({ expression }) => {
    try {
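      // Caution: Function() evaluates arbitrary JavaScript and is not safe for untrusted input; use a dedicated expression parser in production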
      const result = Function(`"use strict"; return (${expression})`)();
      return { result, expression };
    } catch (e) {
      return { error: 'Calculation failed', expression };
    }
  }
};

// Chat function with tool calls
async function chatWithTools(userMessage) {
  const messages = [{ role: 'user', content: userMessage }];

  const response = await client.chat.completions.create({
    model: 'deepseek-chat',
    messages,
    tools,
    tool_choice: 'auto',  // let the model decide whether to call a tool
  });

  const assistantMsg = response.choices[0].message;

  if (assistantMsg.tool_calls) {
    messages.push(assistantMsg);

    // Execute the tool calls one by one
    for (const toolCall of assistantMsg.tool_calls) {
      const args = JSON.parse(toolCall.function.arguments);
      const result = toolFunctions[toolCall.function.name](args);
      messages.push({
        role: 'tool',
        tool_call_id: toolCall.id,
        content: JSON.stringify(result),
      });
    }

    // Send the tool results back to the model for the final answer
    const finalResponse = await client.chat.completions.create({
      model: 'deepseek-chat',
      messages,
      tools,
    });
    return finalResponse.choices[0].message.content;
  }

  return assistantMsg.content;
}
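
The `calculate` tool above passes a model-supplied string to `Function(...)`, which will execute any JavaScript it is given. On a Python backend, one way to avoid that is to parse the expression and evaluate only arithmetic nodes. A minimal sketch, assuming only `+`, `-`, `*`, `/` and parentheses on numeric literals are needed; `safe_calculate` is a hypothetical helper, not part of any SDK:

```python
import ast
import operator
from typing import Union

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_calculate(expression: str) -> Union[int, float]:
    """Evaluate +, -, *, / on numeric literals; reject everything else."""
    def _eval(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. are never evaluated
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)
```

Function calls, names, and attribute access all fall through to the `ValueError`, so the tool can return an error payload to the model instead of running the input.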

7. Rate Limiting & Retry Strategies

7.1 Exponential Backoff Retry

import time
import random
from openai import RateLimitError, APITimeoutError, APIConnectionError

def call_with_retry(
    func,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0
):
    """Call func, retrying with exponential backoff on transient errors."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError as e:
            # Rate limit: back off longer (exponential, with jitter)
            delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
            print(f"Rate limited. Waiting {delay:.1f}s before retry ({attempt+1}/{max_retries})")
            time.sleep(delay)
        except APITimeoutError:
            # Timeout: back off more gently
            delay = min(base_delay * (1.5 ** attempt), max_delay)
            print(f"Timeout. Waiting {delay:.1f}s before retry")
            time.sleep(delay)
        except APIConnectionError:
            # Connection error: moderate backoff
            delay = min(base_delay * (2 ** attempt), max_delay)
            print(f"Connection error. Waiting {delay:.1f}s before retry")
            time.sleep(delay)

    raise Exception(f"Failed after {max_retries} retries")

# Usage example
result = call_with_retry(
    lambda: client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": "Hello"}],
        timeout=30  # 30-second timeout
    )
)
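
Before settling on `max_retries`, `base_delay`, and `max_delay`, it helps to see the waits they imply. A minimal sketch that reproduces the rate-limit branch's schedule without the random jitter; `backoff_schedule` is a hypothetical helper added for illustration:

```python
def backoff_schedule(max_retries: int = 5, base_delay: float = 1.0,
                     max_delay: float = 60.0) -> list[float]:
    """Delay in seconds after each failed attempt, without jitter, capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay)
            for attempt in range(max_retries)]
```

With the defaults the delays double from 1 second up to 16 seconds, so a call that fails every time spends about 31 seconds waiting, plus jitter. Summing the list gives a quick worst-case figure to compare against upstream timeouts.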

7.2 Advanced Retry with tenacity

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)
from openai import RateLimitError, APITimeoutError

@retry(
    stop=stop_after_attempt(5),                      # at most 5 attempts in total
    wait=wait_exponential(multiplier=1, max=60),     # exponential backoff, capped at 60 seconds
    retry=retry_if_exception_type(                   # retry only on these exceptions
        (RateLimitError, APITimeoutError)
    ),
    before_sleep=lambda info: print(                 # log before each retry sleep
        f"Retrying in {info.next_action.sleep:.1f}s..."
    )
)
def reliable_api_call(messages: list, model: str = "deepseek-chat"):
    """Reliable API call with automatic retries."""
    return client.chat.completions.create(
        model=model,
        messages=messages,
        timeout=30
    )

7.3 Token Bucket Rate Limiter

import time
import threading

class TokenBucketRateLimiter:
    """Token bucket rate limiter that controls API request frequency."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum bucket capacity
        self.tokens = capacity        # current number of tokens
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Acquire one token, waiting if none is available."""
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.last_refill
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
                self.last_refill = now

                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.1)

# Usage example: about 10 requests per second on average, with bursts of up to 10
limiter = TokenBucketRateLimiter(rate=10, capacity=10)

def rate_limited_call(messages):
    """Rate-limited API call."""
    limiter.acquire()
    return client.chat.completions.create(
        model="deepseek-chat",
        messages=messages
    )

8. Cost Optimization Tips

8.1 Prompt Caching

# Use prefix caching to reduce the cost of repeated requests
# The fixed system prompt serves as the cacheable prefix
CACHED_SYSTEM_PROMPT = """You are a professional customer service assistant handling:
1. Product inquiries
2. Order tracking
3. After-sales service
4. Complaints and suggestions

Always maintain a polite and professional tone. For issues you cannot handle,
guide users to contact human support.
"""

def customer_service_chat(user_message: str, conversation_history: list = None):
    """Customer service chat that reuses a fixed prefix so it can be served from cache."""
    messages = [{"role": "system", "content": CACHED_SYSTEM_PROMPT}]

    if conversation_history:
        messages.extend(conversation_history)

    messages.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        max_tokens=512      # cap output length to control cost
    )

    # Check cache hit statistics
    usage = response.usage
    print(f"Input tokens: {usage.prompt_tokens}")
    print(f"Output tokens: {usage.completion_tokens}")
    if hasattr(usage, 'prompt_cache_hit_tokens'):
        print(f"Cache hit tokens: {usage.prompt_cache_hit_tokens}")

    return response.choices[0].message.content

8.2 Prompt Compression Strategies

def compress_prompt(text: str, max_length: int = 2000) -> str:
    """Compress long text to reduce token consumption."""
    if len(text) <= max_length:
        return text

    # Strategy 1: use the model itself to summarize the text
    summary_response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "Compress the following content into a concise summary, preserving key information:"},
            {"role": "user", "content": text}
        ],
        max_tokens=500,
        temperature=0
    )
    return summary_response.choices[0].message.content

def smart_context_window(messages: list, max_tokens: int = 4000) -> list:
    """Manage the context window by keeping the system prompt plus the earliest and most recent messages."""
    if not messages:
        return messages

    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    # If there are too many messages, keep the first and last few
    if len(non_system) > 10:
        trimmed = non_system[:2] + [
            {"role": "system", "content": "[Earlier conversation omitted]"}
        ] + non_system[-6:]
        return system_msgs + trimmed

    return messages

8.3 Model Selection Strategy

def smart_model_selection(query: str) -> str:
    """Pick a model from keywords in the query, balancing capability and cost."""

    # Keywords for simple tasks
    simple_keywords = ["translate", "summarize", "rewrite", "format", "extract"]
    # Keywords for complex tasks
    complex_keywords = ["prove", "derive", "analyze", "design architecture", "math"]
    # Keywords for coding tasks
    code_keywords = ["code", "programming", "debug", "refactor", "implement function"]

    query_lower = query.lower()

    if any(kw in query_lower for kw in code_keywords):
        return "deepseek-coder"           # coding tasks: Coder
    elif any(kw in query_lower for kw in complex_keywords):
        return "deepseek-reasoner"        # complex reasoning: R1
    else:
        return "deepseek-chat"            # general tasks: V3

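# Model pricing comparison
MODEL_PRICING = {
    "deepseek-chat": {
        "input": 0.27,       # input price per million tokens (USD)
        "output": 1.10,      # output price per million tokens (USD)
        "cache_hit": 0.07    # input price per million tokens on a cache hit (USD)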
    },
    "deepseek-reasoner": {
        "input": 0.55,
        "output": 2.19,
        "cache_hit": 0.14
    }
}

def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Estimate the cost of a single call (USD)."""
    pricing = MODEL_PRICING.get(model, MODEL_PRICING["deepseek-chat"])
    cost = (input_tokens / 1_000_000 * pricing["input"] +
            output_tokens / 1_000_000 * pricing["output"])
    return round(cost, 6)
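
The estimate above bills every input token at the full input rate, although the pricing table also lists a `cache_hit` price. A minimal sketch that takes cached tokens into account, assuming the input is split into cache-hit and cache-miss token counts and that the `cache_hit` price applies to the former; `estimate_cost_with_cache` and `CHAT_PRICING` are hypothetical names, with the prices copied from the table above:

```python
# Prices per million tokens, copied from the deepseek-chat entry above
CHAT_PRICING = {"input": 0.27, "output": 1.10, "cache_hit": 0.07}


def estimate_cost_with_cache(cache_hit_tokens: int, cache_miss_tokens: int,
                             output_tokens: int, pricing: dict) -> float:
    """Estimate cost, billing cached input tokens at the cache-hit rate."""
    per_million = 1_000_000
    cost = (cache_hit_tokens / per_million * pricing["cache_hit"]
            + cache_miss_tokens / per_million * pricing["input"]
            + output_tokens / per_million * pricing["output"])
    return round(cost, 6)
```

The result is in the same currency units as the pricing table. Comparing it with the uncached estimate for the same traffic shows how much a stable prompt prefix is worth.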

9. Error Handling and Monitoring

9.1 Comprehensive Error Handling

from openai import (
    APIError,
    AuthenticationError,
    RateLimitError,
    APITimeoutError,
    BadRequestError,
    APIConnectionError
)
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("deepseek_api")

def robust_api_call(messages: list, **kwargs):
    """Robust API call with explicit handling for each error type."""
    try:
        response = client.chat.completions.create(
            model=kwargs.get("model", "deepseek-chat"),
            messages=messages,
            **{k: v for k, v in kwargs.items() if k != "model"}
        )
        logger.info(
            f"API call success | Input: {response.usage.prompt_tokens} tokens "
            f"| Output: {response.usage.completion_tokens} tokens"
        )
        return response

    except AuthenticationError:
        logger.error("Authentication failed: Please verify your API key")
        raise
    except RateLimitError as e:
        logger.warning(f"Rate limit triggered: {e.message}")
        raise
    except BadRequestError as e:
        logger.error(f"Bad request: {e.message}")
        raise
    except APITimeoutError:
        logger.warning("Request timed out. Consider reducing input length or increasing timeout")
        raise
    except APIConnectionError:
        logger.error("Connection failed. Check network and API endpoint configuration")
        raise
    except APIError as e:
        logger.error(f"API error (status {e.status_code}): {e.message}")
        raise

9.2 Metrics Collection and Monitoring

import time
from dataclasses import dataclass, field

@dataclass
class APIMetrics:
    """Collector for API call metrics."""
    total_calls: int = 0
    successful_calls: int = 0
    failed_calls: int = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_latency: float = 0.0
    errors: dict = field(default_factory=dict)

    @property
    def avg_latency(self) -> float:
        """Average latency in seconds."""
        return self.total_latency / max(self.total_calls, 1)

    @property
    def success_rate(self) -> float:
        """Fraction of calls that succeeded."""
        return self.successful_calls / max(self.total_calls, 1)

    def report(self) -> str:
        """Generate a monitoring report."""
        return f"""
=== DeepSeek API Monitoring Report ===
Total Calls: {self.total_calls}
Success Rate: {self.success_rate:.1%}
Avg Latency: {self.avg_latency:.2f}s
Total Input Tokens: {self.total_input_tokens:,}
Total Output Tokens: {self.total_output_tokens:,}
Error Distribution: {self.errors}
"""

metrics = APIMetrics()

def monitored_call(messages: list, **kwargs):
    """API call with metrics collection."""
    metrics.total_calls += 1
    start = time.time()

    try:
        response = client.chat.completions.create(
            model=kwargs.get("model", "deepseek-chat"),
            messages=messages,
            **{k: v for k, v in kwargs.items() if k != "model"}
        )
        metrics.successful_calls += 1
        metrics.total_input_tokens += response.usage.prompt_tokens
        metrics.total_output_tokens += response.usage.completion_tokens
        return response

    except Exception as e:
        metrics.failed_calls += 1
        error_type = type(e).__name__
        metrics.errors[error_type] = metrics.errors.get(error_type, 0) + 1
        raise
    finally:
        metrics.total_latency += time.time() - start
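
The collector above tracks only an average latency, which can hide a slow tail. A minimal sketch of a nearest-rank percentile that could be applied to a list of per-call latencies, assuming those latencies are recorded somewhere (the collector as written keeps only their sum); `percentile` is a hypothetical helper:

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]
```

Reporting the 95th percentile alongside the mean makes occasional multi-second calls visible even when most requests are fast.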

10. LangChain Integration

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Initialize the DeepSeek model through LangChain's OpenAI-compatible interface
llm = ChatOpenAI(
    model="deepseek-chat",
    openai_api_key="sk-your-api-key",
    openai_api_base="https://api.deepseek.com",
    temperature=0.7,
    max_tokens=2048,
    streaming=True            # enable streaming output
)

# Use a prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a {role}. Answer questions in a {style} manner."),
    ("human", "{question}")
])

# Build the chain
chain = prompt | llm

# Synchronous call
result = chain.invoke({
    "role": "tech expert",
    "style": "concise and professional",
    "question": "What are the pros and cons of microservice architecture?"
})
print(result.content)

# Streaming call
async def stream_langchain():
    async for chunk in chain.astream({
        "role": "tech expert",
        "style": "concise and professional",
        "question": "What are the pros and cons of microservice architecture?"
    }):
        print(chunk.content, end="", flush=True)

11. LlamaIndex Integration

from llama_index.llms.openai_like import OpenAILike
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

# Configure DeepSeek as the LLM for LlamaIndex
llm = OpenAILike(
    model="deepseek-chat",
    api_base="https://api.deepseek.com",
    api_key="sk-your-api-key",
    is_chat_model=True,
    temperature=0.7,
    max_tokens=2048
)

# Set the global default LLM
Settings.llm = llm

# Build the RAG pipeline
documents = SimpleDirectoryReader("./data").load_data()  # load documents
index = VectorStoreIndex.from_documents(documents)       # build the index
query_engine = index.as_query_engine(                    # create the query engine
    similarity_top_k=3,            # retrieve the top 3 relevant chunks
    streaming=True                 # enable streaming output
)

# Run the query
response = query_engine.query("What are the features of DeepSeek V3's MoE architecture?")
response.print_response_stream()   # with streaming=True, print tokens as they arrive

12. Latency Optimization

12.1 Context Length Control

import tiktoken

def count_tokens(text: str, model: str = "deepseek-chat") -> int:
    """Estimate the token count (cl100k_base is only an approximation; DeepSeek uses its own tokenizer)."""
    encoder = tiktoken.get_encoding("cl100k_base")
    return len(encoder.encode(text))

def optimize_context(messages: list, max_context_tokens: int = 8000) -> list:
    """Trim the context to a token budget to reduce latency."""
    total_tokens = sum(count_tokens(m["content"]) for m in messages)

    if total_tokens <= max_context_tokens:
        return messages

    optimized = []
    system_msg = None
    remaining_tokens = max_context_tokens

    # Keep the system message
    for msg in messages:
        if msg["role"] == "system":
            system_msg = msg
            remaining_tokens -= count_tokens(msg["content"])
            break

    if system_msg:
        optimized.append(system_msg)

    # Walk from the newest message backwards, keeping messages until the budget is used up
    non_system = [m for m in messages if m["role"] != "system"]
    kept = []
    for msg in reversed(non_system):
        msg_tokens = count_tokens(msg["content"])
        if remaining_tokens < msg_tokens:
            break
        kept.append(msg)
        remaining_tokens -= msg_tokens

    # kept is newest-first; reverse it to restore chronological order
    return optimized + kept[::-1]

12.2 Concurrent Request Optimization

import asyncio
import aiohttp

class DeepSeekBatchClient:
    """High-throughput client for batches of requests."""

    def __init__(self, api_key: str, max_concurrent: int = 10):
        self.api_key = api_key
        self.base_url = "https://api.deepseek.com"
        self.semaphore = asyncio.Semaphore(max_concurrent)  # concurrency limit
        self.session = None

    async def __aenter__(self):
        connector = aiohttp.TCPConnector(
            limit=20,               # maximum number of connections
            keepalive_timeout=30     # keep-alive timeout in seconds
        )
        self.session = aiohttp.ClientSession(
            connector=connector,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            }
        )
        return self

    async def __aexit__(self, *args):
        if self.session:
            await self.session.close()

    async def _single_request(self, payload: dict) -> dict:
        """Send a single request, using the semaphore to limit concurrency."""
        async with self.semaphore:
            async with self.session.post(
                f"{self.base_url}/chat/completions",
                json=payload,
                timeout=aiohttp.ClientTimeout(total=60)
            ) as response:
                return await response.json()

    async def batch_complete(self, prompts: list[str], model: str = "deepseek-chat") -> list:
        """Complete a batch of prompts."""
        payloads = [
            {
                "model": model,
                "messages": [{"role": "user", "content": p}],
                "max_tokens": 1024
            }
            for p in prompts
        ]

        tasks = [self._single_request(p) for p in payloads]
        return await asyncio.gather(*tasks, return_exceptions=True)

# Usage example
async def main():
    async with DeepSeekBatchClient("sk-your-api-key", max_concurrent=5) as client:
        prompts = [f"Briefly explain: {topic}" for topic in [
            "Machine Learning", "Deep Learning", "Reinforcement Learning",
            "Transfer Learning", "Federated Learning"
        ]]
        results = await client.batch_complete(prompts)
        for r in results:
            if isinstance(r, dict) and "choices" in r:
                print(r["choices"][0]["message"]["content"][:100])

asyncio.run(main())

12.3 Latency Comparison & Optimization Recommendations

Optimization	Latency Reduction	Cost Impact	Implementation Difficulty
Enable streaming	80% lower TTFT	None	Low
Reduce context length	20-50%	Lower	Medium
Use prompt caching	10-30%	Lower	Low
Concurrent requests	5-10x throughput	None	Medium
Select appropriate model	30-60%	Lower	Low
Control max_tokens	10-40%	Lower	Low
Connection pool reuse	5-15%	None	Low
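
The figures in this table depend on the workload, so it is worth measuring them on your own traffic. A minimal sketch for timing a stream, assuming the caller supplies an iterable of text chunks (for example the `delta.content` values from a streaming response); `measure_stream` and its injectable `clock` parameter are hypothetical and exist only to make the example testable:

```python
import time
from typing import Callable, Iterable


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream of text chunks and report TTFT and total time in seconds."""
    start = clock()
    first_token_at = None
    parts = []
    for text in chunks:
        if text and first_token_at is None:
            first_token_at = clock()  # first non-empty chunk
        parts.append(text)
    end = clock()
    return {
        "text": "".join(parts),
        "ttft": None if first_token_at is None else first_token_at - start,
        "total": end - start,
    }
```

The timer starts when iteration begins, so to include request setup in the measurement, pass a generator that issues the API call lazily on its first step.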

Summary

Using the DeepSeek API efficiently in production involves several dimensions:

Model Selection: Match the right model (V3 / R1 / Coder) to task complexity
Prompt Engineering: Structured prompt design, few-shot examples, chain-of-thought reasoning
Performance Optimization: Streaming output, concurrency control, context management
Cost Control: Cache utilization, prompt compression, output length limits
Reliability: Retry strategies, error handling, monitoring and alerting

Applied together, these practices help you build efficient, stable, and economical AI applications. Start with small-scale validation, introduce optimizations gradually, and keep monitoring key metrics so that regressions surface early.
