A Pitfall-Avoidance Guide to Deploying the Full-Size DeepSeek R1 with vLLM 0.7.1

Published on 2025-2-6 15:33

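Today I saw the vLLM folks announce pipeline parallelism (PP) support for DeepSeek R1 on WeChat Moments, so I started tinkering right away. After all, if the huge MoE I am training ever goes live, I need to have the technical groundwork ready, right? Here are the pitfalls I ran into, hopefully in plainer language than the official docs.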

Distributed Inference and Serving: https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes

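游凯超 (@游凯超 on Zhihu) insisted that the whole process should be silky smooth, so the two of us ran a few verifications together. By now Step 0 and Step 3 alone should be enough to get things running; if you hit autoscaler-related errors, Step 1 shows how to resolve them.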

Step 0 Prepare weights & Environment

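The weights are so large that downloading them over a direct connection is not recommended, even if your bandwidth is decent. Fetch a copy from HF and/or a proxy first; a direct connection will most likely time out or saturate your public IP. Today's setup is a multi-node, multi-GPU deployment on 8xH20 (x2), which corresponds to TP size 8 and PP size 2, so you need two such machines. I also assume that the two machines can reach each other over the network (IB is not strictly required) and that storage is shared (NAS or OSS both work). Once this is ready, move on to the first step.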
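As a quick sanity check on the sizing above, the number of GPU workers is the product of the TP and PP sizes. The sketch below only does that arithmetic; the function names and the gpus_per_node parameter are made up for illustration.

```python
def required_gpus(tp_size: int, pp_size: int) -> int:
    """GPU workers for one model replica: TP ranks per stage times PP stages."""
    return tp_size * pp_size


def required_nodes(tp_size: int, pp_size: int, gpus_per_node: int) -> int:
    """Number of nodes needed, rounding up to whole machines."""
    total = required_gpus(tp_size, pp_size)
    return -(-total // gpus_per_node)  # ceiling division


if __name__ == "__main__":
    # TP size 8 and PP size 2 on 8-GPU machines, as in this article.
    print(required_gpus(8, 2))
    print(required_nodes(8, 2, 8))
```

With TP 8 and PP 2 this gives 16 workers on two 8-GPU machines.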

Step 1 Set up Ray & Cluster

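The official docs only skim over this part, yet it is where I was stuck the longest. What the docs ask for is two nodes joined into a Ray cluster with the ray start CLI, because vLLM uses that cluster later. There are two catches here. The first is that the startup command seems slightly off: in my first few attempts I kept getting errors from the Ray autoscaler: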

(autoscaler +1m19s) Error: No available node types can fulfill resource request {'node:33.18.26.153': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m54s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:33.18.26.153': 0.001}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +2m29s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:33.18.26.153': 0.001}. Add suitable node types to this cluster to resolve this issue.
INFO 02-02 09:39:14 ray_utils.py:212] Waiting for creating a placement group of specs for 150 seconds. specs=[{'node:33.18.26.153': 0.001, 'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.

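This looks odd, because what vLLM requests from the Ray cluster is a custom resource, 'node:33.18.26.153': 0.001, which can be read as vLLM preferring the driver node. As far as I remember, though, a resource like this has to be set by yourself when starting Ray: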

https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources

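Only then does such a resource exist. The underlying cause is that a machine with multiple (virtual) NICs sits on several network segments, and vLLM assumes the POD IP is used to address the Ray master.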
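One way to confirm this diagnosis is to compare the resource vLLM asks for with what the cluster actually advertises. The sketch below takes a plain dict, such as the one returned by ray.cluster_resources(), and reports what is missing. The function name diagnose_cluster is hypothetical, and the check only mirrors the log above, not vLLM's own logic.

```python
def diagnose_cluster(resources: dict, host_ip: str, gpus_needed: int) -> list:
    """Return a list of problems; an empty list means the request can be met."""
    problems = []
    node_key = f"node:{host_ip}"
    if node_key not in resources:
        # Show which node:<ip> resources exist so the mismatch is visible.
        known = sorted(k for k in resources if k.startswith("node:"))
        problems.append(f"{node_key} is not advertised; cluster has {known}")
    gpus = resources.get("GPU", 0)
    if gpus < gpus_needed:
        problems.append(f"only {gpus} GPUs available, {gpus_needed} requested")
    return problems


# Usage on a live cluster (needs ray installed and a running cluster):
#   import ray
#   ray.init(address="auto")
#   print(diagnose_cluster(ray.cluster_resources(), "33.18.26.153", 16))
```

If the first problem shows up while the GPU count is fine, the cluster is healthy but registered under a different IP than the one vLLM resolved, which is the case the two solutions below address.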

Solution 1: set VLLM_HOST_IP

# Get local IP address and set on every node before Ray start
VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
export VLLM_HOST_IP
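On a multi-NIC machine the first address printed by hostname -I is not necessarily the one on the segment the nodes share. As a hedged alternative, the sketch below picks the first address that falls inside a given subnet. The function name pick_ip and the example CIDR are assumptions for illustration; substitute the segment your nodes actually use.

```python
import ipaddress


def pick_ip(hostname_i_output: str, cidr: str) -> str:
    """Return the first address from `hostname -I` output that lies in cidr."""
    network = ipaddress.ip_network(cidr)
    for token in hostname_i_output.split():
        addr = ipaddress.ip_address(token)
        # Skip addresses of the other IP version before testing membership.
        if addr.version == network.version and addr in network:
            return str(addr)
    raise ValueError(f"no local address found in {cidr}")


if __name__ == "__main__":
    # Hypothetical output of `hostname -I` on a machine with several NICs.
    sample = "172.17.0.1 33.18.26.153 fd00::1"
    print(pick_ip(sample, "33.18.0.0/16"))
```

The result can then be exported as VLLM_HOST_IP on each node before starting Ray.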

Solution 2: hack the Ray startup logic

import socket

import ray


def get_actual_ip():
    """Get the actual IP address of the current machine."""
    try:
        # Create a socket to connect to an external server (doesn't actually connect)
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(('8.8.8.8', 80))
        ip = s.getsockname()[0]
        s.close()
        return ip
    except Exception:
        # Fallback to hostname-based IP resolution
        return socket.gethostbyname(socket.gethostname())

def start_ray_cluster():
    free_ports = get_free_ports()
    port = free_ports[0]
    node_manager_port = free_ports[1]
    master_addr = get_master_addr()
    rank = get_rank()
    node_ip = get_actual_ip()  # Use the new function to get actual IP
    
    # Define custom resource based on node IP
    resource_spec = f'--resources=\'{{"node:{node_ip}": 1}}\''
    
    if rank == 0:
        cmd = f"ray start --head --port={port} --node-ip-address={master_addr} --node-manager-port {node_manager_port} --node-name={master_addr} {resource_spec}"
    else:
        cmd = f"ray start --address={master_addr}:{port} --node-manager-port {node_manager_port} --node-name={get_addr()} {resource_spec}"
    
    if ray.is_initialized():
        print("Ray is already initialized, skipping node level init.")
    else:
        stop_cmd = "ray stop"
        execute(stop_cmd, check=True)
        print(f"Executing Ray start command: {cmd}")
        execute(cmd, check=True)
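The hand-escaped resource_spec string above is easy to get wrong. As a sketch of an alternative (not the author's code), the same flag can be built with json.dumps and shlex.quote so the quoting is handled by the standard library. The helper name node_resource_flag is hypothetical.

```python
import json
import shlex


def node_resource_flag(node_ip: str, amount: int = 1) -> str:
    """Build the --resources flag for `ray start` with shell-safe quoting."""
    payload = json.dumps({f"node:{node_ip}": amount})
    return f"--resources={shlex.quote(payload)}"


if __name__ == "__main__":
    print(node_resource_flag("33.18.26.153"))
```

For a plain IPv4 address this should yield the same text as the f-string in start_ray_cluster, so it can be dropped in for resource_spec.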

The execute helper used in start_ray_cluster can be written like this:

import time
import subprocess

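def execute(cmd, check=False, retry=1):
    """Run a shell command and return (succeeded, output)."""
    # check is applied only after the retries are used up; passing it to
    # subprocess.run would raise on the first failure and skip the retries.
    ret = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    state = ret.returncode == 0
    msg = ret.stdout if state else ret.stderr
    if not state and retry > 1:
        print(f"execute {cmd} got error {msg}, retry...")
        time.sleep(1)
        return execute(cmd, check, retry - 1)
    if not state and check:
        raise subprocess.CalledProcessError(ret.returncode, cmd, ret.stdout, ret.stderr)
    return state, msg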

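A quick note on how Ray is typically used: you are rarely running on bare metal, since most deep learning resources are handed out by k8s together with plugins such as kubeflow or volcano. The environment variables then carry information such as the current rank and the head node's master_addr, so you can implement these helper functions (get_rank, get_master_addr and so on) to suit your setup. The tricky part, {resource_spec}, I have already taken care of for you.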
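A minimal sketch of those helpers follows. The environment variable names RANK, MASTER_ADDR and POD_IP are assumptions based on common job launchers, not something this article specifies, so check what your scheduler actually injects. get_free_ports is left out here.

```python
import os
import socket


def get_rank() -> int:
    """Rank of this node; assumes the scheduler sets RANK (0 is the head)."""
    return int(os.environ.get("RANK", "0"))


def get_master_addr() -> str:
    """Address of the head node; assumes the scheduler sets MASTER_ADDR."""
    addr = os.environ.get("MASTER_ADDR")
    if not addr:
        raise RuntimeError("MASTER_ADDR is not set")
    return addr


def get_addr() -> str:
    """Address of this node; POD_IP is an assumed variable name."""
    pod_ip = os.environ.get("POD_IP")
    if pod_ip:
        return pod_ip
    # Fallback when no variable is injected: resolve the local hostname.
    return socket.gethostbyname(socket.gethostname())
```

Failing loudly when MASTER_ADDR is missing is a deliberate choice here: a worker that silently falls back to a wrong head address would start Ray without ever joining the cluster.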

Step 2 Other small bugs

Two more errors came up along the way and took a little time to fix:

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 5, in <module>
    from vllm.scripts import main
  File "/usr/local/lib/python3.10/dist-packages/vllm/__init__.py", line 4, in <module>
    from vllm.engine.async_llm_engine import AsyncLLMEngine
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 15, in <module>
    from vllm.engine.llm_engine import (DecoderPromptComponents, LLMEngine,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 24, in <module>
    from vllm.engine.output_processor.interfaces import (
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/interfaces.py", line 6, in <module>
    from vllm.engine.output_processor.stop_checker import StopChecker
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/stop_checker.py", line 6, in <module>
    from vllm.transformers_utils.tokenizer import AnyTokenizer
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizer.py", line 13, in <module>
    from vllm.transformers_utils.tokenizers import (BaichuanTokenizer,
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizers/__init__.py", line 2, in <module>
    from vllm.transformers_utils.tokenizers.mistral import MistralTokenizer
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizers/mistral.py", line 9, in <module>
    from mistral_common.tokens.tokenizers.mistral import ChatCompletionRequest
  File "/usr/local/lib/python3.10/dist-packages/mistral_common/tokens/tokenizers/mistral.py", line 32, in <module>
    from mistral_common.tokens.tokenizers.multimodal import (
  File "/usr/local/lib/python3.10/dist-packages/mistral_common/tokens/tokenizers/multimodal.py", line 6, in <module>
    import cv2
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 175, in bootstrap
    if __load_extra_py_code_for_module("cv2", submodule, DEBUG):
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 28, in __load_extra_py_code_for_module
    py_module = importlib.import_module(module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.10/dist-packages/cv2/typing/__init__.py", line 171, in <module>
    LayerId = cv2.dnn.DictValue
AttributeError: module 'cv2.dnn' has no attribute 'DictValue'

The first is a legacy OpenCV issue; pinning the OpenCV version fixes it:

pip install opencv-python-headless==4.5.4.58

The other is a TypeError raised after the model is loaded:

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 472, in forward
[rank0]:     kv_c, k_pe = self.kv_a_proj_with_mqa(hidden_states)[0].split(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 246, in forward
[rank0]:     output = self.quant_method.apply(self, x, bias)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 357, in apply
[rank0]:     return apply_w8a8_block_fp8_linear(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 61, in apply_w8a8_block_fp8_linear
[rank0]:     output = w8a8_block_fp8_matmul(q_input,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 470, in w8a8_block_fp8_matmul
[rank0]:     configs = get_w8a8_block_fp8_configs(N, K, block_size[0], block_size[1])
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 407, in get_w8a8_block_fp8_configs
[rank0]:     device_name = current_platform.get_device_name().replace(" ", "_")
[rank0]: TypeError: a bytes-like object is required, not 'str'

Upgrading pynvml fixes it:

pip install pynvml -U
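The message in the traceback is what Python raises when str arguments are passed to bytes.replace, which suggests get_device_name() returned bytes under the older pynvml. That is my reading of the traceback, not something confirmed here. The sketch below reproduces the message and shows a defensive normalization; both function names are hypothetical.

```python
def reproduce_error() -> str:
    """Call bytes.replace with str arguments and return the error text."""
    try:
        b"NVIDIA H20".replace(" ", "_")
    except TypeError as exc:
        return str(exc)
    return ""


def normalize_device_name(name) -> str:
    """Accept bytes or str and return a str with spaces replaced."""
    if isinstance(name, bytes):
        name = name.decode("utf-8")
    return name.replace(" ", "_")


if __name__ == "__main__":
    print(reproduce_error())
    print(normalize_device_name(b"NVIDIA H20"))
```

Upgrading the package remains the simpler fix; the normalization is only useful if you cannot change the installed version.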

Step 3 Run the model

This step is actually the simplest:

vllm serve /your/path/to_checkpoint_deepseek-r1/ --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --host 0.0.0.0
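Loading a checkpoint of this size can take a while, so it may help to wait until the server answers before sending real requests. The sketch below polls the OpenAI-compatible /v1/models route using only the standard library. The function names, attempt count and delay are arbitrary choices for illustration.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts: int = 60, delay: float = 10.0,
                     sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False


# Usage against the server started above:
#   ready = wait_until_ready(lambda: server_is_up("http://localhost:8000/v1"))
```

Passing the probe in as a callable keeps the retry loop independent of the network, so it can be exercised without a running server.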

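Thanks to PP, those without IB can also try increasing the sequence length and batch size a little. Try it out with the Reasoning Output feature contributed by gaoce, either on the same machine or from another machine after changing localhost: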

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)

No, it has not hung; your wallet is just not deep enough. Switch to the server logs and you can see what is happening for this prompt:

INFO 02-02 14:18:52 metrics.py:453] Avg prompt throughput: 1.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:18:57 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:02 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:07 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:12 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:17 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:22 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:27 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
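To put a number on the wait, the generation throughput can be pulled out of these log lines and turned into a rough time estimate. The sketch below assumes the log format shown above; the 2000-token reply length is a made-up figure, not a measurement.

```python
import re

# Matches the "Avg generation throughput: 20.7 tokens/s" field in the log.
_GEN_RE = re.compile(r"Avg generation throughput: ([0-9.]+) tokens/s")


def generation_throughputs(log_text: str) -> list:
    """Extract every generation throughput value (tokens/s) from log text."""
    return [float(value) for value in _GEN_RE.findall(log_text)]


def estimated_seconds(num_tokens: int, tokens_per_s: float) -> float:
    """Rough decode time for num_tokens at a steady tokens_per_s."""
    return num_tokens / tokens_per_s


if __name__ == "__main__":
    line = ("INFO 02-02 14:18:57 metrics.py:453] Avg prompt throughput: "
            "0.0 tokens/s, Avg generation throughput: 20.7 tokens/s, "
            "Running: 1 reqs")
    rate = generation_throughputs(line)[0]
    # 2000 tokens is a hypothetical length for a reasoning-heavy reply.
    print(round(estimated_seconds(2000, rate), 1))
```

At roughly 20 tokens/s for a single request, a long chain of thought therefore takes on the order of minutes rather than seconds.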

Wait a little while and it will tell you that 9.8 is greater.

Happy tinkering, everyone, and thanks to the vLLM community for their work.

https://github.com/vllm-project/vllm/pull/12679

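凯超 is truly awesome for acting as personal tech support over the Spring Festival, haha. RL folks, whether you originally majored in the humanities or the sciences, go learn infra first.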

Reposted from NLP工作站; author: 曹宇
