
Commit 699487c

Merge branch 'PaddlePaddle:develop' into feat/blackwell-sm100-support

2 parents b918910 + e3aac0c

11 files changed: +86 −92 lines

.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
@@ -36,6 +36,8 @@ jobs:
       rm -rf ${REPO_NAME}
     fi
   '
+  git config --global user.name "FastDeployCI"
+  git config --global user.email "fastdeploy_ci@example.com"
   git clone ${REPO} ${REPO_NAME}
   cd FastDeploy
   if [ "${{ github.event_name }}" = "pull_request" ]; then

docs/get_started/installation/kunlunxin_xpu.md

Lines changed: 33 additions & 40 deletions
@@ -23,7 +23,13 @@ Verified platform:
 ## 1. Set up using Docker (Recommended)
 
 ```bash
+mkdir Work
+cd Work
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
+docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
+    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 \
+    /bin/bash
+docker exec -it fastdeploy-xpu /bin/bash
 ```
 
 ## 2. Set up using pre-built wheels
@@ -43,13 +49,13 @@ python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/
 ### Install FastDeploy (**Do NOT install via PyPI source**)
 
 ```bash
-python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 Alternatively, you can install the latest version of FastDeploy (Not recommended)
 
 ```bash
-python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
+python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 ## 3. Build wheel from source
@@ -99,55 +105,29 @@ The compiled outputs will be located in the ```FastDeploy/dist``` directory.
 
 ## Installation verification
 
-```python
-import paddle
-from paddle.jit.marker import unified
-paddle.utils.run_check()
-from fastdeploy.model_executor.ops.xpu import block_attn
+```bash
+python -c "import paddle; paddle.version.show()"
+python -c "import paddle; paddle.utils.run_check()"
+python -c "from paddle.jit.marker import unified"
+python -c "from fastdeploy.model_executor.ops.xpu import block_attn"
 ```
 
 If all the above steps execute successfully, FastDeploy is installed correctly.
 
 ## Quick start
 
-Currently, P800 has only validated deployment of the following models:
-- ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card)
-- ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)
-
-### Offline inference
-
-After installing FastDeploy, you can perform offline text generation with user-provided prompts using the following code,
-
-```python
-from fastdeploy import LLM, SamplingParams
-
-prompts = [
-    "Where is the capital of China?",
-]
-
-sampling_params = SamplingParams(top_p=0.95)
-
-llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')
-
-outputs = llm.generate(prompts, sampling_params)
-
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs.text
-
-    print(f"Prompt: {prompt}")
-    print(f"Generated text: {generated_text}")
-```
-
-Refer to [Parameters](../../parameters.md) for more configuration options.
+The P800 supports deployment of the ```ERNIE-4.5-300B-A47B-Paddle``` model in the following configurations (note: performance may vary across configurations):
+- 32K WINT4 with 8 XPUs (Recommended)
+- 128K WINT4 with 8 XPUs
+- 32K WINT4 with 4 XPUs
 
 ### Online serving (OpenAI API-Compatible server)
 
 Deploy an OpenAI API-compatible server using FastDeploy with the following commands:
 
 #### Start service
 
-**ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card) (Recommended)**
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and a 32K context length on 8 XPUs (Recommended)**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
@@ -160,7 +140,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --gpu-memory-utilization 0.9
 ```
 
-**ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)**
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and 128K context length on 8 XPUs**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
@@ -173,6 +153,20 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --gpu-memory-utilization 0.9
 ```
 
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and 32K context length on 4 XPUs**
+
+```bash
+export XPU_VISIBLE_DEVICES="0,1,2,3"
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+  --port 8188 \
+  --tensor-parallel-size 4 \
+  --max-model-len 32768 \
+  --max-num-seqs 64 \
+  --quantization "wint4" \
+  --gpu-memory-utilization 0.9
+```
+
 Refer to [Parameters](../../parameters.md) for more options.
 
 #### Send requests
@@ -207,7 +201,6 @@ print('\n')
 response = client.chat.completions.create(
     model="null",
     messages=[
-        {"role": "system", "content": "I'm a helpful AI assistant."},
         {"role": "user", "content": "Where is the capital of China?"},
     ],
     stream=True,
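For context, the `#### Send requests` hunk above edits a streaming chat example. A self-contained sketch of that flow follows, assuming the server started earlier is listening on localhost:8188; the `base_url` and placeholder model name are assumptions for illustration, not taken from this diff.

```python
# Hypothetical client sketch: stream a chat completion from a local
# FastDeploy OpenAI API-compatible server (assumed to listen on port 8188).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8188/v1", api_key="null")

response = client.chat.completions.create(
    model="null",  # placeholder; the server serves the single loaded model
    messages=[
        # As of this commit, no system message is sent; only the user turn.
        {"role": "user", "content": "Where is the capital of China?"},
    ],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta if chunk.choices else None
    if delta and delta.content:
        print(delta.content, end="", flush=True)
print()
```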

docs/get_started/installation/nvidia_gpu.md

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,9 @@ The following installation methods are available when your environment meets the
 - Linux X86_64
 
 ## 1. Pre-built Docker Installation (Recommended)
+
+**Notice**: The pre-built image only supports SM 80/90 GPUs (e.g. H800/A800). If you are deploying on SM 86/89 GPUs (L40/4090/L20), please reinstall ```fastdeploy-gpu``` after you create the container.
+
 ```shell
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
 ```
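The notice added above distinguishes GPUs by SM (compute capability) version. A quick way to check which case applies is sketched below using Paddle's CUDA utilities; the branch logic is an assumption based on the notice, not an official check.

```python
# Sketch: report the GPU's SM version so you know whether the pre-built
# image applies (SM 80/90) or fastdeploy-gpu must be reinstalled (SM 86/89).
import paddle

major, minor = paddle.device.cuda.get_device_capability()
sm = major * 10 + minor
if sm in (80, 90):
    print(f"SM{sm} (e.g. A800/H800): the pre-built image should work as-is.")
elif sm in (86, 89):
    print(f"SM{sm} (e.g. L40/4090/L20): reinstall fastdeploy-gpu inside the container.")
else:
    print(f"SM{sm}: consult the installation docs for supported architectures.")
```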

docs/quantization/online_quantization.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --max-num-seqs 32
 ```
 
-- By specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle`, the model can be automatically downloaded from AIStudio. FastDeploy depends on Paddle format models. For more information, please refer to [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md).
+- By specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle`, the model can be automatically downloaded from AIStudio. FastDeploy depends on Paddle format models. For more information, please refer to [Supported Model List](../supported_models.md).
 - By setting `--quantization` to `wint8` or `wint4`, online INT8/INT4 quantization can be selected.
 - Deploying ERNIE-4.5-300B-A47B-Paddle WINT8 requires at least 80GB * 8 cards, while WINT4 requires 80GB * 4 cards.
 - For more deployment tutorials, please refer to [get_started](../get_started/ernie-4.5.md).
@@ -48,7 +48,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --max-num-seqs 32
 ```
 
-- By specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle`, the model can be automatically downloaded from AIStudio. FastDeploy depends on Paddle format models. For more information, please refer to [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md).
+- By specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle`, the model can be automatically downloaded from AIStudio. FastDeploy depends on Paddle format models. For more information, please refer to [Supported Model List](../supported_models.md).
 - By setting `--quantization` to `block_wise_fp8`, online Block-wise FP8 quantization can be selected.
 - Deploying ERNIE-4.5-300B-A47B-Paddle Block-wise FP8 requires at least 80GB * 8 cards.
 - For more deployment tutorials, please refer to [get_started](../get_started/ernie-4.5.md)
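The card counts in the bullets above follow from rough weight-memory arithmetic: at W bits per parameter, a ~300B-parameter model needs roughly 300e9 × W / 8 bytes for weights alone. A back-of-the-envelope sketch is below; it ignores KV cache, activations, and runtime overhead, which is why the documented minimums (8 × 80GB for WINT8, 4 × 80GB for WINT4) sit above the raw weight footprint.

```python
# Rough weight-memory estimate for a ~300B-parameter model under online
# quantization; indicative only, real deployments also budget for KV cache
# and activations on top of the weights.
PARAMS = 300e9  # approximate parameter count of ERNIE-4.5-300B-A47B

for scheme, bits in (("wint8", 8), ("wint4", 4)):
    weight_gb = PARAMS * bits / 8 / 1e9
    print(f"{scheme}: ~{weight_gb:.0f} GB of weights before cache/overhead")
# wint8: ~300 GB of weights before cache/overhead
# wint4: ~150 GB of weights before cache/overhead
```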

docs/quantization/wint2.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ Example of quantization configuration in the model's config.json file:
 ```
 
 - For more deployment tutorials, please refer to [get_started](../get_started/ernie-4.5.md);
-- For more model descriptions, please refer to [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md).
+- For more model descriptions, please refer to [Supported Model List](../supported_models.md).
 
 ## WINT2 Performance
docs/zh/features/speculative_decoding.md

Lines changed: 2 additions & 2 deletions
@@ -108,13 +108,13 @@ python -m fastdeploy.entrypoints.openai.api_server \
 ```
 
 ## 🧠 Using Ngram decoding
-This algorithm uses an n-gram window to match against the prompt and already-generated tokens to produce draft tokens. It suits scenarios where input and output overlap heavily, such as code editing and document lookup. See the paper for details.
+This algorithm uses an n-gram window to match against the prompt and already-generated tokens to produce draft tokens. It suits scenarios where input and output overlap heavily, such as code continuation and document lookup.
 > Using 4×H100; quantization: WINT4
 > Config file: benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml
 ```
 python -m fastdeploy.entrypoints.openai.api_server \
   --model ${path_to_main_model} \
   --tensor-parallel-size 4 \
   --config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
-  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'
+  --speculative-config '{"method": "ngram", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'
 ```
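The corrected example now passes `"method": "ngram"`. In spirit, ngram drafting looks up the most recent occurrence of the current token suffix in the prompt plus generated tokens and proposes the tokens that followed it as the draft. A minimal sketch of that idea is below; it is an illustration of the matching scheme, not FastDeploy's actual implementation.

```python
# Minimal sketch of ngram-style drafting: find the latest earlier occurrence
# of the trailing n-gram and propose the tokens that followed it.
from typing import List

def ngram_draft(tokens: List[int], n: int = 3, k: int = 1) -> List[int]:
    """Propose up to k draft tokens by matching the trailing n-gram."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan right-to-left so the most recent match wins.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []

# Example: in [1,2,3,4,1,2,3] the tail (1,2,3) matched at position 0 was
# followed by 4, so the draft is [4] (k=1 mirrors num_speculative_tokens=1).
print(ngram_draft([1, 2, 3, 4, 1, 2, 3], n=3, k=1))  # [4]
```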

docs/zh/get_started/installation/kunlunxin_xpu.md

Lines changed: 32 additions & 43 deletions
@@ -23,7 +23,13 @@
 ## 1. Set up using Docker (Recommended)
 
 ```bash
+mkdir Work
+cd Work
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
+docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
+    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 \
+    /bin/bash
+docker exec -it fastdeploy-xpu /bin/bash
 ```
 
 ## 2. Set up using pip
@@ -43,13 +49,13 @@ python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/
 ### Install FastDeploy (**do NOT install from the PyPI source**)
 
 ```bash
-python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 Alternatively, you can install the latest version of FastDeploy (not recommended)
 
 ```bash
-python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
+python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 ## 3. Build wheel from source
@@ -101,58 +107,28 @@ bash build.sh
 ## Installation verification
 
 ```python
-import paddle
-from paddle.jit.marker import unified
-paddle.utils.run_check()
-from fastdeploy.model_executor.ops.xpu import block_attn
+python -c "import paddle; paddle.version.show()"
+python -c "import paddle; paddle.utils.run_check()"
+python -c "from paddle.jit.marker import unified"
+python -c "from fastdeploy.model_executor.ops.xpu import block_attn"
 ```
 
 If all the above steps execute successfully, FastDeploy is installed correctly.
 
 ## Quick start
 
-Currently, the P800 has only been validated for deploying the following models:
-- ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8 cards)
-- ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8 cards)
-
-### Offline inference
-
-After installing FastDeploy, you can run offline inference on user-provided prompts to generate text with the following code.
-
-```python
-from fastdeploy import LLM, SamplingParams
-
-prompts = [
-    "Where is the capital of China?",
-]
-
-# Sampling parameters
-sampling_params = SamplingParams(top_p=0.95)
-
-# Load the model
-llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')
-
-# Batched inference (the LLM queues requests internally and inserts them dynamically based on available resources)
-outputs = llm.generate(prompts, sampling_params)
-
-# Print results
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs.text
-
-    print(f"Prompt: {prompt}")
-    print(f"Generated text: {generated_text}")
-```
-
-See [Parameters](../../parameters.md) for more options.
+The P800 supports deploying the ```ERNIE-4.5-300B-A47B-Paddle``` model in the following configurations (note: quality and performance may differ across configurations):
+- 32K WINT4 on 8 cards (Recommended)
+- 128K WINT4 on 8 cards
+- 32K WINT4 on 4 cards
 
 ### OpenAI API-compatible server
 
 You can also deploy an OpenAI API-compatible server with FastDeploy using the following commands.
 
 #### Start the service
 
-**ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8 cards) (Recommended)**
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and a 32K context length on an 8-card P800 server (Recommended)**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
@@ -165,7 +141,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --gpu-memory-utilization 0.9
 ```
 
-**ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8 cards)**
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and a 128K context length on an 8-card P800 server**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
@@ -178,6 +154,20 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --gpu-memory-utilization 0.9
 ```
 
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and a 32K context length on a 4-card P800 server**
+
+```bash
+export XPU_VISIBLE_DEVICES="0,1,2,3"
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+  --port 8188 \
+  --tensor-parallel-size 4 \
+  --max-model-len 32768 \
+  --max-num-seqs 64 \
+  --quantization "wint4" \
+  --gpu-memory-utilization 0.9
+```
+
 See [Parameters](../../parameters.md) for more options.
 
 #### Send requests
@@ -212,7 +202,6 @@ print('\n')
 response = client.chat.completions.create(
     model="null",
     messages=[
-        {"role": "system", "content": "I'm a helpful AI assistant."},
         {"role": "user", "content": "Where is the capital of China?"},
     ],
     stream=True,

docs/zh/get_started/installation/nvidia_gpu.md

Lines changed: 3 additions & 0 deletions
@@ -11,6 +11,9 @@
 FastDeploy can be installed in any of the following four ways
 
 ## 1. Pre-built Docker installation (Recommended)
+
+**Notice**: The image below only supports SM 80/90 GPUs (e.g. A800/H800). If you are deploying on SM 86/89 GPUs such as L20/L40/4090, uninstall ```fastdeploy-gpu``` after creating the container and reinstall the `fastdeploy-gpu` package built for SM 86/89 as specified below.
+
 ``` shell
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
 ```

docs/zh/quantization/online_quantization.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --max-num-seqs 32
 ```
 
-- Specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle` downloads the model automatically from AIStudio. FastDeploy depends on Paddle-format models; see the [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md) for details.
+- Specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle` downloads the model automatically from AIStudio. FastDeploy depends on Paddle-format models; see the [Supported Model List](../supported_models.md) for details.
 - Set `--quantization` to `wint8` or `wint4` to select online INT8/INT4 quantization.
 - Deploying ERNIE-4.5-300B-A47B-Paddle WINT8 requires at least 80GB * 8 cards; WINT4 requires 80GB * 4 cards.
 - For more deployment tutorials, see [get_started](../get_started/ernie-4.5.md).
@@ -48,7 +48,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --max-num-seqs 32
 ```
 
-- Specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle` downloads the model automatically from AIStudio. FastDeploy depends on Paddle-format models; see the [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md) for details.
+- Specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle` downloads the model automatically from AIStudio. FastDeploy depends on Paddle-format models; see the [Supported Model List](../supported_models.md) for details.
 - Set `--quantization` to `block_wise_fp8` to select online Block-wise FP8 quantization.
 - Deploying ERNIE-4.5-300B-A47B-Paddle Block-wise FP8 requires at least 80GB * 8 cards.
 - For more deployment tutorials, see [get_started](../get_started/ernie-4.5.md)

docs/zh/quantization/wint2.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 ```
 
 - For more deployment tutorials, see [get_started](../get_started/ernie-4.5.md)
-- For more model descriptions, see the [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md)
+- For more model descriptions, see the [Supported Model List](../supported_models.md)
 
 
 ## WINT2 performance
