
Commit 699487c

Merge branch 'PaddlePaddle:develop' into feat/blackwell-sm100-support

2 parents b918910 + e3aac0c

11 files changed: +86 −92 lines

.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
@@ -36,6 +36,8 @@ jobs:
       rm -rf ${REPO_NAME}
     fi
   '
+  git config --global user.name "FastDeployCI"
+  git config --global user.email "fastdeploy_ci@example.com"
   git clone ${REPO} ${REPO_NAME}
   cd FastDeploy
   if [ "${{ github.event_name }}" = "pull_request" ]; then

docs/get_started/installation/kunlunxin_xpu.md

Lines changed: 33 additions & 40 deletions
@@ -23,7 +23,13 @@ Verified platform:
 ## 1. Set up using Docker (Recommended)
 
 ```bash
+mkdir Work
+cd Work
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
+docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
+    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 \
+    /bin/bash
+docker exec -it fastdeploy-xpu /bin/bash
 ```
 
 ## 2. Set up using pre-built wheels
@@ -43,13 +49,13 @@ python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/
 ### Install FastDeploy (**Do NOT install via PyPI source**)
 
 ```bash
-python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 Alternatively, you can install the latest version of FastDeploy (Not recommended)
 
 ```bash
-python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
+python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 ## 3. Build wheel from source
@@ -99,55 +105,29 @@ The compiled outputs will be located in the ```FastDeploy/dist``` directory.
 
 ## Installation verification
 
-```python
-import paddle
-from paddle.jit.marker import unified
-paddle.utils.run_check()
-from fastdeploy.model_executor.ops.xpu import block_attn
+```bash
+python -c "import paddle; paddle.version.show()"
+python -c "import paddle; paddle.utils.run_check()"
+python -c "from paddle.jit.marker import unified"
+python -c "from fastdeploy.model_executor.ops.xpu import block_attn"
 ```
 
 If all the above steps execute successfully, FastDeploy is installed correctly.
 
 ## Quick start
 
-Currently, P800 has only validated deployment of the following models:
-- ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card)
-- ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)
-
-### Offline inference
-
-After installing FastDeploy, you can perform offline text generation with user-provided prompts using the following code,
-
-```python
-from fastdeploy import LLM, SamplingParams
-
-prompts = [
-    "Where is the capital of China?",
-]
-
-sampling_params = SamplingParams(top_p=0.95)
-
-llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')
-
-outputs = llm.generate(prompts, sampling_params)
-
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs.text
-
-    print(f"Prompt: {prompt}")
-    print(f"Generated text: {generated_text}")
-```
-
-Refer to [Parameters](../../parameters.md) for more configuration options.
+The P800 supports deployment of the ```ERNIE-4.5-300B-A47B-Paddle``` model in the following configurations (note: performance may vary across configurations):
+- 32K WINT4 with 8 XPUs (Recommended)
+- 128K WINT4 with 8 XPUs
+- 32K WINT4 with 4 XPUs
 
 ### Online serving (OpenAI API-Compatible server)
 
 Deploy an OpenAI API-compatible server using FastDeploy with the following commands:
 
 #### Start service
 
-**ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card) (Recommended)**
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and a 32K context length on 8 XPUs (Recommended)**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
@@ -160,7 +140,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --gpu-memory-utilization 0.9
 ```
 
-**ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)**
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and 128K context length on 8 XPUs**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
@@ -173,6 +153,20 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --gpu-memory-utilization 0.9
 ```
 
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and 32K context length on 4 XPUs**
+
+```bash
+export XPU_VISIBLE_DEVICES="0,1,2,3"
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+  --port 8188 \
+  --tensor-parallel-size 4 \
+  --max-model-len 32768 \
+  --max-num-seqs 64 \
+  --quantization "wint4" \
+  --gpu-memory-utilization 0.9
+```
+
 Refer to [Parameters](../../parameters.md) for more options.
 
 #### Send requests
@@ -207,7 +201,6 @@ print('\n')
 response = client.chat.completions.create(
     model="null",
     messages=[
-        {"role": "system", "content": "I'm a helpful AI assistant."},
         {"role": "user", "content": "Where is the capital of China?"},
     ],
     stream=True,
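For context, the `#### Send requests` hunk above edits a streaming chat example. A self-contained sketch of that flow follows, assuming the server started earlier is listening on localhost:8188; the `base_url` and placeholder model name are assumptions for illustration, not taken from this diff.

```python
# Hypothetical client sketch: stream a chat completion from a local
# FastDeploy OpenAI API-compatible server (assumed to listen on port 8188).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8188/v1", api_key="null")

response = client.chat.completions.create(
    model="null",  # placeholder; the server serves the single loaded model
    messages=[
        # As of this commit, no system message is sent; only the user turn.
        {"role": "user", "content": "Where is the capital of China?"},
    ],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta if chunk.choices else None
    if delta and delta.content:
        print(delta.content, end="", flush=True)
print()
```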

docs/get_started/installation/nvidia_gpu.md

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,9 @@ The following installation methods are available when your environment meets the
 - Linux X86_64
 
 ## 1. Pre-built Docker Installation (Recommended)
+
+**Notice**: The pre-built image only supports SM 80/90 GPUs (e.g. H800/A800). If you are deploying on SM 86/89 GPUs (L40/4090/L20), please reinstall ```fastdeploy-gpu``` after you create the container.
+
 ```shell
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
 ```
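The notice added above distinguishes GPUs by SM (compute capability) version. A quick way to check which case applies is sketched below using Paddle's CUDA utilities; the branch logic is an assumption based on the notice, not an official check.

```python
# Sketch: report the GPU's SM version so you know whether the pre-built
# image applies (SM 80/90) or fastdeploy-gpu must be reinstalled (SM 86/89).
import paddle

major, minor = paddle.device.cuda.get_device_capability()
sm = major * 10 + minor
if sm in (80, 90):
    print(f"SM{sm} (e.g. A800/H800): the pre-built image should work as-is.")
elif sm in (86, 89):
    print(f"SM{sm} (e.g. L40/4090/L20): reinstall fastdeploy-gpu inside the container.")
else:
    print(f"SM{sm}: consult the installation docs for supported architectures.")
```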

docs/quantization/online_quantization.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --max-num-seqs 32
 ```
 
-- By specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle`, the model can be automatically downloaded from AIStudio. FastDeploy depends on Paddle format models. For more information, please refer to [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md).
+- By specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle`, the model can be automatically downloaded from AIStudio. FastDeploy depends on Paddle format models. For more information, please refer to [Supported Model List](../supported_models.md).
 - By setting `--quantization` to `wint8` or `wint4`, online INT8/INT4 quantization can be selected.
 - Deploying ERNIE-4.5-300B-A47B-Paddle WINT8 requires at least 80GB * 8 cards, while WINT4 requires 80GB * 4 cards.
 - For more deployment tutorials, please refer to [get_started](../get_started/ernie-4.5.md).
@@ -48,7 +48,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --max-num-seqs 32
 ```
 
-- By specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle`, the model can be automatically downloaded from AIStudio. FastDeploy depends on Paddle format models. For more information, please refer to [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md).
+- By specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle`, the model can be automatically downloaded from AIStudio. FastDeploy depends on Paddle format models. For more information, please refer to [Supported Model List](../supported_models.md).
 - By setting `--quantization` to `block_wise_fp8`, online Block-wise FP8 quantization can be selected.
 - Deploying ERNIE-4.5-300B-A47B-Paddle Block-wise FP8 requires at least 80GB * 8 cards.
 - For more deployment tutorials, please refer to [get_started](../get_started/ernie-4.5.md)
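The card counts in the bullets above follow from rough weight-memory arithmetic: at W bits per parameter, a ~300B-parameter model needs roughly 300e9 × W / 8 bytes for weights alone. A back-of-the-envelope sketch is below; it ignores KV cache, activations, and runtime overhead, which is why the documented minimums (8 × 80GB for WINT8, 4 × 80GB for WINT4) sit above the raw weight footprint.

```python
# Rough weight-memory estimate for a ~300B-parameter model under online
# quantization; indicative only, real deployments also budget for KV cache
# and activations on top of the weights.
PARAMS = 300e9  # approximate parameter count of ERNIE-4.5-300B-A47B

for scheme, bits in (("wint8", 8), ("wint4", 4)):
    weight_gb = PARAMS * bits / 8 / 1e9
    print(f"{scheme}: ~{weight_gb:.0f} GB of weights before cache/overhead")
# wint8: ~300 GB of weights before cache/overhead
# wint4: ~150 GB of weights before cache/overhead
```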

docs/quantization/wint2.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ Example of quantization configuration in the model's config.json file:
 ```
 
 - For more deployment tutorials, please refer to [get_started](../get_started/ernie-4.5.md);
-- For more model descriptions, please refer to [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md).
+- For more model descriptions, please refer to [Supported Model List](../supported_models.md).
 
 ## WINT2 Performance
docs/zh/features/speculative_decoding.md

Lines changed: 2 additions & 2 deletions
@@ -108,13 +108,13 @@ python -m fastdeploy.entrypoints.openai.api_server \
 ```
 
 ## 🧠 Using Ngram decoding
-This algorithm uses an n-gram window to match against the prompt and already-generated tokens to produce draft tokens. It suits scenarios where input and output overlap heavily, such as code editing and document lookup. See the paper for details.
+This algorithm uses an n-gram window to match against the prompt and already-generated tokens to produce draft tokens. It suits scenarios where input and output overlap heavily, such as code continuation and document lookup.
 > Using 4×H100; quantization: WINT4
 > Config file: benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml
 ```
 python -m fastdeploy.entrypoints.openai.api_server \
   --model ${path_to_main_model} \
   --tensor-parallel-size 4 \
   --config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
-  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'
+  --speculative-config '{"method": "ngram", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'
 ```
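The corrected example now passes `"method": "ngram"`. In spirit, ngram drafting looks up the most recent occurrence of the current token suffix in the prompt plus generated tokens and proposes the tokens that followed it as the draft. A minimal sketch of that idea is below; it is an illustration of the matching scheme, not FastDeploy's actual implementation.

```python
# Minimal sketch of ngram-style drafting: find the latest earlier occurrence
# of the trailing n-gram and propose the tokens that followed it.
from typing import List

def ngram_draft(tokens: List[int], n: int = 3, k: int = 1) -> List[int]:
    """Propose up to k draft tokens by matching the trailing n-gram."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan right-to-left so the most recent match wins.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []

# Example: in [1,2,3,4,1,2,3] the tail (1,2,3) matched at position 0 was
# followed by 4, so the draft is [4] (k=1 mirrors num_speculative_tokens=1).
print(ngram_draft([1, 2, 3, 4, 1, 2, 3], n=3, k=1))  # [4]
```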

docs/zh/get_started/installation/kunlunxin_xpu.md

Lines changed: 32 additions & 43 deletions
@@ -23,7 +23,13 @@
 ## 1. Set up using Docker (Recommended)
 
 ```bash
+mkdir Work
+cd Work
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
+docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
+    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 \
+    /bin/bash
+docker exec -it fastdeploy-xpu /bin/bash
 ```
 
 ## 2. Set up using pip
@@ -43,13 +49,13 @@ python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/
 ### Install FastDeploy (**do NOT install from the PyPI source**)
 
 ```bash
-python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 Alternatively, you can install the latest version of FastDeploy (not recommended)
 
 ```bash
-python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
+python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 ## 3. Build wheel from source
@@ -101,58 +107,28 @@ bash build.sh
 ## Installation verification
 
 ```python
-import paddle
-from paddle.jit.marker import unified
-paddle.utils.run_check()
-from fastdeploy.model_executor.ops.xpu import block_attn
+python -c "import paddle; paddle.version.show()"
+python -c "import paddle; paddle.utils.run_check()"
+python -c "from paddle.jit.marker import unified"
+python -c "from fastdeploy.model_executor.ops.xpu import block_attn"
 ```
 
 If all the above steps execute successfully, FastDeploy is installed correctly.
 
 ## Quick start
 
-Currently, the P800 has only been validated for deploying the following models:
-- ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8 cards)
-- ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8 cards)
-
-### Offline inference
-
-After installing FastDeploy, you can run offline inference on user-provided prompts to generate text with the following code.
-
-```python
-from fastdeploy import LLM, SamplingParams
-
-prompts = [
-    "Where is the capital of China?",
-]
-
-# Sampling parameters
-sampling_params = SamplingParams(top_p=0.95)
-
-# Load the model
-llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')
-
-# Batched inference (the LLM queues requests internally and inserts them dynamically based on available resources)
-outputs = llm.generate(prompts, sampling_params)
-
-# Print results
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs.text
-
-    print(f"Prompt: {prompt}")
-    print(f"Generated text: {generated_text}")
-```
-
-See [Parameters](../../parameters.md) for more options.
+The P800 supports deploying the ```ERNIE-4.5-300B-A47B-Paddle``` model in the following configurations (note: quality and performance may differ across configurations):
+- 32K WINT4 on 8 cards (Recommended)
+- 128K WINT4 on 8 cards
+- 32K WINT4 on 4 cards
 
 ### OpenAI API-compatible server
 
 You can also deploy an OpenAI API-compatible server with FastDeploy using the following commands.
 
 #### Start the service
 
-**ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8 cards) (Recommended)**
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and a 32K context length on an 8-card P800 server (Recommended)**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
@@ -165,7 +141,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --gpu-memory-utilization 0.9
 ```
 
-**ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8 cards)**
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and a 128K context length on an 8-card P800 server**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
@@ -178,6 +154,20 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --gpu-memory-utilization 0.9
 ```
 
+**Deploy the ERNIE-4.5-300B-A47B-Paddle model with WINT4 precision and a 32K context length on a 4-card P800 server**
+
+```bash
+export XPU_VISIBLE_DEVICES="0,1,2,3"
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+  --port 8188 \
+  --tensor-parallel-size 4 \
+  --max-model-len 32768 \
+  --max-num-seqs 64 \
+  --quantization "wint4" \
+  --gpu-memory-utilization 0.9
+```
+
 See [Parameters](../../parameters.md) for more options.
 
 #### Send requests
@@ -212,7 +202,6 @@ print('\n')
 response = client.chat.completions.create(
     model="null",
     messages=[
-        {"role": "system", "content": "I'm a helpful AI assistant."},
         {"role": "user", "content": "Where is the capital of China?"},
     ],
     stream=True,

docs/zh/get_started/installation/nvidia_gpu.md

Lines changed: 3 additions & 0 deletions
@@ -11,6 +11,9 @@
 FastDeploy can be installed in any of the following four ways
 
 ## 1. Pre-built Docker installation (Recommended)
+
+**Notice**: The image below only supports SM 80/90 GPUs (e.g. A800/H800). If you are deploying on SM 86/89 GPUs such as L20/L40/4090, uninstall ```fastdeploy-gpu``` after creating the container and reinstall the `fastdeploy-gpu` package built for SM 86/89 as specified below.
+
 ``` shell
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
 ```

docs/zh/quantization/online_quantization.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --max-num-seqs 32
 ```
 
-- Specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle` downloads the model automatically from AIStudio. FastDeploy depends on Paddle-format models; see the [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md) for details.
+- Specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle` downloads the model automatically from AIStudio. FastDeploy depends on Paddle-format models; see the [Supported Model List](../supported_models.md) for details.
 - Set `--quantization` to `wint8` or `wint4` to select online INT8/INT4 quantization.
 - Deploying ERNIE-4.5-300B-A47B-Paddle WINT8 requires at least 80GB * 8 cards; WINT4 requires 80GB * 4 cards.
 - For more deployment tutorials, see [get_started](../get_started/ernie-4.5.md).
@@ -48,7 +48,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --max-num-seqs 32
 ```
 
-- Specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle` downloads the model automatically from AIStudio. FastDeploy depends on Paddle-format models; see the [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md) for details.
+- Specifying `--model baidu/ERNIE-4.5-300B-A47B-Paddle` downloads the model automatically from AIStudio. FastDeploy depends on Paddle-format models; see the [Supported Model List](../supported_models.md) for details.
 - Set `--quantization` to `block_wise_fp8` to select online Block-wise FP8 quantization.
 - Deploying ERNIE-4.5-300B-A47B-Paddle Block-wise FP8 requires at least 80GB * 8 cards.
 - For more deployment tutorials, see [get_started](../get_started/ernie-4.5.md)

docs/zh/quantization/wint2.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 ```
 
 - For more deployment tutorials, see [get_started](../get_started/ernie-4.5.md)
-- For more model descriptions, see the [Supported Model List](https://console.cloud.baidu-int.com/devops/icode/repos/baidu/paddle_internal/FastDeploy/blob/feature%2Finference-refactor-20250528/docs/supported_models.md)
+- For more model descriptions, see the [Supported Model List](../supported_models.md)
 
 
 ## WINT2 performance
