
Commit a42ea04 ("fix docs"), 1 parent: 6c48f95

6 files changed (+110, -168 lines)

docs/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md

Lines changed: 3 additions & 28 deletions
````diff
@@ -1,4 +1,4 @@
-# Best Practice for ERNIE-4.5-0.3B
+# ERNIE-4.5-0.3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
 The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following hardware for each quantization is as follows:
@@ -16,31 +16,7 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following
 2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory.
 
 ### 1.2 Install fastdeploy and prepare the model
-- Installation: Before starting the deployment, please ensure that your hardware environment meets the following conditions:
-```
-GPU Driver >= 535
-CUDA >= 12.3
-CUDNN >= 9.5
-Linux X86_64
-Python >= 3.10
-```
-For SM 80/90 GPU(A30/A100/H100/)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For SM 86/89 GPU(A10/4090/L20/L40)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- Installation: For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
 
 - Model Download, **please note that models with the Paddle suffix need to be used for Fastdeploy**
 - Just specify the model name (e.g. `baidu/ERNIE-4.5-0.3B-Paddle`) to automatically download. The default download path is `~/` (i.e. the user's home directory). You can also modify the default download path by configuring the environment variable `FD_MODEL_CACHE`
@@ -58,9 +34,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --kv-cache-ratio 0.75 \
 --max-num-seqs 128
 ```
-- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model.
+- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8` (Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--kv-cache-ratio`: Indicates that KVCache blocks are distributed to the Prefill stage and the Decode stage according to the kv_cache_ratio ratio. Improper settings may result in insufficient KVCache blocks in a certain stage, thus affecting performance. If the service management global block is enabled, this setting is not required.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)
````
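Since the service above is an OpenAI-compatible server (`fastdeploy.entrypoints.openai.api_server`), any OpenAI-style client can talk to it once it is running. A minimal sketch of building such a request with only the standard library; the port number here is illustrative, use whatever `--port` the server was launched with:

```python
import json
from urllib import request

def build_chat_request(base_url: str, prompt: str,
                       model: str = "baidu/ERNIE-4.5-0.3B-Paddle") -> request.Request:
    """Build an OpenAI-style /v1/chat/completions request for the deployed service."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Port 8180 is illustrative; substitute the --port your server uses.
req = build_chat_request("http://localhost:8180", "Hello!")
print(req.full_url)  # http://localhost:8180/v1/chat/completions
# request.urlopen(req) would send it once the service is up.
```

This only constructs the request; sending it requires a running deployment.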

docs/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md

Lines changed: 49 additions & 28 deletions
````diff
@@ -1,4 +1,4 @@
-# Best Practice for ERNIE-4.5-21B-A3B
+# ERNIE-4.5-21B-A3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
 The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the following hardware for each quantization is as follows:
@@ -16,31 +16,7 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the followi
 2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory.
 
 ### 1.2 Install fastdeploy and prepare the model
-- Installation: Before starting the deployment, please ensure that your hardware environment meets the following conditions:
-```
-GPU Driver >= 535
-CUDA >= 12.3
-CUDNN >= 9.5
-Linux X86_64
-Python >= 3.10
-```
-For SM 80/90 GPU(A30/A100/H100/)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For SM 86/89 GPU(A10/4090/L20/L40)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- Installation: For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
 
 - Model Download, **please note that models with the Paddle suffix need to be used for Fastdeploy**
 - Just specify the model name (e.g. `baidu/ERNIE-4.5-21B-A3B-Paddle`) to automatically download. The default download path is `~/` (i.e. the user's home directory). You can also modify the default download path by configuring the environment variable `FD_MODEL_CACHE`
@@ -58,9 +34,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --kv-cache-ratio 0.75 \
 --max-num-seqs 128
 ```
-- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model.
+- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8` (Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--kv-cache-ratio`: Indicates that KVCache blocks are distributed to the Prefill stage and the Decode stage according to the kv_cache_ratio ratio. Improper settings may result in insufficient KVCache blocks in a certain stage, thus affecting performance. If the service management global block is enabled, this setting is not required.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)
@@ -126,5 +101,51 @@ Add the following environment variables before starting
 export FD_SAMPLING_CLASS=rejection
 ```
 
+#### 2.2.7 Disaggregated Deployment
+**Idea:** Deploying Prefill and Decode separately in certain scenarios can improve hardware utilization, effectively increase throughput, and reduce overall sentence latency.
+
+**How to enable:** Take the deployment of a single machine with 8 GPUs and 1P1D (4 GPUs each) as an example. Compared with the default hybrid deployment method, `--splitwise-role` is required to specify the role of the node. And the GPUs and logs of the two nodes are isolated through the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES`.
+```
+# prefill
+export CUDA_VISIBLE_DEVICES=0,1,2,3
+export INFERENCE_MSG_QUEUE_ID=1315
+export FLAGS_max_partition_size=2048
+export FD_ATTENTION_BACKEND=FLASH_ATTN
+export FD_LOG_DIR="prefill_log"
+
+quant_type=block_wise_fp8
+export FD_USE_DEEP_GEMM=0
+
+python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \
+    --max-model-len 131072 \
+    --max-num-seqs 20 \
+    --num-gpu-blocks-override 40000 \
+    --quantization ${quant_type} \
+    --gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
+    --port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
+    --cache-queue-port 7015 \
+    --splitwise-role "prefill" \
+```
+```
+# decode
+export CUDA_VISIBLE_DEVICES=4,5,6,7
+export INFERENCE_MSG_QUEUE_ID=1215
+export FLAGS_max_partition_size=2048
+export FD_LOG_DIR="decode_log"
+
+quant_type=block_wise_fp8
+export FD_USE_DEEP_GEMM=0
+
+python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \
+    --max-model-len 131072 \
+    --max-num-seqs 20 \
+    --quantization ${quant_type} \
+    --gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
+    --port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
+    --cache-queue-port 8015 \
+    --innode-prefill-ports 7013 \
+    --splitwise-role "decode"
+```
+
 ## FAQ
 If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).
````
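The 1P1D example above isolates the two roles purely through `CUDA_VISIBLE_DEVICES`, so each process only ever sees its own four GPUs. A small illustrative sketch (not part of FastDeploy) of how those two settings partition the eight devices:

```python
def visible_devices(env_value: str) -> list[int]:
    """Parse a CUDA_VISIBLE_DEVICES value into a list of GPU ids."""
    return [int(x) for x in env_value.split(",") if x.strip()]

# Values taken from the prefill and decode launch scripts above.
prefill_gpus = visible_devices("0,1,2,3")
decode_gpus = visible_devices("4,5,6,7")

# The two roles must not share GPUs for the isolation to work.
assert not set(prefill_gpus) & set(decode_gpus)
print(prefill_gpus, decode_gpus)  # [0, 1, 2, 3] [4, 5, 6, 7]
```

Each role's `--tensor-parallel-size 4` then matches the four devices that role can see.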

docs/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md

Lines changed: 3 additions & 28 deletions
````diff
@@ -1,4 +1,4 @@
-# Best Practice for ERNIE-4.5-300B-A47B
+# ERNIE-4.5-300B-A47B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
 The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the following hardware for each quantization is as follows:
@@ -13,31 +13,7 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the follo
 3. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory.
 
 ### 1.2 Install fastdeploy and prepare the model
-- Installation: Before starting the deployment, please ensure that your hardware environment meets the following conditions:
-```
-GPU Driver >= 535
-CUDA >= 12.3
-CUDNN >= 9.5
-Linux X86_64
-Python >= 3.10
-```
-For SM 80/90 GPU(A30/A100/H100/)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For SM 86/89 GPU(A10/4090/L20/L40)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- Installation: For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
 
 - Model Download, **please note that models with the Paddle suffix need to be used for Fastdeploy**
 - Just specify the model name (e.g. `baidu/ERNIE-4.5-300B-A47B-Paddle`) to automatically download. The default download path is `~/` (i.e. the user's home directory). You can also modify the default download path by configuring the environment variable `FD_MODEL_CACHE`
@@ -55,9 +31,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --kv-cache-ratio 0.75 \
 --max-num-seqs 128
 ```
-- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model.
+- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8` (Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--kv-cache-ratio`: Indicates that KVCache blocks are distributed to the Prefill stage and the Decode stage according to the kv_cache_ratio ratio. Improper settings may result in insufficient KVCache blocks in a certain stage, thus affecting performance. If the service management global block is enabled, this setting is not required.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)
````
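The `--kv-cache-ratio` behavior described in these diffs (KVCache blocks split between the Prefill and Decode stages according to the ratio) amounts to simple proportional arithmetic. A toy illustration; the helper name, the rounding choice, and the block count are made up for illustration and are not FastDeploy internals:

```python
def split_kv_blocks(total_blocks: int, kv_cache_ratio: float) -> tuple[int, int]:
    """Split KVCache blocks between the Prefill and Decode stages by kv_cache_ratio."""
    prefill_blocks = int(total_blocks * kv_cache_ratio)  # rounding choice is illustrative
    decode_blocks = total_blocks - prefill_blocks
    return prefill_blocks, decode_blocks

# e.g. 40000 total blocks at the --kv-cache-ratio 0.75 used in the launch commands:
print(split_kv_blocks(40000, 0.75))  # (30000, 10000)
```

This also shows why an extreme ratio can starve one stage of blocks, which is the performance risk the removed bullet warns about.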

docs/zh/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md

Lines changed: 3 additions & 28 deletions
````diff
@@ -1,4 +1,4 @@
-# Best Practice for ERNIE-4.5-0.3B
+# ERNIE-4.5-0.3B
 ## 1. Environmental Preparation
 ### 1.1 Supported hardware
 The minimum number of GPUs required to deploy each quantization precision of ERNIE-4.5-0.3B on the following hardware is as follows:
@@ -16,31 +16,7 @@ The minimum number of GPUs required to deploy each quantization precision of ERNIE-4.5-0.3B
 2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory
 
 ### 1.2 Install fastdeploy and prepare the model
-- Installation: Before starting the deployment, please ensure that your hardware environment meets the following conditions:
-```
-GPU Driver >= 535
-CUDA >= 12.3
-CUDNN >= 9.5
-Linux X86_64
-Python >= 3.10
-```
-For SM 80/90 GPUs (A30/A100/H100/)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For SM 86/89 GPUs (A10/4090/L20/L40)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For installation details, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- For installation, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
 
 - Model download: **please note that deploying with Fastdeploy requires models with the Paddle suffix**
 - Just specify the model name (e.g. `baidu/ERNIE-4.5-0.3B-Paddle`) to download it automatically. The default download path is `~/` (the user's home directory); you can also change the default download path via the environment variable `FD_MODEL_CACHE`
@@ -59,9 +35,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --max-num-seqs 128
 ```
 Where:
-- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies result in different model performance and accuracy.
+- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies result in different model performance and accuracy. Options include `wint8` / `wint4` / `block_wise_fp8` (requires the Hopper architecture).
 - `--max-model-len`: indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect concurrency.
-- `--kv-cache-ratio`: indicates that KVCache blocks are split between the Prefill stage and the Decode stage according to kv_cache_ratio. An improper setting may leave one stage short of KVCache blocks, affecting performance. If the service-managed global Block feature is enabled, this setting is not required.
 
 For more parameter meanings and default settings, see the [FastDeploy parameter documentation](../parameters.md)
````
