
Commit a42ea04 ("fix docs"), 1 parent: 6c48f95

6 files changed (+110, -168 lines)

docs/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md

Lines changed: 3 additions & 28 deletions
````diff
@@ -1,4 +1,4 @@
-# Best Practice for ERNIE-4.5-0.3B
+# ERNIE-4.5-0.3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
 The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following hardware for each quantization is as follows:
@@ -16,31 +16,7 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following
 2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory.
 
 ### 1.2 Install fastdeploy and prepare the model
-- Installation: Before starting the deployment, please ensure that your hardware environment meets the following conditions:
-```
-GPU Driver >= 535
-CUDA >= 12.3
-CUDNN >= 9.5
-Linux X86_64
-Python >= 3.10
-```
-For SM 80/90 GPU(A30/A100/H100/)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For SM 86/89 GPU(A10/4090/L20/L40)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- Installation: For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
 
 - Model Download, **please note that models with the Paddle suffix need to be used for Fastdeploy**
 - Just specify the model name (e.g. `baidu/ERNIE-4.5-0.3B-Paddle`) to automatically download. The default download path is `~/` (i.e. the user's home directory). You can also modify the default download path by configuring the environment variable `FD_MODEL_CACHE`
@@ -58,9 +34,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --kv-cache-ratio 0.75 \
 --max-num-seqs 128
 ```
-- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model.
+- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8` (Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--kv-cache-ratio`: Indicates that KVCache blocks are distributed to the Prefill stage and the Decode stage according to the kv_cache_ratio ratio. Improper settings may result in insufficient KVCache blocks in a certain stage, thus affecting performance. If the service management global block is enabled, this setting is not required.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)
````
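Since the service above is an OpenAI-compatible server (`fastdeploy.entrypoints.openai.api_server`), any OpenAI-style client can talk to it once it is running. A minimal sketch of building such a request with only the standard library; the port number here is illustrative, use whatever `--port` the server was launched with:

```python
import json
from urllib import request

def build_chat_request(base_url: str, prompt: str,
                       model: str = "baidu/ERNIE-4.5-0.3B-Paddle") -> request.Request:
    """Build an OpenAI-style /v1/chat/completions request for the deployed service."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Port 8180 is illustrative; substitute the --port your server uses.
req = build_chat_request("http://localhost:8180", "Hello!")
print(req.full_url)  # http://localhost:8180/v1/chat/completions
# request.urlopen(req) would send it once the service is up.
```

This only constructs the request; sending it requires a running deployment.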

docs/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md

Lines changed: 49 additions & 28 deletions
````diff
@@ -1,4 +1,4 @@
-# Best Practice for ERNIE-4.5-21B-A3B
+# ERNIE-4.5-21B-A3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
 The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the following hardware for each quantization is as follows:
@@ -16,31 +16,7 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the followi
 2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory.
 
 ### 1.2 Install fastdeploy and prepare the model
-- Installation: Before starting the deployment, please ensure that your hardware environment meets the following conditions:
-```
-GPU Driver >= 535
-CUDA >= 12.3
-CUDNN >= 9.5
-Linux X86_64
-Python >= 3.10
-```
-For SM 80/90 GPU(A30/A100/H100/)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For SM 86/89 GPU(A10/4090/L20/L40)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- Installation: For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
 
 - Model Download, **please note that models with the Paddle suffix need to be used for Fastdeploy**
 - Just specify the model name (e.g. `baidu/ERNIE-4.5-21B-A3B-Paddle`) to automatically download. The default download path is `~/` (i.e. the user's home directory). You can also modify the default download path by configuring the environment variable `FD_MODEL_CACHE`
@@ -58,9 +34,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --kv-cache-ratio 0.75 \
 --max-num-seqs 128
 ```
-- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model.
+- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8` (Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--kv-cache-ratio`: Indicates that KVCache blocks are distributed to the Prefill stage and the Decode stage according to the kv_cache_ratio ratio. Improper settings may result in insufficient KVCache blocks in a certain stage, thus affecting performance. If the service management global block is enabled, this setting is not required.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)
@@ -126,5 +101,51 @@ Add the following environment variables before starting
 export FD_SAMPLING_CLASS=rejection
 ```
 
+#### 2.2.7 Disaggregated Deployment
+**Idea:** Deploying Prefill and Decode separately in certain scenarios can improve hardware utilization, effectively increase throughput, and reduce overall sentence latency.
+
+**How to enable:** Take the deployment of a single machine with 8 GPUs and 1P1D (4 GPUs each) as an example. Compared with the default hybrid deployment method, `--splitwise-role` is required to specify the role of the node. And the GPUs and logs of the two nodes are isolated through the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES`.
+```
+# prefill
+export CUDA_VISIBLE_DEVICES=0,1,2,3
+export INFERENCE_MSG_QUEUE_ID=1315
+export FLAGS_max_partition_size=2048
+export FD_ATTENTION_BACKEND=FLASH_ATTN
+export FD_LOG_DIR="prefill_log"
+
+quant_type=block_wise_fp8
+export FD_USE_DEEP_GEMM=0
+
+python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \
+    --max-model-len 131072 \
+    --max-num-seqs 20 \
+    --num-gpu-blocks-override 40000 \
+    --quantization ${quant_type} \
+    --gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
+    --port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
+    --cache-queue-port 7015 \
+    --splitwise-role "prefill" \
+```
+```
+# decode
+export CUDA_VISIBLE_DEVICES=4,5,6,7
+export INFERENCE_MSG_QUEUE_ID=1215
+export FLAGS_max_partition_size=2048
+export FD_LOG_DIR="decode_log"
+
+quant_type=block_wise_fp8
+export FD_USE_DEEP_GEMM=0
+
+python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \
+    --max-model-len 131072 \
+    --max-num-seqs 20 \
+    --quantization ${quant_type} \
+    --gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
+    --port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
+    --cache-queue-port 8015 \
+    --innode-prefill-ports 7013 \
+    --splitwise-role "decode"
+```
+
 ## FAQ
 If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).
````
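The 1P1D example above isolates the two roles purely through `CUDA_VISIBLE_DEVICES`, so each process only ever sees its own four GPUs. A small illustrative sketch (not part of FastDeploy) of how those two settings partition the eight devices:

```python
def visible_devices(env_value: str) -> list[int]:
    """Parse a CUDA_VISIBLE_DEVICES value into a list of GPU ids."""
    return [int(x) for x in env_value.split(",") if x.strip()]

# Values taken from the prefill and decode launch scripts above.
prefill_gpus = visible_devices("0,1,2,3")
decode_gpus = visible_devices("4,5,6,7")

# The two roles must not share GPUs for the isolation to work.
assert not set(prefill_gpus) & set(decode_gpus)
print(prefill_gpus, decode_gpus)  # [0, 1, 2, 3] [4, 5, 6, 7]
```

Each role's `--tensor-parallel-size 4` then matches the four devices that role can see.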

docs/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md

Lines changed: 3 additions & 28 deletions
````diff
@@ -1,4 +1,4 @@
-# Best Practice for ERNIE-4.5-300B-A47B
+# ERNIE-4.5-300B-A47B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
 The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the following hardware for each quantization is as follows:
@@ -13,31 +13,7 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the follo
 3. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory.
 
 ### 1.2 Install fastdeploy and prepare the model
-- Installation: Before starting the deployment, please ensure that your hardware environment meets the following conditions:
-```
-GPU Driver >= 535
-CUDA >= 12.3
-CUDNN >= 9.5
-Linux X86_64
-Python >= 3.10
-```
-For SM 80/90 GPU(A30/A100/H100/)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For SM 86/89 GPU(A10/4090/L20/L40)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- Installation: For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
 
 - Model Download, **please note that models with the Paddle suffix need to be used for Fastdeploy**
 - Just specify the model name (e.g. `baidu/ERNIE-4.5-300B-A47B-Paddle`) to automatically download. The default download path is `~/` (i.e. the user's home directory). You can also modify the default download path by configuring the environment variable `FD_MODEL_CACHE`
@@ -55,9 +31,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --kv-cache-ratio 0.75 \
 --max-num-seqs 128
 ```
-- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model.
+- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8` (Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--kv-cache-ratio`: Indicates that KVCache blocks are distributed to the Prefill stage and the Decode stage according to the kv_cache_ratio ratio. Improper settings may result in insufficient KVCache blocks in a certain stage, thus affecting performance. If the service management global block is enabled, this setting is not required.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)
````
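The `--kv-cache-ratio` behavior described in these diffs (KVCache blocks split between the Prefill and Decode stages according to the ratio) amounts to simple proportional arithmetic. A toy illustration; the helper name, the rounding choice, and the block count are made up for illustration and are not FastDeploy internals:

```python
def split_kv_blocks(total_blocks: int, kv_cache_ratio: float) -> tuple[int, int]:
    """Split KVCache blocks between the Prefill and Decode stages by kv_cache_ratio."""
    prefill_blocks = int(total_blocks * kv_cache_ratio)  # rounding choice is illustrative
    decode_blocks = total_blocks - prefill_blocks
    return prefill_blocks, decode_blocks

# e.g. 40000 total blocks at the --kv-cache-ratio 0.75 used in the launch commands:
print(split_kv_blocks(40000, 0.75))  # (30000, 10000)
```

This also shows why an extreme ratio can starve one stage of blocks, which is the performance risk the removed bullet warns about.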

docs/zh/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md

Lines changed: 3 additions & 28 deletions
````diff
@@ -1,4 +1,4 @@
-# Best Practice for ERNIE-4.5-0.3B
+# ERNIE-4.5-0.3B
 ## 1. Environmental Preparation
 ### 1.1 Supported hardware
 The minimum number of GPUs required to deploy each quantization precision of ERNIE-4.5-0.3B on the following hardware is as follows:
@@ -16,31 +16,7 @@ The minimum number of GPUs required to deploy each quantization precision of ERNIE-4.5-0.3B
 2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory
 
 ### 1.2 Install fastdeploy and prepare the model
-- Installation: Before starting the deployment, please ensure that your hardware environment meets the following conditions:
-```
-GPU Driver >= 535
-CUDA >= 12.3
-CUDNN >= 9.5
-Linux X86_64
-Python >= 3.10
-```
-For SM 80/90 GPUs (A30/A100/H100/)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For SM 86/89 GPUs (A10/4090/L20/L40)
-```
-# Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-
-# Install latest Nightly build
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
-```
-For installation details, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- For installation, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
 
 - Model download: **please note that deploying with Fastdeploy requires models with the Paddle suffix**
 - Just specify the model name (e.g. `baidu/ERNIE-4.5-0.3B-Paddle`) to download it automatically. The default download path is `~/` (the user's home directory); you can also change the default download path via the environment variable `FD_MODEL_CACHE`
@@ -59,9 +35,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --max-num-seqs 128
 ```
 Where:
-- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies result in different model performance and accuracy.
+- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies result in different model performance and accuracy. Options include `wint8` / `wint4` / `block_wise_fp8` (requires the Hopper architecture).
 - `--max-model-len`: indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect concurrency.
-- `--kv-cache-ratio`: indicates that KVCache blocks are split between the Prefill stage and the Decode stage according to kv_cache_ratio. An improper setting may leave one stage short of KVCache blocks, affecting performance. If the service-managed global Block feature is enabled, this setting is not required.
 
 For more parameter meanings and default settings, see the [FastDeploy parameter documentation](../parameters.md)
````
