Commit 0747136

[doc] best practice for eb45 text models
1 parent a39a673 commit 0747136

8 files changed: +883 -0 lines changed

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# Best Practice for ERNIE-4.5-0.3B

## Environment Preparation

### 1.1 Hardware Requirements

The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` under each quantization scheme on the following hardware is listed below (`/` indicates that the quantization is not supported on that hardware):

| Device | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
|A800 80GB| 1 | 1 | / |
|H20 96GB| 1 | 1 | 1 |
|L20 48GB| 1 | 1 | 1 |
|A30 40GB| 1 | 1 | / |
|A10 24GB| 1 | 1 | / |

**Tips:**
1. To change the number of GPUs used for deployment, specify, for example, `--tensor-parallel-size 2` in the startup command.
2. For hardware not listed in the table, you can estimate whether deployment is feasible based on the available GPU memory.

### 1.2 Install FastDeploy and Prepare the Model

- Installation: Before starting the deployment, please ensure that your environment meets the following requirements:
```
GPU Driver >= 535
CUDA >= 12.3
CUDNN >= 9.5
Linux X86_64
Python >= 3.10
```
For SM 80/90 GPUs (A30/A100/H100, etc.):
```
# Install stable release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install latest Nightly build
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
For SM 86/89 GPUs (A10/4090/L20/L40):
```
# Install stable release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install latest Nightly build
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
For details, please refer to [FastDeploy Installation](../get_started/installation/README.md).

- Model download. **Please note that FastDeploy requires models with the `-Paddle` suffix.**
  - Simply specify the model name (e.g. `baidu/ERNIE-4.5-0.3B-Paddle`) and it will be downloaded automatically. The default download path is `~/` (i.e. the user's home directory). You can change the default download path by setting the environment variable `FD_MODEL_CACHE`.
  - If the automatic download is affected by network or other factors, you can also download the model from [huggingface](https://huggingface.co/), [modelscope](https://www.modelscope.cn/home), etc., and specify the local model path when starting the service (see the sketch after this list).
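
As an illustration, the weights can be fetched ahead of time with the Hugging Face CLI. This is only a sketch: it assumes `huggingface_hub` (with its CLI) is installed and that huggingface.co is reachable, and the local directory name is an arbitrary example.

```bash
# Download the Paddle-suffixed weights into a local directory (example path)
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir ./ERNIE-4.5-0.3B-Paddle

# Later, point the service at the local path instead of the model name:
# python -m fastdeploy.entrypoints.openai.api_server --model ./ERNIE-4.5-0.3B-Paddle ...
```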

## Start the Service

Start the service with the following command:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
       --model baidu/ERNIE-4.5-0.3B-Paddle \
       --tensor-parallel-size 1 \
       --quantization wint4 \
       --max-model-len 32768 \
       --kv-cache-ratio 0.75 \
       --max-num-seqs 128
```
- `--quantization`: the quantization strategy used by the model. Different quantization strategies trade off performance and accuracy differently.
- `--max-model-len`: the maximum number of tokens supported by the deployed service. A larger value allows a longer context, but occupies more GPU memory and may reduce concurrency.
- `--kv-cache-ratio`: the ratio by which KVCache blocks are split between the Prefill stage and the Decode stage. An improper setting may leave one stage with too few KVCache blocks and degrade performance. If the service-managed global block scheduler is enabled, this setting is not required.

For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
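
Once the service is up, a quick smoke test against the OpenAI-compatible chat endpoint might look like the sketch below. The port 8180 is an assumption, not a documented default here; use whatever `--port` value the server was started with.

```bash
# Assumes the server listens on port 8180; adjust to your --port setting.
curl -s http://127.0.0.1:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-0.3B-Paddle",
        "messages": [{"role": "user", "content": "Write a short poem about the sea."}]
      }'
```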

### 2.2 Advanced: How to get better performance

#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate the average input length, average output length, and maximum context length of your workload:
- Set `--max-model-len` according to the maximum context length. For example, if the average input length is 1000 tokens and the average output length is 30000 tokens, it is recommended to set it to 32768.
- **Enable the service-managed global block scheduler:**

```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```

#### 2.2.2 Prefix Caching
**Idea:** Prefix Caching avoids repeated computation by caching the intermediate results (KV Cache) of the input sequence, which speeds up responses for multiple requests that share the same prefix. For details, refer to [Prefix Caching](../features/prefix_caching.md).

**How to enable:**
Add the following flags to the startup parameters. `--enable-prefix-caching` turns on prefix caching, and `--swap-space` adds a CPU cache on top of the GPU cache; its value is in GB and should be adjusted to the memory actually available on the machine.
```
--enable-prefix-caching
--swap-space 50
```
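
For concreteness, a minimal sketch of a full launch command with prefix caching enabled is shown below; it simply appends the two flags above to the startup command from the "Start the Service" section, and the 50 GB swap space is the example value from this guide rather than a tuned recommendation.

```bash
python -m fastdeploy.entrypoints.openai.api_server \
       --model baidu/ERNIE-4.5-0.3B-Paddle \
       --tensor-parallel-size 1 \
       --quantization wint4 \
       --max-model-len 32768 \
       --kv-cache-ratio 0.75 \
       --max-num-seqs 128 \
       --enable-prefix-caching \
       --swap-space 50
```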

#### 2.2.3 Chunked Prefill
**Idea:** This strategy splits a prefill-stage request into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory footprint of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md).

**How to enable:** Add the following line to the startup parameters
```
--enable-chunked-prefill
```

#### 2.2.4 CUDAGraph
**Idea:**
CUDAGraph is a GPU computation acceleration technology provided by NVIDIA. It captures a sequence of CUDA operations into a graph structure so that the whole sequence can be launched and executed efficiently. The core idea is to encapsulate a series of GPU compute and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, lowering kernel launch latency, and improving overall performance.

**How to enable:**
Add the following line to the startup parameters
```
--use-cudagraph
```
Notes:
1. Usually no additional parameters need to be set. However, CUDAGraph introduces some extra memory overhead, which may need to be tuned in memory-constrained scenarios; for detailed parameter adjustments, see the [GraphOptimizationBackend](../parameters.md) configuration descriptions.
2. When CUDAGraph is enabled, only single-GPU inference is supported, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, `Chunked Prefill` and `Prefix Caching` cannot be enabled at the same time (see the sketch below).
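
Putting the notes above together, a compliant launch is sketched below: CUDAGraph enabled, single-GPU tensor parallelism, and neither `--enable-chunked-prefill` nor `--enable-prefix-caching` set.

```bash
python -m fastdeploy.entrypoints.openai.api_server \
       --model baidu/ERNIE-4.5-0.3B-Paddle \
       --tensor-parallel-size 1 \
       --quantization wint4 \
       --max-model-len 32768 \
       --kv-cache-ratio 0.75 \
       --max-num-seqs 128 \
       --use-cudagraph
```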

#### 2.2.5 Rejection Sampling
**Idea:**
Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding the explicit sorting step and thus increasing sampling speed; this brings a noticeable improvement for small models.

**How to enable:**
Add the following environment variable before starting the service
```
export FD_SAMPLING_CLASS=rejection
```

## FAQ
If you encounter any problems during use, you can refer to the [FAQ](./FAQ.md).

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
# Best Practice for ERNIE-4.5-21B-A3B

## Environment Preparation

### 1.1 Hardware Requirements

The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` under each quantization scheme on the following hardware is listed below (`/` indicates that the quantization is not supported on that hardware):

| Device | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
|A800 80GB| 1 | 1 | / |
|H20 96GB| 1 | 1 | 1 |
|L20 48GB| 1 | 1 | 1 |
|A30 40GB| 2 | 1 | / |
|A10 24GB| 2 | 1 | / |

**Tips:**
1. To change the number of GPUs used for deployment, specify, for example, `--tensor-parallel-size 2` in the startup command.
2. For hardware not listed in the table, you can estimate whether deployment is feasible based on the available GPU memory.

### 1.2 Install FastDeploy and Prepare the Model

- Installation: Before starting the deployment, please ensure that your environment meets the following requirements:
```
GPU Driver >= 535
CUDA >= 12.3
CUDNN >= 9.5
Linux X86_64
Python >= 3.10
```
For SM 80/90 GPUs (A30/A100/H100, etc.):
```
# Install stable release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install latest Nightly build
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
For SM 86/89 GPUs (A10/4090/L20/L40):
```
# Install stable release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install latest Nightly build
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
For details, please refer to [FastDeploy Installation](../get_started/installation/README.md).

- Model download. **Please note that FastDeploy requires models with the `-Paddle` suffix.**
  - Simply specify the model name (e.g. `baidu/ERNIE-4.5-21B-A3B-Paddle`) and it will be downloaded automatically. The default download path is `~/` (i.e. the user's home directory). You can change the default download path by setting the environment variable `FD_MODEL_CACHE`.
  - If the automatic download is affected by network or other factors, you can also download the model from [huggingface](https://huggingface.co/), [modelscope](https://www.modelscope.cn/home), etc., and specify the local model path when starting the service (see the sketch after this list).
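
The sketch below illustrates the two options above. The cache directory and the local path are arbitrary examples; the environment variable and model name come from this guide, while the `huggingface-cli` step assumes `huggingface_hub` (with its CLI) is installed.

```bash
# Option 1: keep automatic download, but redirect the cache directory (example path)
export FD_MODEL_CACHE=/data/fd_models

# Option 2: download manually and later pass the local path to --model
huggingface-cli download baidu/ERNIE-4.5-21B-A3B-Paddle --local-dir ./ERNIE-4.5-21B-A3B-Paddle
```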

## Start the Service

Start the service with the following command:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
       --model baidu/ERNIE-4.5-21B-A3B-Paddle \
       --tensor-parallel-size 1 \
       --quantization wint4 \
       --max-model-len 32768 \
       --kv-cache-ratio 0.75 \
       --max-num-seqs 128
```
- `--quantization`: the quantization strategy used by the model. Different quantization strategies trade off performance and accuracy differently.
- `--max-model-len`: the maximum number of tokens supported by the deployed service. A larger value allows a longer context, but occupies more GPU memory and may reduce concurrency.
- `--kv-cache-ratio`: the ratio by which KVCache blocks are split between the Prefill stage and the Decode stage. An improper setting may leave one stage with too few KVCache blocks and degrade performance. If the service-managed global block scheduler is enabled, this setting is not required.

For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
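
Once the service is ready, a quick check of the OpenAI-compatible chat endpoint can be sketched as below; this variant also requests a streamed response. The port 8180 is an assumption; use whatever `--port` value the server was started with.

```bash
# Assumes the server listens on port 8180; adjust to your --port setting.
curl -s http://127.0.0.1:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-21B-A3B-Paddle",
        "messages": [{"role": "user", "content": "Briefly introduce yourself."}],
        "stream": true
      }'
```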

### 2.2 Advanced: How to get better performance

#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate the average input length, average output length, and maximum context length of your workload:
- Set `--max-model-len` according to the maximum context length. For example, if the average input length is 1000 tokens and the average output length is 30000 tokens, it is recommended to set it to 32768.
- **Enable the service-managed global block scheduler:**

```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```

#### 2.2.2 Prefix Caching
**Idea:** Prefix Caching avoids repeated computation by caching the intermediate results (KV Cache) of the input sequence, which speeds up responses for multiple requests that share the same prefix. For details, refer to [Prefix Caching](../features/prefix_caching.md).

**How to enable:**
Add the following flags to the startup parameters. `--enable-prefix-caching` turns on prefix caching, and `--swap-space` adds a CPU cache on top of the GPU cache; its value is in GB and should be adjusted to the memory actually available on the machine.
```
--enable-prefix-caching
--swap-space 50
```

#### 2.2.3 Chunked Prefill
**Idea:** This strategy splits a prefill-stage request into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory footprint of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md).

**How to enable:** Add the following line to the startup parameters
```
--enable-chunked-prefill
```

#### 2.2.4 MTP (Multi-Token Prediction)
**Idea:**
By predicting multiple tokens per step, MTP reduces the number of decoding steps and significantly speeds up generation, while maintaining generation quality through certain strategies. For details, please refer to [Speculative Decoding](../features/speculative_decoding.md).

**How to enable:**
Add the following line to the startup parameters
```
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
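
As a sketch, the flag slots into the startup command as shown below. `${path_to_mtp_model}` remains a placeholder for the local path of the corresponding MTP weights (replace it before running), and the single quotes keep the JSON argument intact in the shell.

```bash
python -m fastdeploy.entrypoints.openai.api_server \
       --model baidu/ERNIE-4.5-21B-A3B-Paddle \
       --tensor-parallel-size 1 \
       --quantization wint4 \
       --max-model-len 32768 \
       --kv-cache-ratio 0.75 \
       --max-num-seqs 128 \
       --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```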

#### 2.2.5 CUDAGraph
**Idea:**
CUDAGraph is a GPU computation acceleration technology provided by NVIDIA. It captures a sequence of CUDA operations into a graph structure so that the whole sequence can be launched and executed efficiently. The core idea is to encapsulate a series of GPU compute and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, lowering kernel launch latency, and improving overall performance.

**How to enable:**
Add the following line to the startup parameters
```
--use-cudagraph
```
Notes:
1. Usually no additional parameters need to be set. However, CUDAGraph introduces some extra memory overhead, which may need to be tuned in memory-constrained scenarios; for detailed parameter adjustments, see the [GraphOptimizationBackend](../parameters.md) configuration descriptions.
2. When CUDAGraph is enabled, only single-GPU inference is supported, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, `Chunked Prefill` and `Prefix Caching` cannot be enabled at the same time.

#### 2.2.6 Rejection Sampling
**Idea:**
Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding the explicit sorting step and thus increasing sampling speed; this brings a noticeable improvement for small models.

**How to enable:**
Add the following environment variable before starting the service
```
export FD_SAMPLING_CLASS=rejection
```

## FAQ
If you encounter any problems during use, you can refer to the [FAQ](./FAQ.md).
