-For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- Installation: For details, please refer to [FastDeploy Installation](../get_started/installation/README.md).

- Model Download, **note that FastDeploy requires models with the Paddle suffix**:
  - Specify the model name (e.g. `baidu/ERNIE-4.5-0.3B-Paddle`) and the model is downloaded automatically. The default download path is `~/` (the user's home directory). To change it, set the environment variable `FD_MODEL_CACHE`.
-- `--quantization`: Indicates the quantization strategy used by the model. Different quantization strategies result in different performance and accuracy.
+- `--quantization`: Indicates the quantization strategy used by the model. Different quantization strategies result in different performance and accuracy. It can be one of `wint8`, `wint4`, or `block_wise_fp8` (requires Hopper).
- `--max-model-len`: Indicates the maximum number of tokens supported by the deployed service. A larger value allows a longer context but occupies more GPU memory, which may reduce concurrency.
-- `--kv-cache-ratio`: Splits the KVCache blocks between the Prefill stage and the Decode stage according to this ratio. An improper setting may leave one stage with too few KVCache blocks and hurt performance. This setting is not required if global block management by the service is enabled.

For more parameters and their default settings, see [FastDeploy Parameter Documentation](../parameters.md).
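As a rough illustration of how the options above fit together, the sketch below assembles a launch command for the 0.3B model and prints it instead of running it. The `fastdeploy.entrypoints.openai.api_server` entry point, the `--model` flag, the cache path, and the value `32768` are assumptions made for illustration and do not appear in this diff; check them against the parameter documentation before use.

```shell
# Assumed cache directory (hypothetical path); if unset, models download to ~/.
FD_MODEL_CACHE="${FD_MODEL_CACHE:-$HOME/fd_models}"
export FD_MODEL_CACHE

# Paddle-suffixed model name, downloaded automatically on first use.
MODEL="baidu/ERNIE-4.5-0.3B-Paddle"
# One of wint8 / wint4 / block_wise_fp8 (the last one requires Hopper).
QUANTIZATION="wint8"
# Illustrative value: larger means longer context but more GPU memory.
MAX_MODEL_LEN=32768

# Build the command as a string and print it for review rather than executing it.
# The entry point and --model flag are assumptions, not taken from this diff.
LAUNCH_CMD="python -m fastdeploy.entrypoints.openai.api_server --model $MODEL --quantization $QUANTIZATION --max-model-len $MAX_MODEL_LEN"
echo "$LAUNCH_CMD"
```

Printing the command first makes it easy to compare quantization strategies side by side before committing GPU memory to a real launch.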
-For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- Installation: For details, please refer to [FastDeploy Installation](../get_started/installation/README.md).

- Model Download, **note that FastDeploy requires models with the Paddle suffix**:
  - Specify the model name (e.g. `baidu/ERNIE-4.5-21B-A3B-Paddle`) and the model is downloaded automatically. The default download path is `~/` (the user's home directory). To change it, set the environment variable `FD_MODEL_CACHE`.
-- `--quantization`: Indicates the quantization strategy used by the model. Different quantization strategies result in different performance and accuracy.
+- `--quantization`: Indicates the quantization strategy used by the model. Different quantization strategies result in different performance and accuracy. It can be one of `wint8`, `wint4`, or `block_wise_fp8` (requires Hopper).
- `--max-model-len`: Indicates the maximum number of tokens supported by the deployed service. A larger value allows a longer context but occupies more GPU memory, which may reduce concurrency.
-- `--kv-cache-ratio`: Splits the KVCache blocks between the Prefill stage and the Decode stage according to this ratio. An improper setting may leave one stage with too few KVCache blocks and hurt performance. This setting is not required if global block management by the service is enabled.

For more parameters and their default settings, see [FastDeploy Parameter Documentation](../parameters.md).

@@ -126,5 +101,51 @@ Add the following environment variables before starting
export FD_SAMPLING_CLASS=rejection
```

+#### 2.2.7 Disaggregated Deployment
+**Idea:** In certain scenarios, deploying Prefill and Decode separately can improve hardware utilization, increase throughput, and reduce overall end-to-end latency.
+
+**How to enable:** Take a single machine with 8 GPUs running 1P1D (one Prefill node and one Decode node, 4 GPUs each) as an example. Compared with the default hybrid deployment, `--splitwise-role` is required to specify the role of each node. The GPUs and logs of the two nodes are isolated through the environment variables `CUDA_VISIBLE_DEVICES` and `FD_LOG_DIR`, respectively.
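
The 1P1D setup described above can be sketched as two launch commands that differ only in role, GPU set, log directory, and port. This is a minimal sketch that prints the commands rather than running them. The entry point, the `--model`, `--port`, and `--tensor-parallel-size` flags, the port numbers, and the log directory names are assumptions for illustration; a real deployment likely needs further flags (for example, for the two nodes to find each other) that this excerpt does not show.

```shell
MODEL="baidu/ERNIE-4.5-21B-A3B-Paddle"

# Print the environment and launch command for one node without executing it.
# $1 = role (prefill|decode), $2 = GPU list, $3 = log directory, $4 = port.
# The port values and log directory names passed below are hypothetical.
node_cmd() {
  printf 'FD_LOG_DIR=%s CUDA_VISIBLE_DEVICES=%s python -m fastdeploy.entrypoints.openai.api_server --model %s --port %s --tensor-parallel-size 4 --splitwise-role %s\n' \
    "$3" "$2" "$MODEL" "$4" "$1"
}

# Prefill node on GPUs 0-3, Decode node on GPUs 4-7, each with its own log directory.
PREFILL_CMD=$(node_cmd prefill 0,1,2,3 log_prefill 8180)
DECODE_CMD=$(node_cmd decode 4,5,6,7 log_decode 8190)

echo "$PREFILL_CMD"
echo "$DECODE_CMD"
```

Keeping `CUDA_VISIBLE_DEVICES` and `FD_LOG_DIR` on the same line as each command scopes them to that one process, so the two nodes cannot accidentally share GPUs or overwrite each other's logs.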
-For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
+- Installation: For details, please refer to [FastDeploy Installation](../get_started/installation/README.md).

- Model Download, **note that FastDeploy requires models with the Paddle suffix**:
  - Specify the model name (e.g. `baidu/ERNIE-4.5-300B-A47B-Paddle`) and the model is downloaded automatically. The default download path is `~/` (the user's home directory). To change it, set the environment variable `FD_MODEL_CACHE`.
-- `--quantization`: Indicates the quantization strategy used by the model. Different quantization strategies result in different performance and accuracy.
+- `--quantization`: Indicates the quantization strategy used by the model. Different quantization strategies result in different performance and accuracy. It can be one of `wint8`, `wint4`, or `block_wise_fp8` (requires Hopper).
- `--max-model-len`: Indicates the maximum number of tokens supported by the deployed service. A larger value allows a longer context but occupies more GPU memory, which may reduce concurrency.
-- `--kv-cache-ratio`: Splits the KVCache blocks between the Prefill stage and the Decode stage according to this ratio. An improper setting may leave one stage with too few KVCache blocks and hurt performance. This setting is not required if global block management by the service is enabled.

For more parameters and their default settings, see [FastDeploy Parameter Documentation](../parameters.md).