[doc] update docs #2690

Merged
merged 1 commit into from Jul 3, 2025
4 changes: 2 additions & 2 deletions docs/parameters.md
@@ -27,7 +27,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```kv_cache_ratio``` | `float` | KVCache blocks are divided between Prefill phase and Decode phase according to kv_cache_ratio ratio, default: 0.75 |
| ```enable_prefix_caching``` | `bool` | Whether to enable Prefix Caching, default: False |
| ```swap_space``` | `float` | When Prefix Caching is enabled, CPU memory size for KVCache swapping, unit: GB, default: None |
| ```enable_chunk_prefill``` | `bool` | Enable Chunked Prefill, default: False |
| ```enable_chunked_prefill``` | `bool` | Enable Chunked Prefill, default: False |
| ```max_num_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum concurrent number of partial prefill batches, default: 1 |
| ```max_long_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum number of long requests in concurrent partial prefill batches, default: 1 |
| ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests with token count exceeding this value are considered long requests, default: max_model_len*0.04 |
@@ -115,5 +115,5 @@ FastDeploy initialization sequence first uses `gpu_memory_utilization` parameter
...
```
- When ```use_cudagraph``` is enabled, currently only supports single-GPU inference, i.e. ```tensor_parallel_size``` set to 1.
- When ```use_cudagraph``` is enabled, cannot enable ```enable_prefix_caching``` or ```enable_chunk_prefill```.
- When ```use_cudagraph``` is enabled, cannot enable ```enable_prefix_caching``` or ```enable_chunked_prefill```.
- When ```use_cudagraph``` is enabled, batches with size ≤ ```max_capture_batch_size``` will be executed by CudaGraph, while batches larger than ```max_capture_batch_size``` will be executed by the original dynamic/static graph. To have all batch sizes executed by CudaGraph, the ```max_capture_batch_size``` value should match ```max_num_seqs```. Setting ```max_capture_batch_size``` greater than ```max_num_seqs``` causes waste by capturing batches that will never be encountered during inference, occupying extra time and memory.
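For readers landing on this PR, a minimal sketch of where the renamed parameter shows up in practice may help. The snippet below is illustrative only and is not part of this diff: the `fastdeploy.LLM` entrypoint, the model path, and the exact keyword names are assumptions that mirror the parameter table above, so verify them against your FastDeploy version.

```python
# Illustrative sketch only (not from this PR): passing the renamed flag when
# building an offline-inference engine. The import path, model path, and keyword
# names are assumptions mirroring the parameter table above.
from fastdeploy import LLM  # assumed entrypoint; check your FastDeploy version

MAX_MODEL_LEN = 8192

llm = LLM(
    model="./your-model-path",            # hypothetical local model directory
    max_model_len=MAX_MODEL_LEN,
    enable_chunked_prefill=True,          # renamed in this PR (was enable_chunk_prefill)
    max_num_partial_prefills=1,           # max concurrent partial-prefill batches
    max_long_partial_prefills=1,          # max long requests among those batches
    long_prefill_token_threshold=int(MAX_MODEL_LEN * 0.04),  # documented default
)
```

The same spelling applies wherever the flag is passed, whether in offline inference or in a service-deployment launch command.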
4 changes: 2 additions & 2 deletions docs/zh/parameters.md
@@ -26,7 +26,7 @@
| ```kv_cache_ratio``` | `float` | KVCache blocks are split between the Prefill phase and the Decode phase according to kv_cache_ratio, default 0.75 |
| ```enable_prefix_caching``` | `bool` | Whether to enable Prefix Caching, default False |
| ```swap_space``` | `float` | When Prefix Caching is enabled, CPU memory size used for swapping KVCache, unit: GB, default None |
| ```enable_chunk_prefill``` | `bool` | Enable Chunked Prefill, default False |
| ```enable_chunked_prefill``` | `bool` | Enable Chunked Prefill, default False |
| ```max_num_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum concurrency in the Prefill phase, default 1 |
| ```max_long_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum number of long requests included among concurrent partial prefills, default 1 |
| ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests whose token count exceeds this value are treated as long requests, default max_model_len*0.04 |
@@ -113,5 +113,5 @@ FastDeploy's initialization sequence first uses the `gpu_memory_utilization` parameter to compute
...
```
- When ```use_cudagraph``` is enabled, only single-GPU inference is currently supported, i.e. ```tensor_parallel_size``` must be set to 1.
- When ```use_cudagraph``` is enabled, enabling ```enable_prefix_caching``` or ```enable_chunk_prefill``` is not yet supported.
- When ```use_cudagraph``` is enabled, enabling ```enable_prefix_caching``` or ```enable_chunked_prefill``` is not yet supported.
- When ```use_cudagraph``` is enabled, batches with size ≤ ```max_capture_batch_size``` run their forward pass through CudaGraph, while batches larger than ```max_capture_batch_size``` fall back to the original dynamic/static graph. To have every batch size run through CudaGraph, the ```max_capture_batch_size``` value is recommended to match ```max_num_seqs```. Setting ```max_capture_batch_size``` larger than ```max_num_seqs``` causes waste: extra batches that will never be encountered during inference get captured, costing additional time and GPU memory.
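To make the CudaGraph restrictions listed above easier to apply, the hedged sketch below collects them into a single check. The `GraphConfig` dataclass and `check_cudagraph_config` helper are hypothetical stand-ins, not FastDeploy APIs; only the rules themselves come from the documentation text in this diff.

```python
# Illustrative only: a small sanity check for the CudaGraph constraints listed above.
# The dataclass and field names are hypothetical stand-ins for however your launch
# configuration is represented; the rules themselves come from the documentation.
from dataclasses import dataclass


@dataclass
class GraphConfig:
    use_cudagraph: bool = False
    tensor_parallel_size: int = 1
    enable_prefix_caching: bool = False
    enable_chunked_prefill: bool = False
    max_capture_batch_size: int = 64
    max_num_seqs: int = 64


def check_cudagraph_config(cfg: GraphConfig) -> None:
    if not cfg.use_cudagraph:
        return
    # CudaGraph currently supports single-GPU inference only.
    assert cfg.tensor_parallel_size == 1, "use_cudagraph requires tensor_parallel_size == 1"
    # Prefix Caching and Chunked Prefill cannot be combined with CudaGraph.
    assert not cfg.enable_prefix_caching, "disable enable_prefix_caching with use_cudagraph"
    assert not cfg.enable_chunked_prefill, "disable enable_chunked_prefill with use_cudagraph"
    # Capturing more batch sizes than max_num_seqs wastes capture time and memory;
    # matching max_num_seqs lets every batch size run through CudaGraph.
    if cfg.max_capture_batch_size > cfg.max_num_seqs:
        print("warning: max_capture_batch_size > max_num_seqs wastes time and memory")
```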