
Commit 92c2cfa

Sync v2.0 version of code to github repo
1 parent d151496 commit 92c2cfa

597 files changed: +78,819 additions, −22,948 deletions


.clang-format

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# This file is used by clang-format to autoformat Paddle source code.
#
# clang-format is part of the LLVM toolchain.
# LLVM and clang need to be installed to format the source code style.
#
# The basic usage is:
#   clang-format -i -style=file PATH/TO/SOURCE/CODE
#
# -style=file implicitly uses the ".clang-format" file located in a
# parent directory.
# -i means in-place change.
#
# The clang-format documentation:
#   http://clang.llvm.org/docs/ClangFormat.html
#   http://clang.llvm.org/docs/ClangFormatStyleOptions.html
---
Language: Cpp
BasedOnStyle: Google
IndentWidth: 4
TabWidth: 2
ContinuationIndentWidth: 4
AccessModifierOffset: -1 # private/protected/public get no extra indent inside a class
Standard: Cpp11
AllowAllParametersOfDeclarationOnNextLine: true
BinPackParameters: false
BinPackArguments: false
IncludeBlocks: Preserve
IncludeIsMainSourceRegex: (\.cu)$
...
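
For reference, the single-file usage documented in the header comment can be extended to sweep the whole tree; a rough sketch, assuming clang-format is on PATH and that the extension list below matches the repository's C++/CUDA sources:

```bash
# Format C++/CUDA sources in place, using the nearest .clang-format file
find . \( -name '*.cc' -o -name '*.h' -o -name '*.cu' \) -print0 \
  | xargs -0 clang-format -i -style=file
```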

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -121,7 +121,7 @@ dmypy.json
 FETCH_HEAD

 #log
-log/
+log*/

 checkpoints/
 checkpoints_origin/
@@ -158,3 +158,7 @@ custom_ops/gpu_ops/fp8_deep_gemm/deep_gemm/include/cute

 # buff
 custom_ops/tmp*
+
+build
+
+.ccls-cache

.pre-commit-config.yaml

Lines changed: 10 additions & 9 deletions
@@ -16,7 +16,7 @@ repos:
   rev: v0.11.7
   hooks:
   - id: ruff
-    args: [--output-format, github, --fix]
+    args: [--output-format, github, --fix, --line-length=120]
 # # Spell check
 # - repo: https://github.com/codespell-project/codespell
 #   rev: v2.4.1
@@ -29,14 +29,15 @@ repos:
   rev: 6.0.1
   hooks:
   - id: isort
-# Formatting
-- repo: https://github.com/pre-commit/mirrors-clang-format
-  rev: v20.1.3
-  hooks:
-  - id: clang-format
-    # exclude: '.*'
-    types_or: [c++, cuda]
-    args: [--style=file, --verbose]
+# # Formatting
+# - repo: https://github.com/pre-commit/mirrors-clang-format
+#   rev: v20.1.3
+#   hooks:
+#   - id: clang-format
+#     # exclude: '.*'
+#     types_or: [c++, cuda]
+#     args: [--style=file, --verbose]
+
 # markdown
 - repo: https://github.com/jackdewinter/pymarkdown
   rev: v0.9.29
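
The hooks above are driven by pre-commit; a minimal sketch of running them locally uses the standard pre-commit workflow (nothing here is specific to this repository):

```bash
# Install pre-commit and register it as a git hook
python -m pip install pre-commit
pre-commit install
# Run every configured hook (ruff, isort, pymarkdown, ...) against the full tree
pre-commit run --all-files
```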

README.md

Lines changed: 79 additions & 107 deletions
@@ -1,115 +1,87 @@
-# FastDeploy 2.0: Large Model Inference and Deployment
-
 <p align="center">
-    <a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-dfd.svg"></a>
-    <a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://img.shields.io/github/v/release/PaddlePaddle/FastDeploy?color=ffa"></a>
-    <a href=""><img src="https://img.shields.io/badge/python-3.10+-aff.svg"></a>
+    <a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://github.com/user-attachments/assets/42b0039f-39e3-4279-afda-6d1865dfbffb" width="500"></a>
+</p>
+<p align="center">
+    <a href=""><img src="https://img.shields.io/badge/python-3.10-aff.svg"></a>
     <a href=""><img src="https://img.shields.io/badge/os-linux-pink.svg"></a>
     <a href="https://github.com/PaddlePaddle/FastDeploy/graphs/contributors"><img src="https://img.shields.io/github/contributors/PaddlePaddle/FastDeploy?color=9ea"></a>
     <a href="https://github.com/PaddlePaddle/FastDeploy/commits"><img src="https://img.shields.io/github/commit-activity/m/PaddlePaddle/FastDeploy?color=3af"></a>
     <a href="https://github.com/PaddlePaddle/FastDeploy/issues"><img src="https://img.shields.io/github/issues/PaddlePaddle/FastDeploy?color=9cc"></a>
     <a href="https://github.com/PaddlePaddle/FastDeploy/stargazers"><img src="https://img.shields.io/github/stars/PaddlePaddle/FastDeploy?color=ccf"></a>
 </p>
 
-FastDeploy 2.0 supports inference for multiple large language models (currently only Qwen2 is supported; support for more models is coming). Its inference and deployment capabilities cover:
-
-- One-command service deployment of a model, with streaming generation
-- Tensor-parallel acceleration of model inference
-- PagedAttention and continuous batching (dynamic batching)
-- An OpenAI-compatible HTTP protocol
-- Weight-only int8/int4 lossless compression
-- Prometheus metrics
-
-> Note: If you are still using FastDeploy to deploy small models (e.g. CV suite models such as PaddleClas/PaddleOCR), please check out the [release/1.1.0 branch](https://github.com/PaddlePaddle/FastDeploy/tree/release/1.1.0)
-
-## Requirements
-- A800/H800/H100
-- Python>=3.10
-- CUDA>=12.3
-- CUDNN>=9.5
-- Linux X64
-
-## Installation
-
-### Docker installation (recommended)
-```
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy:2.0.0.0-alpha
-```
-
-### Install from source
-#### Install PaddlePaddle
-> Note: install a nightly build newer than 2025.05.30; see [PaddlePaddle installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html) and choose the CUDA 12.6 develop (nightly build) package.
-```
-python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu126/
-```
-
-#### Build and install FastDeploy
-
-```
-# Build
-cd FastDeploy
-bash build.sh
-# Install
-pip install dist/fastdeploy-2.0.0a0-py3-none-any.whl
-```
-
-## Quick start
-
-After installation, run the following commands to quickly deploy a Qwen2 model. See the [parameter documentation](docs/serving.md) for more options and their meanings.
-
-``` shell
-# Download and extract the Qwen model
-wget https://fastdeploy.bj.bcebos.com/llm/models/Qwen2-7B-Instruct.tar.gz && tar xvf Qwen2-7B-Instruct.tar.gz
-# Deploy on a single GPU
-python -m fastdeploy.entrypoints.openai.api_server --model ./Qwen2-7B-Instruct --port 8188 --tensor-parallel-size 1
-```
-
-Request the model service with the following command:
-``` shell
-curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-    -H "Content-Type: application/json" \
-    -d '{
-        "messages": [
-            {"role": "user", "content": "你好,你的名字是什么?"}
-        ]
-    }'
-```
-The response looks like this:
-``` json
-{
-    "id": "chatcmpl-db662f47-7c8c-4945-9a7a-db563b2ddd8d",
-    "object": "chat.completion",
-    "created": 1749451045,
-    "model": "default",
-    "choices": [
-        {
-            "index": 0,
-            "message": {
-                "role": "assistant",
-                "content": "你好!我叫通义千问。",
-                "reasoning_content": null
-            },
-            "finish_reason": "stop"
-        }
-    ],
-    "usage": {
-        "prompt_tokens": 25,
-        "total_tokens": 35,
-        "completion_tokens": 10,
-        "prompt_tokens_details": null
-    }
-}
-```
-FastDeploy provides a fully OpenAI-compatible service API (the `model` and `api_key` fields are currently not supported and are ignored if set). You can also query the service with the openai Python API.
-
-## Deployment documentation
-- [Local deployment](docs/offline_inference.md)
-- [Serving deployment](docs/serving.md)
-- [Serving metrics](docs/metrics.md)
-
-# Code guide
-- [Code directory guide](docs/code_guide.md)
-- Suggestions and issues encountered while using FastDeploy are welcome via GitHub issues.
-
-# Open-source notice
-FastDeploy is licensed under the [Apache-2.0 license](./LICENSE). To align with the [vLLM](https://github.com/vllm-project/vllm) interface, parts of the vLLM code were referenced and used directly in this project, for which we are grateful.
+<p align="center">
+    <a href="docs/get_started/installation/README.md"><b> Installation </b></a>
+    |
+    <a href="docs/get_started.md"><b> Quick Start </b></a>
+    |
+    <a href="docs/supported_models.md"><b> Supported Models </b></a>
+</p>
+
+--------------------------------------------------------------------------------
+# FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
+
+## News
+
+**[2025-06] 🔥 Released FastDeploy v2.0:** Supports inference and deployment for ERNIE 4.5. Furthermore, we open-source an industrial-grade PD disaggregation solution with context caching and dynamic role switching for effective resource utilization, further enhancing inference performance for MoE models.
+
+## About
+
+**FastDeploy** is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers **production-ready, out-of-the-box deployment solutions** with core acceleration technologies:
+
+- 🚀 **Load-Balanced PD Disaggregation**: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
+- 🔄 **Unified KV Cache Transmission**: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
+- 🤝 **OpenAI API Server and vLLM Compatible**: One-command deployment with [vLLM](https://github.com/vllm-project/vllm/) interface compatibility.
+- 🧮 **Comprehensive Quantization Format Support**: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
+- ⚡ **Advanced Acceleration Techniques**: Speculative decoding, Multi-Token Prediction (MTP), and Chunked Prefill.
+- 🖥️ **Multi-Hardware Support**: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, etc.
+
+## Requirements
+
+- OS: Linux
+- Python: 3.10 ~ 3.12
+
+## Installation
+
+FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**, **Iluvatar GPUs**, **Enflame GCUs**, and other hardware. For detailed installation instructions:
+
+- [NVIDIA GPU](./docs/installation/nvidia_cuda.md)
+- [Kunlunxin XPU](./docs/en/get_started/installation/kunlunxin_xpu.md)
+- [Iluvatar GPU](./docs/en/get_started/installation/iluvatar_gpu.md)
+- [Enflame GCU](./docs/en/get_started/installation/Enflame_gcu.md)
+
+**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU, Hygon DCU, and MetaX GPU are currently under development and testing. Stay tuned for updates!
+
+## Get Started
+
+Learn how to use FastDeploy through our documentation:
+- [10-Minute Quick Deployment](./docs/get_started/quick_start.md)
+- [ERNIE-4.5 Large Language Model Deployment](./docs/get_started/ernie-4.5.md)
+- [ERNIE-4.5-VL Multimodal Model Deployment](./docs/get_started/ernie-4.5-vl.md)
+- [Offline Inference Development](./docs/offline_inference.md)
+- [Online Service Deployment](./docs/serving/README.md)
+- [Full Supported Models List](./docs/supported_models.md)
+
+## Supported Models
+
+| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
+|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
+|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅(WINT4/W4A8C8/Expert Parallelism)| | |✅(WINT4)| WIP |128K |
+|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅(WINT4/Expert Parallelism)| | |✅(WINT4)| | 128K |
+|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | | WIP | | WIP |128K |
+|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | | | WIP | | WIP |128K |
+|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | | | | WIP | |128K |
+|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | | | | WIP | |128K |
+|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | | | | | | 128K |
+
+## Advanced Usage
+
+- [Quantization](./docs/quantization/README.md)
+- [PD Disaggregation Deployment](./docs/features/pd_disaggregation.md)
+- [Speculative Decoding](./docs/features/speculative_decoding.md)
+- [Prefix Caching](./docs/features/prefix_caching.md)
+- [Chunked Prefill](./docs/features/chunked_prefill.md)
+
+## Acknowledgement
+
+FastDeploy is licensed under the [Apache-2.0 open-source license](./LICENSE). During development, portions of [vLLM](https://github.com/vllm-project/vllm) code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.
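
As a concrete illustration of the OpenAI-compatible serving path mentioned in both the old and new README text, a minimal sketch follows; the model directory is hypothetical, and the flags mirror the removed quick-start example rather than an officially documented command:

```bash
# Start the OpenAI-compatible API server on one GPU (model path is illustrative)
python -m fastdeploy.entrypoints.openai.api_server \
    --model ./ERNIE-4.5-0.3B \
    --port 8188 \
    --tensor-parallel-size 1

# Send a chat completion request to the local endpoint
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello! Who are you?"}]}'
```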

benchmarks/README.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
### FastDeploy serving performance benchmark tool

#### Dataset:

Download the dataset locally with wget for performance testing (a sample command follows the table).

<table style="width:100%; border-collapse: collapse;">
  <thead>
    <tr>
      <th style="width:15%; text-align: left;">Dataset</th>
      <th style="width:65%; text-align: left;">Data Path</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Open-source dataset (2,000 samples)</strong></td>
      <td><code>https://fastdeploy.bj.bcebos.com/eb_query/filtered_sharedgpt_2000_input_1136_output_200_fd.json</code></td>
    </tr>
  </tbody>
</table>
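
For example, the dataset above can be fetched into the working directory before running the benchmark:

```bash
# Download the 2,000-sample open-source benchmark dataset
wget https://fastdeploy.bj.bcebos.com/eb_query/filtered_sharedgpt_2000_input_1136_output_200_fd.json
```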

#### Usage:

```
# Install dependencies
python -m pip install -r requirements.txt
```

##### Parameter description

```bash
--backend openai-chat: backend used for the benchmark; "openai-chat" sends requests to the chat/completions endpoint
--model EB45T: model name; can be any string, it only affects the name of the saved result file
--endpoint /v1/chat/completions: endpoint, used to build the request URL
--host 0.0.0.0: service IP address, used to build the request URL
--port 9812: service HTTP port, used to build the request URL
--dataset-name EBChat: dataset class; "EBChat" reads a dataset converted to the FD format
--dataset-path ./eb45t_spv4_dataserver_1w_waigua_fd: path to the benchmark dataset
--hyperparameter-path EB45T.yaml: (optional) hyperparameter file; its contents are merged into the request payload; no hyperparameters are sent by default (an illustrative file follows this block)
--percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len: set of metrics reported in the results
--metric-percentiles 80,95,99,99.9,99.95,99.99: percentiles reported for each metric
--num-prompts 1: total number of requests to send
--max-concurrency 1: benchmark concurrency
--save-result: enable result saving; results are written to a JSON file
```
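
As referenced in the `--hyperparameter-path` entry above, a hypothetical hyperparameter file might look like the following; the keys shown are common OpenAI-style sampling parameters and are purely illustrative, since the exact keys accepted depend on the serving payload schema:

```bash
# Create an illustrative hyperparameter file (hypothetical keys; adjust to the
# fields your deployment's /v1/chat/completions payload actually accepts)
cat > EB45T.yaml << 'EOF'
temperature: 0.8
top_p: 0.95
max_tokens: 1024
EOF
```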

##### Single-request debugging against the /v1/chat/completions endpoint

```
python benchmark_serving.py \
    --backend openai-chat \
    --model EB45T \
    --endpoint /v1/chat/completions \
    --host 0.0.0.0 \
    --port 9812 \
    --dataset-name EBChat \
    --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
    --hyperparameter-path yaml/request_yaml/eb45t-32k.yaml \
    --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
    --metric-percentiles 80,95,99,99.9,99.95,99.99 \
    --num-prompts 1 \
    --max-concurrency 1 \
    --save-result
```

##### Full benchmark against /v1/chat/completions: 100 concurrent requests, 2,000 prompts

```
# Save output to infer_log.txt
python benchmark_serving.py \
    --backend openai-chat \
    --model EB45T \
    --endpoint /v1/chat/completions \
    --host 0.0.0.0 \
    --port 9812 \
    --dataset-name EBChat \
    --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
    --hyperparameter-path yaml/request_yaml/eb45t-32k.yaml \
    --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
    --metric-percentiles 80,95,99,99.9,99.95,99.99 \
    --num-prompts 2000 \
    --max-concurrency 100 \
    --save-result > infer_log.txt 2>&1 &
```

##### Benchmarking the /v1/completions endpoint

Change the endpoint to /v1/completions and the backend to openai to benchmark the /v1/completions endpoint.

```
# Save output to infer_log.txt
python benchmark_serving.py \
    --backend openai \
    --model EB45T \
    --endpoint /v1/completions \
    --host 0.0.0.0 \
    --port 9812 \
    --dataset-name EBChat \
    --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
    --hyperparameter-path yaml/request_yaml/eb45t-32k.yaml \
    --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
    --metric-percentiles 80,95,99,99.9,99.95,99.99 \
    --num-prompts 2000 \
    --max-concurrency 100 \
    --save-result > infer_log.txt 2>&1 &
```
