-# FastDeploy 2.0: Large Model Inference and Deployment
-
 <p align="center">
-  <a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-dfd.svg"></a>
-  <a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://img.shields.io/github/v/release/PaddlePaddle/FastDeploy?color=ffa"></a>
-  <a href=""><img src="https://img.shields.io/badge/python-3.10+-aff.svg"></a>
+  <a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://github.com/user-attachments/assets/42b0039f-39e3-4279-afda-6d1865dfbffb" width="500"></a>
+</p>
+<p align="center">
+  <a href=""><img src="https://img.shields.io/badge/python-3.10-aff.svg"></a>
   <a href=""><img src="https://img.shields.io/badge/os-linux-pink.svg"></a>
   <a href="https://github.com/PaddlePaddle/FastDeploy/graphs/contributors"><img src="https://img.shields.io/github/contributors/PaddlePaddle/FastDeploy?color=9ea"></a>
   <a href="https://github.com/PaddlePaddle/FastDeploy/commits"><img src="https://img.shields.io/github/commit-activity/m/PaddlePaddle/FastDeploy?color=3af"></a>
   <a href="https://github.com/PaddlePaddle/FastDeploy/issues"><img src="https://img.shields.io/github/issues/PaddlePaddle/FastDeploy?color=9cc"></a>
   <a href="https://github.com/PaddlePaddle/FastDeploy/stargazers"><img src="https://img.shields.io/github/stars/PaddlePaddle/FastDeploy?color=ccf"></a>
 </p>

-FastDeploy 2.0 supports inference for a variety of large models (currently only Qwen2; support for more models is coming soon). Its inference and deployment features include:
-
-- One-command model serving, with streaming generation support
-- Tensor-parallel acceleration of model inference
-- PagedAttention and continuous batching
-- OpenAI-compatible HTTP protocol
-- Lossless weight-only int8/int4 compression
-- Prometheus metrics
-
-> Note: If you are still using FastDeploy to deploy small models (e.g. CV suite models such as PaddleClas/PaddleOCR), please check out the [release/1.1.0 branch](https://github.com/PaddlePaddle/FastDeploy/tree/release/1.1.0).
-
-## Requirements
-- A800/H800/H100
-- Python>=3.10
-- CUDA>=12.3
-- CUDNN>=9.5
-- Linux X64
-
-## Installation
-
-### Docker installation (recommended)
-```
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy:2.0.0.0-alpha
-```
-
-### Build from source
-#### Install PaddlePaddle
-> Note: install a nightly build whose code is newer than 2025.05.30. See [PaddlePaddle installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html) and select the CUDA 12.6 develop (nightly build) package.
-```
-python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu126/
-```
-
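Before compiling FastDeploy, it is worth checking that the nightly PaddlePaddle wheel actually sees your GPUs. For example, PaddlePaddle's built-in self-check can be run from Python:

``` python
# Verify the PaddlePaddle nightly install before building FastDeploy.
import paddle

print(paddle.__version__)                     # nightly/develop version string
print(paddle.device.is_compiled_with_cuda())  # expect True for the CUDA build
paddle.utils.run_check()                      # runs a small check program on the available GPUs
```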
-#### Build and install FastDeploy
-
-```
-# Build
-cd FastDeploy
-bash build.sh
-# Install
-pip install dist/fastdeploy-2.0.0a0-py3-none-any.whl
-```
-
-## Quick start
-
-After installation, run the following commands to quickly deploy the Qwen2 model. See the [parameter reference](docs/serving.md) for more options and their meanings.
-
-``` shell
-# Download and extract the Qwen model
-wget https://fastdeploy.bj.bcebos.com/llm/models/Qwen2-7B-Instruct.tar.gz && tar xvf Qwen2-7B-Instruct.tar.gz
-# Deploy on a single GPU
-python -m fastdeploy.entrypoints.openai.api_server --model ./Qwen2-7B-Instruct --port 8188 --tensor-parallel-size 1
-```
-
-Send a request to the model service with the following command:
-``` shell
-curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
--H "Content-Type: application/json" \
--d '{
-  "messages": [
-    {"role": "user", "content": "你好,你的名字是什么?"}
-  ]
-}'
-```
-The response looks like this:
-``` json
-{
-  "id": "chatcmpl-db662f47-7c8c-4945-9a7a-db563b2ddd8d",
-  "object": "chat.completion",
-  "created": 1749451045,
-  "model": "default",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "role": "assistant",
-        "content": "你好!我叫通义千问。",
-        "reasoning_content": null
-      },
-      "finish_reason": "stop"
-    }
-  ],
-  "usage": {
-    "prompt_tokens": 25,
-    "total_tokens": 35,
-    "completion_tokens": 10,
-    "prompt_tokens_details": null
-  }
-}
-```
-FastDeploy provides a fully OpenAI-compatible service API (the `model` and `api_key` fields are not yet supported and any values set for them are ignored), so you can also call the service through the openai Python API, as sketched below.
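For example, a minimal client based on the official `openai` Python package might look like the following sketch. The base URL matches the server started above; `model` and `api_key` are placeholders because the server ignores them, and `stream=True` exercises the streaming generation mentioned earlier:

``` python
# Minimal sketch: call the FastDeploy OpenAI-compatible endpoint with the openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:8188/v1",  # api_server launched above
    api_key="EMPTY",                    # ignored by the server, but required by the client
)

response = client.chat.completions.create(
    model="default",                    # ignored by the server
    messages=[{"role": "user", "content": "你好,你的名字是什么?"}],
    stream=True,                        # stream tokens as they are generated
)

for chunk in response:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```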
-
-## Deployment documentation
-- [Offline inference](docs/offline_inference.md)
-- [Service deployment](docs/serving.md)
-- [Service metrics](docs/metrics.md)
-
-# Code guide
-- [Code directory overview](docs/code_guide.md)
-- Suggestions and questions about using FastDeploy are welcome via GitHub issues.
-
-# License
-FastDeploy is licensed under the [Apache-2.0 license](./LICENSE). During development, parts of the [vLLM](https://github.com/vllm-project/vllm) code were referenced and reused to align with the vLLM interface, for which we are grateful.
+<p align="center">
+    <a href="docs/get_started/installation/README.md"><b> Installation </b></a>
+    |
+    <a href="docs/get_started.md"><b> Quick Start </b></a>
+    |
+    <a href="docs/supported_models.md"><b> Supported Models </b></a>
+</p>
+
+--------------------------------------------------------------------------------
+# FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
+
+## News
+
+**[2025-06] 🔥 Released FastDeploy v2.0:** Supports inference and deployment for ERNIE 4.5. We have also open-sourced an industrial-grade PD (prefill-decode) disaggregation solution with context caching and dynamic instance role switching, which improves resource utilization and further boosts inference performance for MoE models.
+
+## About
+
+**FastDeploy** is an inference and deployment toolkit for large language models and vision-language models based on PaddlePaddle. It delivers **production-ready, out-of-the-box deployment solutions** with core acceleration technologies:
+
+- 🚀 **Load-Balanced PD Disaggregation**: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
+- 🔄 **Unified KV Cache Transmission**: Lightweight, high-performance transport library with intelligent NVLink/RDMA selection.
+- 🤝 **OpenAI API Server and vLLM Compatible**: One-command deployment with [vLLM](https://github.com/vllm-project/vllm/) interface compatibility.
+- 🧮 **Comprehensive Quantization Format Support**: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
+- ⏩ **Advanced Acceleration Techniques**: Speculative decoding, Multi-Token Prediction (MTP), and Chunked Prefill.
+- 🖥️ **Multi-Hardware Support**: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, and more.
+
+## Requirements
+
+- OS: Linux
+- Python: 3.10 ~ 3.12
+
+## Installation
+
+FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**, **Iluvatar GPUs**, **Enflame GCUs**, and other hardware. For detailed installation instructions, see:
+
+- [NVIDIA GPU](./docs/installation/nvidia_cuda.md)
+- [Kunlunxin XPU](./docs/en/get_started/installation/kunlunxin_xpu.md)
+- [Iluvatar GPU](./docs/en/get_started/installation/iluvatar_gpu.md)
+- [Enflame GCU](./docs/en/get_started/installation/Enflame_gcu.md)
+
+**Note:** We are actively expanding hardware support. Additional platforms, including Ascend NPU, Hygon DCU, and MetaX GPU, are currently under development and testing. Stay tuned for updates!
+
+## Get Started
+
+Learn how to use FastDeploy through our documentation:
+- [10-Minute Quick Deployment](./docs/get_started/quick_start.md)
+- [ERNIE-4.5 Large Language Model Deployment](./docs/get_started/ernie-4.5.md)
+- [ERNIE-4.5-VL Multimodal Model Deployment](./docs/get_started/ernie-4.5-vl.md)
+- [Offline Inference Development](./docs/offline_inference.md) (a minimal sketch follows this list)
+- [Online Service Deployment](./docs/serving/README.md)
+- [Full List of Supported Models](./docs/supported_models.md)
+
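As a taste of offline inference, the sketch below assumes FastDeploy mirrors the vLLM-style `LLM`/`SamplingParams` interface noted above; the import path, argument names, and the local model path are assumptions, so treat [Offline Inference Development](./docs/offline_inference.md) as the authoritative reference:

``` python
# Illustrative sketch only: in-process generation, assuming a vLLM-style API surface.
from fastdeploy import LLM, SamplingParams  # assumed import path

llm = LLM(model="./ERNIE-4.5-21B-A3B", tensor_parallel_size=1)  # placeholder model path
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Introduce PaddlePaddle in one sentence."], sampling)
for output in outputs:
    print(output.outputs[0].text)  # vLLM-style result layout (assumed)
```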
+## Supported Models
+
+| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
+|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
+| ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅ (WINT4/W4A8C8/Expert Parallelism) | ✅ | ✅ | ✅ (WINT4) | WIP | 128K |
+| ERNIE-4.5-300B-A47B-Base | BF16/WINT4/WINT8 | ✅ (WINT4/Expert Parallelism) | ✅ | ✅ | ✅ (WINT4) | ❌ | 128K |
+| ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP | 128K |
+| ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP | 128K |
+| ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅ | 128K |
+| ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅ | 128K |
+| ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅ | 128K |
+
+## Advanced Usage
+
+- [Quantization](./docs/quantization/README.md)
+- [PD Disaggregation Deployment](./docs/features/pd_disaggregation.md)
+- [Speculative Decoding](./docs/features/speculative_decoding.md)
+- [Prefix Caching](./docs/features/prefix_caching.md)
+- [Chunked Prefill](./docs/features/chunked_prefill.md)
+
+## Acknowledgement
+
+FastDeploy is licensed under the [Apache-2.0 open-source license](./LICENSE). During development, portions of [vLLM](https://github.com/vllm-project/vllm) code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.