[MetaxGPU] adapt fastdeploy on metax gpu #3465

Closed · wants to merge 2 commits
6 changes: 6 additions & 0 deletions custom_ops/setup_ops.py
@@ -589,6 +589,12 @@ def find_end_files(directory, end_str):
     if not os.listdir(json_dir):
         raise ValueError("Git clone nlohmann_json failed!")
     sources = [
+        "gpu_ops/update_inputs_v1.cu",
+        "gpu_ops/save_with_output_msg.cc",
+        "gpu_ops/get_output.cc",
+        "gpu_ops/get_output_msg_with_topk.cc",
+        "gpu_ops/save_output_msg_with_topk.cc",
+        "gpu_ops/transfer_output.cc",
         "gpu_ops/save_with_output.cc",
         "gpu_ops/set_mask_value.cu",
         "gpu_ops/set_value_by_flags.cu",
83 changes: 83 additions & 0 deletions docs/get_started/installation/metax_gpu.md
@@ -0,0 +1,83 @@
# MetaX GPU Installation for Running ERNIE 4.5 Series Models

The following installation methods are available when your environment meets these requirements:
- Python >= 3.10
- Linux x86_64

Before starting, prepare a machine equipped with MetaX C550 accelerator cards that meets these requirements:

| Chip Type | Driver Version | KMD Version |
| :---: | :---: | :---: |
| MetaX C550 | 3.0.0.1 | 2.14.6 |

## 1. Pre-built Docker Installation (Recommended)

```shell
docker login --username=cr_temp_user --password=eyJpbnN0YW5jZUlkIjoiY3JpLXpxYTIzejI2YTU5M3R3M2QiLCJ0aW1lIjoiMTc1NTUxODEwODAwMCIsInR5cGUiOiJzdWIiLCJ1c2VySWQiOiIyMDcwOTQwMTA1NjYzNDE3OTIifQ:8226ca50ce5476c42062e24d3c465545de1c1780 cr.metax-tech.com && docker pull cr.metax-tech.com/public-library/maca-native:3.0.0.4-ubuntu20.04-amd64
```
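
After pulling the image, start a container from it. The invocation below is only a sketch: the device-passthrough flags and mount paths are assumptions that depend on your local MetaX driver setup and model location.

```shell
# Illustrative only: adjust device flags and volume paths to your environment.
docker run -it --net=host --ipc=host --privileged \
    -v /path/to/models:/root/model \
    cr.metax-tech.com/public-library/maca-native:3.0.0.4-ubuntu20.04-amd64 /bin/bash
```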

## 2. PaddlePaddle and Custom Device Installation

```shell
pip install paddlepaddle==3.0.0.dev20250729 -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
pip install paddle-metax-gpu==3.0.0.dev20250807 -i https://www.paddlepaddle.org.cn/packages/nightly/maca/
```
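
To confirm both wheels landed, a quick check such as the following should list them (exact versions will vary):

```shell
pip list | grep -i paddle
# Expected to include paddlepaddle and paddle-metax-gpu entries.
```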

## 3. Build Wheel from Source
Clone the source code and build:
```shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
bash build.sh
```
The built packages will be in the `FastDeploy/dist` directory.
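
Install the wheel that was just built (run from inside the `FastDeploy` directory; the exact filename depends on the version and Python ABI, hence the glob):

```shell
pip install dist/fastdeploy*.whl
```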

## 4. Environment Verification

After installation, verify the environment with this Python code:
```python
import paddle
from paddle.jit.marker import unified
# Verify GPU availability
paddle.utils.run_check()
# Verify FastDeploy custom operators compilation
from fastdeploy.model_executor.ops.gpu import beam_search_softmax
```

If the above code executes successfully, the environment is ready.

## 5. Demo

```python
from fastdeploy import LLM, SamplingParams

prompts = [
"Hello. My name is",
]

sampling_params = SamplingParams(top_p=0.95, max_tokens=32, temperature=0.6)

llm = LLM(model="/root/model/ERNIE-4.5-21B-A3B-Paddle", tensor_parallel_size=1, max_model_len=256, engine_worker_queue_port=9135, quantization='wint8', static_decode_blocks=0, gpu_memory_utilization=0.9)

outputs = llm.generate(prompts, sampling_params)

print(f"Generated {len(outputs)} outputs")
print("=" * 50 + "\n")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print(prompt)
    print(generated_text)
    print("-" * 50)
```

Output:

```
INFO 2025-08-18 10:54:18,455 416822 engine.py[line:202] Waiting worker processes ready...
Loading Weights: 100%|█████████████████████████████████████████████████████████████████████████| 100/100 [03:33<00:00, 2.14s/it]
Loading Layers: 100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:18<00:00, 5.54it/s]
INFO 2025-08-18 10:58:16,149 416822 engine.py[line:247] Worker processes are launched with 240.08204197883606 seconds.
Processed prompts: 100%|███████████████████████| 1/1 [00:21<00:00, 21.84s/it, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Generated 1 outputs
==================================================

Hello. My name is
Alice and I'm here to help you. What can I do for you today?
Hello Alice! I'm trying to organize a small party
```
82 changes: 82 additions & 0 deletions docs/zh/get_started/installation/metax_gpu.md
@@ -0,0 +1,82 @@
# Running ERNIE 4.5 Series Models with the MetaX GPU C550

FastDeploy has been deeply adapted and optimized for the ERNIE 4.5 series models on the MetaX C550, unifying the inference entry point with the GPU one so that inference workloads can be migrated without modification.

Environment requirements:
- Python >= 3.10
- Linux x86_64

| Chip Type | Driver Version | KMD Version |
| :---: | :---: | :---: |
| MetaX C550 | 3.0.0.1 | 2.14.6 |

## 1. Obtaining the Container Image

```shell
docker login --username=cr_temp_user --password=eyJpbnN0YW5jZUlkIjoiY3JpLXpxYTIzejI2YTU5M3R3M2QiLCJ0aW1lIjoiMTc1NTUxODEwODAwMCIsInR5cGUiOiJzdWIiLCJ1c2VySWQiOiIyMDcwOTQwMTA1NjYzNDE3OTIifQ:8226ca50ce5476c42062e24d3c465545de1c1780 cr.metax-tech.com && docker pull cr.metax-tech.com/public-library/maca-native:3.0.0.4-ubuntu20.04-amd64
```

## 2. Prerequisite Installation

```shell
pip install paddlepaddle==3.0.0.dev20250729 -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
pip install paddle-metax-gpu==3.0.0.dev20250807 -i https://www.paddlepaddle.org.cn/packages/nightly/maca/
```

## 3. Download and Build FastDeploy

```shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
bash build.sh
```
The built packages will be in the `FastDeploy/dist` directory.

## 4. Environment Verification

After installation, verify the environment with this Python code:
```python
import paddle
from paddle.jit.marker import unified
# Verify GPU availability
paddle.utils.run_check()
# Verify FastDeploy custom operators compilation
from fastdeploy.model_executor.ops.gpu import beam_search_softmax
```
If the above code executes successfully, the environment is ready.

## 5. Example

```python
from fastdeploy import LLM, SamplingParams

prompts = [
"Hello. My name is",
]

sampling_params = SamplingParams(top_p=0.95, max_tokens=32, temperature=0.6)

llm = LLM(model="/root/model/ERNIE-4.5-21B-A3B-Paddle", tensor_parallel_size=1, max_model_len=256, engine_worker_queue_port=9135, quantization='wint8', static_decode_blocks=0, gpu_memory_utilization=0.9)

outputs = llm.generate(prompts, sampling_params)

print(f"Generated {len(outputs)} outputs")
print("=" * 50 + "\n")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print(prompt)
    print(generated_text)
    print("-" * 50)
```

Output:

```
INFO 2025-08-18 10:54:18,455 416822 engine.py[line:202] Waiting worker processes ready...
Loading Weights: 100%|█████████████████████████████████████████████████████████████████████████| 100/100 [03:33<00:00, 2.14s/it]
Loading Layers: 100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:18<00:00, 5.54it/s]
INFO 2025-08-18 10:58:16,149 416822 engine.py[line:247] Worker processes are launched with 240.08204197883606 seconds.
Processed prompts: 100%|███████████████████████| 1/1 [00:21<00:00, 21.84s/it, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Generated 1 outputs
==================================================

Hello. My name is
Alice and I'm here to help you. What can I do for you today?
Hello Alice! I'm trying to organize a small party
```
2 changes: 1 addition & 1 deletion fastdeploy/engine/config.py
@@ -282,7 +282,7 @@ def check(self):
f"should be larger than or equal to max_num_seqs: {self.max_num_seqs}"
)
assert self.max_num_batched_tokens <= self.max_model_len * self.max_num_seqs, (
f"max_num_batched_tokens: {self.max_num_batched_tokens} should be larger"
f"max_num_batched_tokens: {self.max_num_batched_tokens} should be less"
f"than or equal to max_num_seqs: {self.max_num_seqs} * max_model_len: {self.max_model_len}"
)
assert (
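The corrected wording now matches the direction of the `<=` check. To illustrate the constraint with hypothetical values:

```python
# Hypothetical configuration values, only to illustrate the assertion above.
max_num_seqs = 8                  # maximum concurrent sequences
max_model_len = 2048              # maximum tokens per sequence
max_num_batched_tokens = 16384    # must satisfy <= 8 * 2048 = 16384

assert max_num_batched_tokens <= max_model_len * max_num_seqs
```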
@@ -256,7 +256,7 @@ def apply_rope(self, qk, cos, sin):
         )
         out = paddle.add(paddle.multiply(qk, cos), paddle.multiply(rotate_half, sin))
         return paddle.cast(out, qk.dtype)
-
+    @paddle.no_grad()
     def forward_native_backend(
         self,
         q: paddle.Tensor,
@@ -273,7 +273,7 @@ def forward_native_backend(
         # 1. Separate encoder / decoder masks
         seq_lens_encoder = forward_meta.seq_lens_encoder.squeeze(-1)
         seq_lens_decoder = forward_meta.seq_lens_decoder.squeeze(-1)
-        seq_lens_this_time = forward_meta.seq_lens_this_time.squeeze(-1)
+        seq_lens_this_time = forward_meta.seq_lens_this_time
         encoder_indices = []
         decoder_indices = []
 
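For reference, the rotate-half rotary embedding that `apply_rope` computes can be sketched standalone as below. This is a minimal sketch; the half-split layout of `rotate_half` follows the common convention and is an assumption here:

```python
import paddle

def rotate_half(x):
    # Split the last axis in two and swap halves with a sign flip: [x1, x2] -> [-x2, x1].
    x1, x2 = paddle.chunk(x, 2, axis=-1)
    return paddle.concat([-x2, x1], axis=-1)

def apply_rope_reference(qk, cos, sin):
    # out = qk * cos + rotate_half(qk) * sin, cast back to the input dtype,
    # mirroring the expression in the hunk above.
    out = paddle.add(paddle.multiply(qk, cos), paddle.multiply(rotate_half(qk), sin))
    return paddle.cast(out, qk.dtype)
```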
@@ -44,7 +44,7 @@ def __init__(self, quant_config=None):
     def process_prequanted_weights(self, layer: nn.Layer, state_dict) -> None:
         """process_prequanted_weights"""
         pass
-
+    @paddle.no_grad()
     def create_weights(self, layer: nn.Layer, state_dict):
         """
         Triton MoE create weight process.
@@ -124,12 +124,12 @@ def create_weights(self, layer: nn.Layer, state_dict):
             ),
         )
         getattr(layer, scale_name).set_value(quanted_weight_scale)
-
+    @paddle.no_grad()
     def apply(
         self,
         layer: nn.Layer,
         x: paddle.Tensor,
-        gate_out: paddle.Tensor,
+        gate: nn.Layer,
     ) -> paddle.Tensor:
         """
         Triton compute Fused MoE.
@@ -141,6 +141,7 @@ def apply(
         moe_intermediate_size = layer.moe_intermediate_size
         hidden_size = layer.hidden_size
 
+        gate_out = gate(x.cast("float32"))
         topk_ids, topk_weights = fastdeploy.model_executor.ops.gpu.moe_topk_select(
             gate_out,
             layer.gate_correction_bias,
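With this change, callers pass the gate layer itself instead of precomputed routing logits, and `apply` computes them internally in float32. A hypothetical call site (names are illustrative, not from the PR) changes roughly like this:

```python
# Before: the caller computed routing logits and passed them in.
# out = quant_method.apply(layer, x, gate_out)

# After: the gate layer is passed and applied inside apply() in float32.
# out = quant_method.apply(layer, x, gate)
```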
@@ -17,7 +17,6 @@
 import triton
 import triton.language as tl
 
-
 @triton.jit
 def fused_moe_kernel_paddle(
     a_ptr,