
Add Qwen3-30B-A3B-Thinking-2507 support on AMD GPUs. #9456


Open · wants to merge 9 commits into base: main
Changes from 1 commit
21 changes: 15 additions & 6 deletions python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py
@@ -44,7 +44,7 @@
_is_cpu = is_cpu()
_use_aiter = get_bool_env_var("SGLANG_USE_AITER") and _is_hip

-if _is_cuda:
+if _is_cuda or _is_hip:

Contributor comment (critical):

This change makes the elif _is_hip: block at line 51 unreachable. This is problematic because this block contains necessary imports for HIP, such as vllm_ops (used for gelu activation) and aiter.moe_sum (used when SGLANG_USE_AITER=1). This will lead to a NameError at runtime for certain configurations. Please ensure that all necessary modules are imported for HIP builds.

    from sgl_kernel import gelu_and_mul, silu_and_mul
elif _is_cpu and _is_cpu_amx_available:
    pass
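
One way to address the concern above is to keep the shared sgl_kernel import while still importing the HIP-only helpers. A minimal sketch, assuming the vllm_ops and aiter import paths match what the reviewer describes (the actual elif _is_hip: block further down in the file is not shown in this hunk):

# Sketch only: share the sgl_kernel activation kernels between CUDA and ROCm,
# but keep the HIP-only imports that the old elif _is_hip: branch provided.
if _is_cuda or _is_hip:
    from sgl_kernel import gelu_and_mul, silu_and_mul
elif _is_cpu and _is_cpu_amx_available:
    pass

if _is_hip:
    # Still needed on ROCm: vllm_ops for the fallback activation/reduction
    # paths, and aiter's moe_sum when SGLANG_USE_AITER=1 (assumed import paths).
    from vllm import _custom_ops as vllm_ops

    if _use_aiter:
        from aiter import moe_sum
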
@@ -1537,7 +1537,7 @@ def fused_experts_impl(
                    gemm1_alpha,
                    gemm1_limit,
                )
-            elif _is_cuda:
+            elif _is_cuda or _is_hip:

Collaborator comment:

There is another gelu_and_mul that can be imported from sgl_kernel when _is_hip == True.

                silu_and_mul(intermediate_cache1.view(-1, N), intermediate_cache2)
            else:
                vllm_ops.silu_and_mul(
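
Following up on the comment above, a sketch of how the gelu path could also use sgl_kernel on HIP. The gelu branch, the activation parameter name, and the vllm_ops argument order are not shown in this hunk, so their exact shape here is an assumption mirrored from the silu branch:

# Sketch only: route both activations through sgl_kernel on CUDA and ROCm,
# keeping vllm_ops as the fallback for other builds.
if activation == "silu":
    if _is_cuda or _is_hip:
        silu_and_mul(intermediate_cache1.view(-1, N), intermediate_cache2)
    else:
        vllm_ops.silu_and_mul(
            intermediate_cache2, intermediate_cache1.view(-1, N)
        )
elif activation == "gelu":
    if _is_cuda or _is_hip:
        gelu_and_mul(intermediate_cache1.view(-1, N), intermediate_cache2)
    else:
        vllm_ops.gelu_and_mul(
            intermediate_cache2, intermediate_cache1.view(-1, N)
        )
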
@@ -1619,10 +1619,19 @@ def fused_experts_impl(
                    out_hidden_states[begin_chunk_idx:end_chunk_idx],
                )
            else:
-                vllm_ops.moe_sum(
-                    intermediate_cache3.view(*intermediate_cache3.shape),
-                    out_hidden_states[begin_chunk_idx:end_chunk_idx],
-                )
+                # According to micro benchmark results, torch.compile can get better performance for small token.
+                if tokens_in_chunk <= 32:
+                    moe_sum_reduce_torch_compile(
+                        intermediate_cache3.view(*intermediate_cache3.shape),
+                        out_hidden_states[begin_chunk_idx:end_chunk_idx],
+                        routed_scaling_factor,
+                    )
+                else:
+                    moe_sum_reduce_triton(
+                        intermediate_cache3.view(*intermediate_cache3.shape),
+                        out_hidden_states[begin_chunk_idx:end_chunk_idx],
+                        routed_scaling_factor,
+                    )
Comment on lines +1624 to +1636
Contributor comment (medium):

This block of code for HIP is nearly identical to the CUDA logic at lines 1602-1614. The duplication could be avoided by combining the two paths, for example with elif _is_cuda or _is_hip:, which would improve maintainability. Additionally, the CUDA path includes optimizations for topk values of 1 and 2 that could also benefit the HIP path.

        else:
            vllm_ops.moe_sum(
                intermediate_cache3.view(*intermediate_cache3.shape),
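
A sketch of the merge the comment suggests, combining the CUDA and HIP reduction paths and extending the topk shortcuts to HIP. The topk == 1 and topk == 2 handling is illustrative (the CUDA branch at lines 1602-1614 is not shown in this hunk), topk_ids comes from the function signature, and the SGLANG_USE_AITER special case on HIP is omitted for brevity:

# Sketch only: one reduction path for both CUDA and ROCm.
if _is_cuda or _is_hip:
    if topk_ids.shape[1] == 1 and routed_scaling_factor == 1.0:
        # Single expert per token: summing over experts is just a copy.
        out_hidden_states[begin_chunk_idx:end_chunk_idx].copy_(
            intermediate_cache3.squeeze(dim=1)
        )
    elif topk_ids.shape[1] == 2 and routed_scaling_factor == 1.0:
        # Two experts per token: a fused add avoids the generic reduction.
        torch.add(
            intermediate_cache3[:, 0],
            intermediate_cache3[:, 1],
            out=out_hidden_states[begin_chunk_idx:end_chunk_idx],
        )
    else:
        # Same small-chunk heuristic as the code added in this PR.
        if tokens_in_chunk <= 32:
            moe_sum_reduce_torch_compile(
                intermediate_cache3.view(*intermediate_cache3.shape),
                out_hidden_states[begin_chunk_idx:end_chunk_idx],
                routed_scaling_factor,
            )
        else:
            moe_sum_reduce_triton(
                intermediate_cache3.view(*intermediate_cache3.shape),
                out_hidden_states[begin_chunk_idx:end_chunk_idx],
                routed_scaling_factor,
            )
else:
    vllm_ops.moe_sum(
        intermediate_cache3.view(*intermediate_cache3.shape),
        out_hidden_states[begin_chunk_idx:end_chunk_idx],
    )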