Add Qwen3-30B-A3B-Thinking-2507 support on AMD GPUs. #9456


Open
wants to merge 9 commits into main

Conversation

@sogalin (Contributor) commented on Aug 21, 2025

Motivation

Add Qwen3-30B-A3B-Thinking-2507 model support on AMD GPUs.

Modifications

Enable the Triton backend on AMD GPUs.

Accuracy Tests

SGLANG_USE_AITER=0 python3 -m sglang.launch_server --model-path Qwen3-30B-A3B-Thinking-2507/ --tp 8 --trust-remote-code --chunked-prefill-size 130172 --max-running-requests 128 --mem-fraction-static 0.85 --attention-backend aiter --enable-torch-compile

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000
Accuracy: 0.858
Invalid: 0.000
Latency: 103.769 s
Output throughput: 2743.774 token/s

Benchmarking and Profiling

Checklist

@gemini-code-assist bot left a comment


Summary of Changes

Hello @sogalin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the Qwen3-30B-A3B-Thinking-2507 model on AMD GPUs. This is achieved by extending the Triton backend to recognize and utilize HIP (ROCm) environments, specifically adapting the Mixture-of-Experts (MoE) layer operations to function correctly on AMD hardware. Additionally, performance optimizations for MoE sum reduction are implemented, dynamically choosing between torch.compile and Triton-based methods based on token chunk size.

Highlights

  • AMD GPU Support: Adds compatibility for the Qwen3-30B-A3B-Thinking-2507 model to run on AMD GPUs.
  • Triton Backend Extension: Extends the Triton backend to support HIP (ROCm) environments, enabling GPU-specific kernel operations for AMD hardware.
  • MoE Performance Optimization: Implements a conditional strategy for Mixture-of-Experts (MoE) sum reduction, using torch.compile for small token chunks (<= 32 tokens) and a Triton-based method for larger chunks to improve performance; a minimal sketch of this dispatch follows this list.
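
The dispatch in the last highlight is small enough to sketch in full. The helper names below and the eager stand-in for the Triton kernel are illustrative placeholders, not the PR's actual helpers, which live in sglang's fused MoE code:

import torch

# Illustrative stand-ins for the two reduction paths referenced in this PR.
@torch.compile(dynamic=True)
def _sum_reduce_compiled(x: torch.Tensor, out: torch.Tensor, scale: float) -> None:
    # x: (num_tokens, topk, hidden_dim); sum over the routed-expert dim, then scale.
    torch.sum(x, dim=1, out=out)
    out.mul_(scale)

def _sum_reduce_eager(x: torch.Tensor, out: torch.Tensor, scale: float) -> None:
    # Placeholder for the Triton kernel used for larger chunks.
    torch.sum(x, dim=1, out=out)
    out.mul_(scale)

def moe_sum_reduce(x: torch.Tensor, out: torch.Tensor, scale: float) -> None:
    # Small chunks favor the torch.compile path per the PR's micro benchmarks;
    # larger chunks go to the (here stubbed) Triton kernel.
    if x.shape[0] <= 32:
        _sum_reduce_compiled(x, out, scale)
    else:
        _sum_reduce_eager(x, out, scale)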


@gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Qwen3-30B-A3B-Thinking-2507 model on AMD GPUs by enabling Triton kernels for HIP. While the changes are generally in the right direction, there is a critical issue with module imports for HIP builds that could lead to runtime errors under certain configurations. Additionally, there is an opportunity to refactor some duplicated code to improve maintainability and potentially extend performance optimizations to the HIP path.

@@ -44,7 +44,7 @@
 _is_cpu = is_cpu()
 _use_aiter = get_bool_env_var("SGLANG_USE_AITER") and _is_hip

-if _is_cuda:
+if _is_cuda or _is_hip:
Contributor

critical

This change makes the elif _is_hip: block at line 51 unreachable. This is problematic because this block contains necessary imports for HIP, such as vllm_ops (used for gelu activation) and aiter.moe_sum (used when SGLANG_USE_AITER=1). This will lead to a NameError at runtime for certain configurations. Please ensure that all necessary modules are imported for HIP builds.
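
One possible restructuring, sketched under the assumption that the import forms mirror the existing elif _is_hip: block (illustrative only, not the PR's code):

# Keep the shared sgl_kernel imports on the combined branch, and pull in the
# HIP-only modules in a separate guard so they remain reachable.
if _is_cuda or _is_hip:
    from sgl_kernel import gelu_and_mul, silu_and_mul

if _is_hip:
    from vllm import _custom_ops as vllm_ops  # gelu_and_mul / silu_and_mul fallback
    if _use_aiter:
        import aiter  # aiter.moe_sum is used when SGLANG_USE_AITER=1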

Comment on lines +1622 to +1634
# According to micro benchmark results, torch.compile can get better performance for small token.
if tokens_in_chunk <= 32:
    moe_sum_reduce_torch_compile(
        intermediate_cache3.view(*intermediate_cache3.shape),
        out_hidden_states[begin_chunk_idx:end_chunk_idx],
        routed_scaling_factor,
    )
else:
    moe_sum_reduce_triton(
        intermediate_cache3.view(*intermediate_cache3.shape),
        out_hidden_states[begin_chunk_idx:end_chunk_idx],
        routed_scaling_factor,
    )
Contributor

medium

This block of code for HIP is nearly identical to the logic for CUDA in lines 1602-1614. This code duplication can be avoided by refactoring and combining the logic for both _is_cuda and _is_hip, for example by using elif _is_cuda or _is_hip:. This would improve maintainability. Additionally, the CUDA path includes optimizations for topk values of 1 and 2, which could also be beneficial for the HIP path.
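
A hedged sketch of what a shared helper for that combined branch could look like. It assumes moe_sum_reduce_torch_compile and moe_sum_reduce_triton (from the diff above) are in scope; the topk == 1 and topk == 2 shortcuts only mimic the intent of the CUDA fast paths, which may use dedicated kernels instead:

import torch

def combined_moe_sum_reduce(
    intermediate_cache3: torch.Tensor,   # (tokens_in_chunk, topk, hidden_dim)
    out_chunk: torch.Tensor,             # out_hidden_states[begin_chunk_idx:end_chunk_idx]
    routed_scaling_factor: float,
) -> None:
    # Single reduction path that both the _is_cuda and _is_hip branches could call.
    tokens_in_chunk, topk, _ = intermediate_cache3.shape
    if topk == 1:
        # One routed expert: the "sum" is just a scaled copy.
        out_chunk.copy_(intermediate_cache3[:, 0]).mul_(routed_scaling_factor)
    elif topk == 2:
        # Two routed experts: add the pair, then scale in place.
        torch.add(intermediate_cache3[:, 0], intermediate_cache3[:, 1], out=out_chunk)
        out_chunk.mul_(routed_scaling_factor)
    elif tokens_in_chunk <= 32:
        # Small chunks: torch.compile path, as in the diff above.
        moe_sum_reduce_torch_compile(intermediate_cache3, out_chunk, routed_scaling_factor)
    else:
        # Larger chunks: Triton kernel path.
        moe_sum_reduce_triton(intermediate_cache3, out_chunk, routed_scaling_factor)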

@@ -49,6 +49,7 @@
 elif _is_cpu and _is_cpu_amx_available:
     pass
 elif _is_hip:
+    from sgl_kernel import gelu_and_mul, silu_and_mul
     from vllm import _custom_ops as vllm_ops  # gelu_and_mul, silu_and_mul
Collaborator

Shall we remove from vllm import _custom_ops as vllm_ops for moe_sum, or only add it when _is_hip == True and _is_aiter == False? I think MI250X does not have full aiter support.
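
A sketch of that suggestion, assuming the module-level flag is the _use_aiter variable from the earlier hunk (what the comment calls _is_aiter); a fragment of the same import block, illustrative only:

elif _is_hip:
    from sgl_kernel import gelu_and_mul, silu_and_mul
    if not _use_aiter:
        # e.g. MI250X, where aiter support is incomplete: fall back to vllm's ops.
        from vllm import _custom_ops as vllm_ops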

@@ -1537,7 +1538,7 @@ def fused_experts_impl(
                 gemm1_alpha,
                 gemm1_limit,
             )
-        elif _is_cuda:
+        elif _is_cuda or _is_hip:
Collaborator

There is another gelu_and_mul that can be imported from sgl_kernel when _is_hip == True.

@hubertlu-tw hubertlu-tw marked this pull request as draft August 21, 2025 23:33
@hubertlu-tw hubertlu-tw marked this pull request as ready for review August 21, 2025 23:44