Add Qwen3-30B-A3B-Thinking-2507 support on AMD GPUs. #9456
base: main
Conversation
Summary of Changes
Hello @sogalin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces support for the Qwen3-30B-A3B-Thinking-2507 model on AMD GPUs. This is achieved by extending the Triton backend to recognize and utilize HIP (ROCm) environments, specifically adapting the Mixture-of-Experts (MoE) layer operations to function correctly on AMD hardware. Additionally, performance optimizations for MoE sum reduction are implemented, dynamically choosing between torch.compile and Triton-based methods based on token chunk size.
Highlights
- AMD GPU Support: Adds compatibility for the Qwen3-30B-A3B-Thinking-2507 model to run on AMD GPUs.
- Triton Backend Extension: Extends the Triton backend to support HIP (ROCm) environments, enabling GPU-specific kernel operations for AMD hardware.
- MoE Performance Optimization: Implements a conditional strategy for Mixture-of-Experts (MoE) sum reduction, utilizing torch.compile for small token chunks (<= 32 tokens) and a Triton-based method for larger chunks to improve performance.
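The HIP-awareness mentioned in these highlights comes down to a platform check plus broadening the existing `_is_cuda` branches. Below is a minimal sketch of that kind of check, not SGLang's exact utility code; `is_hip`/`is_cuda` here are illustrative stand-ins based on the fact that ROCm builds of PyTorch expose `torch.version.hip`.

```python
import torch

def is_hip() -> bool:
    # ROCm (AMD) builds of PyTorch set torch.version.hip; CUDA builds leave it None.
    return torch.version.hip is not None

def is_cuda() -> bool:
    # NVIDIA builds of PyTorch set torch.version.cuda.
    return torch.version.cuda is not None

_is_hip = is_hip()
_is_cuda = is_cuda()

# The diff reviewed below then widens GPU-only branches from `if _is_cuda:` to
# `if _is_cuda or _is_hip:` so the same Triton kernels run on AMD GPUs.
```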
Code Review
This pull request adds support for the Qwen3-30B-A3B-Thinking-2507 model on AMD GPUs by enabling Triton kernels for HIP. While the changes are generally in the right direction, there is a critical issue with module imports for HIP builds that could lead to runtime errors under certain configurations. Additionally, there is an opportunity to refactor some duplicated code to improve maintainability and potentially extend performance optimizations to the HIP path.
```diff
@@ -44,7 +44,7 @@
 _is_cpu = is_cpu()
 _use_aiter = get_bool_env_var("SGLANG_USE_AITER") and _is_hip

-if _is_cuda:
+if _is_cuda or _is_hip:
```
This change makes the `elif _is_hip:` block at line 51 unreachable. This is problematic because that block contains necessary imports for HIP, such as `vllm_ops` (used for `gelu` activation) and `aiter.moe_sum` (used when `SGLANG_USE_AITER=1`). This will lead to a `NameError` at runtime for certain configurations. Please ensure that all necessary modules are imported for HIP builds.
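One way to address this, sketched here rather than taken from the PR: keep the shared `sgl_kernel` imports in the combined branch, but give HIP its own follow-up block so the ROCm-only imports still execute. The flag definitions are simplified stand-ins for the module-level ones visible in the diff.

```python
import os
import torch

# Simplified stand-ins for the module-level flags in the file under review.
_is_hip = torch.version.hip is not None
_is_cuda = torch.version.cuda is not None
_use_aiter = os.getenv("SGLANG_USE_AITER", "0") == "1" and _is_hip

if _is_cuda or _is_hip:
    # Shared sgl_kernel activations for both NVIDIA and AMD builds.
    from sgl_kernel import gelu_and_mul, silu_and_mul

if _is_hip:
    # ROCm still needs vllm's custom ops (e.g. for the gelu path) ...
    from vllm import _custom_ops as vllm_ops

    if _use_aiter:
        # ... and aiter's moe_sum when SGLANG_USE_AITER=1.
        from aiter import moe_sum
```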
```python
# According to micro benchmark results, torch.compile can get better performance for small token.
if tokens_in_chunk <= 32:
    moe_sum_reduce_torch_compile(
        intermediate_cache3.view(*intermediate_cache3.shape),
        out_hidden_states[begin_chunk_idx:end_chunk_idx],
        routed_scaling_factor,
    )
else:
    moe_sum_reduce_triton(
        intermediate_cache3.view(*intermediate_cache3.shape),
        out_hidden_states[begin_chunk_idx:end_chunk_idx],
        routed_scaling_factor,
    )
```
This block of code for HIP is nearly identical to the CUDA logic at lines 1602-1614. The duplication can be avoided by refactoring and combining the logic for both `_is_cuda` and `_is_hip`, for example by using `elif _is_cuda or _is_hip:`, which would improve maintainability. Additionally, the CUDA path includes optimizations for `topk` values of 1 and 2, which could also be beneficial for the HIP path.
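One possible shape for the shared branch, written as a helper so the `_is_cuda` and `_is_hip` paths could call the same code. This is an illustrative sketch, not the PR's implementation: the `topk == 1` and `topk == 2` shortcuts mirror what the comment says the CUDA path already does, the tensor shapes are assumed to be `(tokens_in_chunk, topk, hidden)`, and `routed_scaling_factor` is assumed to be a plain float.

```python
import torch

def combined_moe_sum_reduce(
    intermediate_cache3: torch.Tensor,  # (tokens_in_chunk, topk, hidden)
    out_chunk: torch.Tensor,            # (tokens_in_chunk, hidden) slice of out_hidden_states
    routed_scaling_factor: float,
    moe_sum_reduce_torch_compile,       # torch.compile-based kernel from the diff
    moe_sum_reduce_triton,              # Triton-based kernel from the diff
) -> None:
    """Sum expert outputs over the topk dimension (illustrative only)."""
    tokens_in_chunk, topk, _ = intermediate_cache3.shape
    if topk == 1:
        # Single routed expert: a scaled copy is enough.
        out_chunk.copy_(intermediate_cache3.squeeze(1))
        out_chunk.mul_(routed_scaling_factor)
    elif topk == 2:
        # Two routed experts: one fused sum along the topk dimension.
        torch.sum(intermediate_cache3, dim=1, out=out_chunk)
        out_chunk.mul_(routed_scaling_factor)
    elif tokens_in_chunk <= 32:
        # Small chunks: the torch.compile-based reduction wins in microbenchmarks.
        moe_sum_reduce_torch_compile(intermediate_cache3, out_chunk, routed_scaling_factor)
    else:
        moe_sum_reduce_triton(intermediate_cache3, out_chunk, routed_scaling_factor)
```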
```diff
@@ -49,6 +49,7 @@
 elif _is_cpu and _is_cpu_amx_available:
     pass
 elif _is_hip:
+    from sgl_kernel import gelu_and_mul, silu_and_mul
     from vllm import _custom_ops as vllm_ops  # gelu_and_mul, silu_and_mul
```
Shall we remove `from vllm import _custom_ops as vllm_ops` for `moe_sum`, or only add it when `_is_hip == True` and `_use_aiter == False`? I think MI250X does not have full aiter support.
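A small sketch of what that guard could look like, using the same simplified flags as the earlier sketch; this is illustrative only, and aiter/vllm availability on a given GPU is an assumption to verify.

```python
import os
import torch

_is_hip = torch.version.hip is not None
_use_aiter = os.getenv("SGLANG_USE_AITER", "0") == "1" and _is_hip

if _is_hip:
    if _use_aiter:
        # SGLANG_USE_AITER=1: use aiter's fused moe_sum.
        from aiter import moe_sum
    else:
        # GPUs without full aiter coverage (e.g. MI250X) keep vllm's custom
        # ops as the moe_sum fallback instead of importing them unconditionally.
        from vllm import _custom_ops as vllm_ops
```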
```diff
@@ -1537,7 +1538,7 @@ def fused_experts_impl(
     gemm1_alpha,
     gemm1_limit,
 )
-elif _is_cuda:
+elif _is_cuda or _is_hip:
```
There is another `gelu_and_mul` that can be imported from `sgl_kernel` when `_is_hip == True`.
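For context, a hedged sketch of the activation step this refers to. The `(input, out)` argument order for `sgl_kernel`'s `gelu_and_mul`/`silu_and_mul` and the helper name `apply_gated_activation` are assumptions, not taken from the PR.

```python
import torch
from sgl_kernel import gelu_and_mul, silu_and_mul  # both exist per the diff above

def apply_gated_activation(activation: str, gate_up: torch.Tensor, out: torch.Tensor) -> None:
    # gate_up holds the concatenated gate/up projections; out receives the activated
    # product. The (input, out) argument order is an assumption about sgl_kernel's API.
    if activation == "silu":
        silu_and_mul(gate_up, out)
    elif activation == "gelu":
        # With gelu_and_mul imported on ROCm as well, this branch no longer
        # needs to fall back to vllm_ops on AMD GPUs.
        gelu_and_mul(gate_up, out)
    else:
        raise ValueError(f"Unsupported activation: {activation}")
```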
Motivation
Add Qwen3-30B-A3B-Thinking model support on AMD GPUs.
Modifications
Enable the Triton backend on AMD GPUs.
Accuracy Tests
```bash
SGLANG_USE_AITER=0 python3 -m sglang.launch_server --model-path Qwen3-30B-A3B-Thinking-2507/ --tp 8 --trust-remote-code --chunked-prefill-size 130172 --max-running-requests 128 --mem-fraction-static 0.85 --attention-backend aiter --enable-torch-compile

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000
```
Results:
- Accuracy: 0.858
- Invalid: 0.000
- Latency: 103.769 s
- Output throughput: 2743.774 token/s
Benchmarking and Profiling
Checklist