Add Qwen3-30B-A3B-Thinking-2507 support on AMD GPUs. #9456


Open
wants to merge 9 commits into main

Conversation

@sogalin (Contributor) commented on Aug 21, 2025

Motivation

Add Qwen3-30B-A3B-Thinking-2507 model support on AMD GPUs.

Modifications

Enable the Triton backend on AMD GPUs.

Accuracy Tests

SGLANG_USE_AITER=0 python3 -m sglang.launch_server --model-path Qwen3-30B-A3B-Thinking-2507/ --tp 8 --trust-remote-code --chunked-prefill-size 130172 --max-running-requests 128 --mem-fraction-static 0.85 --attention-backend aiter --enable-torch-compile

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000
Accuracy: 0.858
Invalid: 0.000
Latency: 103.769 s
Output throughput: 2743.774 token/s

Benchmarking and Profiling

Checklist

@gemini-code-assist bot left a comment


Summary of Changes

Hello @sogalin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the Qwen3-30B-A3B-Thinking-2507 model on AMD GPUs. This is achieved by extending the Triton backend to recognize and utilize HIP (ROCm) environments, specifically adapting the Mixture-of-Experts (MoE) layer operations to function correctly on AMD hardware. Additionally, performance optimizations for MoE sum reduction are implemented, dynamically choosing between torch.compile and Triton-based methods based on token chunk size.

Highlights

  • AMD GPU Support: Adds compatibility for the Qwen3-30B-A3B-Thinking-2507 model to run on AMD GPUs.
  • Triton Backend Extension: Extends the Triton backend to support HIP (ROCm) environments, enabling GPU-specific kernel operations for AMD hardware.
  • MoE Performance Optimization: Implements a conditional strategy for Mixture-of-Experts (MoE) sum reduction, using torch.compile for small token chunks (<= 32 tokens) and a Triton-based method for larger chunks to improve performance; a minimal sketch of this dispatch follows this list.
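
The dispatch in the last highlight is small enough to sketch in full. The helper names below and the eager stand-in for the Triton kernel are illustrative placeholders, not the PR's actual helpers, which live in sglang's fused MoE code:

import torch

# Illustrative stand-ins for the two reduction paths referenced in this PR.
@torch.compile(dynamic=True)
def _sum_reduce_compiled(x: torch.Tensor, out: torch.Tensor, scale: float) -> None:
    # x: (num_tokens, topk, hidden_dim); sum over the routed-expert dim, then scale.
    torch.sum(x, dim=1, out=out)
    out.mul_(scale)

def _sum_reduce_eager(x: torch.Tensor, out: torch.Tensor, scale: float) -> None:
    # Placeholder for the Triton kernel used for larger chunks.
    torch.sum(x, dim=1, out=out)
    out.mul_(scale)

def moe_sum_reduce(x: torch.Tensor, out: torch.Tensor, scale: float) -> None:
    # Small chunks favor the torch.compile path per the PR's micro benchmarks;
    # larger chunks go to the (here stubbed) Triton kernel.
    if x.shape[0] <= 32:
        _sum_reduce_compiled(x, out, scale)
    else:
        _sum_reduce_eager(x, out, scale)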


@gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Qwen3-30B-A3B-Thinking-2507 model on AMD GPUs by enabling Triton kernels for HIP. While the changes are generally in the right direction, there is a critical issue with module imports for HIP builds that could lead to runtime errors under certain configurations. Additionally, there is an opportunity to refactor some duplicated code to improve maintainability and potentially extend performance optimizations to the HIP path.

@@ -44,7 +44,7 @@
 _is_cpu = is_cpu()
 _use_aiter = get_bool_env_var("SGLANG_USE_AITER") and _is_hip

-if _is_cuda:
+if _is_cuda or _is_hip:
Contributor

critical

This change makes the elif _is_hip: block at line 51 unreachable. This is problematic because this block contains necessary imports for HIP, such as vllm_ops (used for gelu activation) and aiter.moe_sum (used when SGLANG_USE_AITER=1). This will lead to a NameError at runtime for certain configurations. Please ensure that all necessary modules are imported for HIP builds.
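
One possible restructuring, sketched under the assumption that the import forms mirror the existing elif _is_hip: block (illustrative only, not the PR's code):

# Keep the shared sgl_kernel imports on the combined branch, and pull in the
# HIP-only modules in a separate guard so they remain reachable.
if _is_cuda or _is_hip:
    from sgl_kernel import gelu_and_mul, silu_and_mul

if _is_hip:
    from vllm import _custom_ops as vllm_ops  # gelu_and_mul / silu_and_mul fallback
    if _use_aiter:
        import aiter  # aiter.moe_sum is used when SGLANG_USE_AITER=1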

Comment on lines +1622 to +1634
# According to micro benchmark results, torch.compile can get better performance for small token.
if tokens_in_chunk <= 32:
    moe_sum_reduce_torch_compile(
        intermediate_cache3.view(*intermediate_cache3.shape),
        out_hidden_states[begin_chunk_idx:end_chunk_idx],
        routed_scaling_factor,
    )
else:
    moe_sum_reduce_triton(
        intermediate_cache3.view(*intermediate_cache3.shape),
        out_hidden_states[begin_chunk_idx:end_chunk_idx],
        routed_scaling_factor,
    )
Contributor

medium

This block of code for HIP is nearly identical to the logic for CUDA in lines 1602-1614. This code duplication can be avoided by refactoring and combining the logic for both _is_cuda and _is_hip, for example by using elif _is_cuda or _is_hip:. This would improve maintainability. Additionally, the CUDA path includes optimizations for topk values of 1 and 2, which could also be beneficial for the HIP path.
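
A hedged sketch of what a shared helper for that combined branch could look like. It assumes moe_sum_reduce_torch_compile and moe_sum_reduce_triton (from the diff above) are in scope; the topk == 1 and topk == 2 shortcuts only mimic the intent of the CUDA fast paths, which may use dedicated kernels instead:

import torch

def combined_moe_sum_reduce(
    intermediate_cache3: torch.Tensor,   # (tokens_in_chunk, topk, hidden_dim)
    out_chunk: torch.Tensor,             # out_hidden_states[begin_chunk_idx:end_chunk_idx]
    routed_scaling_factor: float,
) -> None:
    # Single reduction path that both the _is_cuda and _is_hip branches could call.
    tokens_in_chunk, topk, _ = intermediate_cache3.shape
    if topk == 1:
        # One routed expert: the "sum" is just a scaled copy.
        out_chunk.copy_(intermediate_cache3[:, 0]).mul_(routed_scaling_factor)
    elif topk == 2:
        # Two routed experts: add the pair, then scale in place.
        torch.add(intermediate_cache3[:, 0], intermediate_cache3[:, 1], out=out_chunk)
        out_chunk.mul_(routed_scaling_factor)
    elif tokens_in_chunk <= 32:
        # Small chunks: torch.compile path, as in the diff above.
        moe_sum_reduce_torch_compile(intermediate_cache3, out_chunk, routed_scaling_factor)
    else:
        # Larger chunks: Triton kernel path.
        moe_sum_reduce_triton(intermediate_cache3, out_chunk, routed_scaling_factor)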

@@ -49,6 +49,7 @@
 elif _is_cpu and _is_cpu_amx_available:
     pass
 elif _is_hip:
+    from sgl_kernel import gelu_and_mul, silu_and_mul
     from vllm import _custom_ops as vllm_ops  # gelu_and_mul, silu_and_mul
Collaborator

Shall we remove from vllm import _custom_ops as vllm_ops for moe_sum, or only add it when _is_hip == True and _is_aiter == False? I think MI250X does not have full aiter support.
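
A sketch of that suggestion, assuming the module-level flag is the _use_aiter variable from the earlier hunk (what the comment calls _is_aiter); a fragment of the same import block, illustrative only:

elif _is_hip:
    from sgl_kernel import gelu_and_mul, silu_and_mul
    if not _use_aiter:
        # e.g. MI250X, where aiter support is incomplete: fall back to vllm's ops.
        from vllm import _custom_ops as vllm_ops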

@@ -1537,7 +1538,7 @@ def fused_experts_impl(
                 gemm1_alpha,
                 gemm1_limit,
             )
-        elif _is_cuda:
+        elif _is_cuda or _is_hip:
Collaborator

There is another gelu_and_mul that can be imported from sgl_kernel when _is_hip == True.

@hubertlu-tw hubertlu-tw marked this pull request as draft August 21, 2025 23:33
@hubertlu-tw hubertlu-tw marked this pull request as ready for review August 21, 2025 23:44