Conversation

@kousakawang (Contributor) commented Aug 17, 2025

Motivation

  1. Fix the cutlass MoE unit test.
  2. Add an optimized cutlass groupGemm instance for H20 (set SGL_TUNE_DEVICE_KERNEL=1 to enable the optimization).

Modifications

Accuracy Tests

Benchmarking and Profiling

Deepseek-tp8 on H20

| Batch Size | origin_h20 Cutlass fused_experts Time (ms) | opt_h20 Cutlass fused_experts Time (ms) |
| --- | --- | --- |
| 1 | 0.110 | 0.099 |
| 4 | 0.279 | 0.171 |
| 8 | 0.457 | 0.279 |
| 16 | 0.709 | 0.408 |
| 32 | 1.078 | 0.601 |
| 64 | 1.434 | 0.779 |
| 128 | 1.619 | 0.857 |
| 256 | 1.684 | 0.920 |
| 512 | 1.720 | 0.951 |
| 1024 | 1.798 | 1.022 |

Checklist

@gemini-code-assist bot left a comment

Summary of Changes

Hello @kousakawang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses two key areas: resolving a unit test failure in the Cutlass Mixture of Experts (MoE) module and significantly enhancing the performance of Cutlass group General Matrix Multiply (GEMM) operations on NVIDIA H20 GPUs. The unit test fix involves a minor adjustment to how topk_weights and topk_ids are passed, while the performance improvement for H20 is achieved by introducing a specifically tuned kernel configuration that can be activated via an environment variable. These changes aim to improve the robustness of the test suite and provide substantial speedups for relevant computations on H20 hardware.

Highlights

  • Cutlass MoE Unit Test Fixes and Enhancements: The pull request fixes an issue in the Cutlass Mixture of Experts (MoE) unit tests by adjusting the input format for topk_weights and topk_ids to a tuple, ensuring correct function calls within the test suite. Additionally, the test's default batch sizes were expanded to include 1024.
  • Optimized Cutlass groupGemm for H20 GPUs: This PR introduces significant performance optimizations for Cutlass group GEMM operations specifically targeting NVIDIA H20 GPUs. It adds a new, tuned configuration (sm90_fp8_pp_config_64_128_128_1_2_1) and modifies the kernel dispatch logic to apply this optimization when the SGL_TUNE_DEVICE_KERNEL environment variable is enabled and an H20 device is detected. Benchmarking results show substantial time reductions for fused_experts on H20 across various batch sizes.
  • New Device Type Detection Utility: A new utility function, isDeviceType, has been added to sgl-kernel/include/utils.h. This function allows for runtime detection of the CUDA device type by name, which is crucial for enabling device-specific optimizations like the H20 group GEMM tuning.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces performance optimizations for the CUTLASS MoE group GEMM kernel on NVIDIA H20 GPUs, along with a fix for the corresponding unit test. The optimizations are enabled via an environment variable and demonstrate significant performance improvements. The changes are well-structured, introducing a new utility for device type detection and using macros for kernel configuration. My review includes suggestions to improve the implementation's correctness and maintainability by refining the environment variable check and reusing computed values to avoid redundancy.

```cpp
if (at::cuda::getCurrentDeviceProperties()->multiProcessorCount == 78 && a.size(1) > 128) {
  // For H20 with K > 128, use Pingpong Schedule
  run_get_group_gemm_starts<MmaConfig0::LayoutSFA, MmaConfig0::LayoutSFB, MmaConfig0::ScaleConfig>(
```

```cpp
bool tuning_kernel = getenv("SGL_TUNE_DEVICE_KERNEL") ? true : false;
```

high

The current check for the SGL_TUNE_DEVICE_KERNEL environment variable only verifies its existence, not its value. According to the pull request description, the optimization should be enabled when SGL_TUNE_DEVICE_KERNEL=1. However, this code will enable it for any non-empty string value (e.g., SGL_TUNE_DEVICE_KERNEL=0 would also enable it).

To ensure the intended behavior, you should check for the specific value "1". You could use the existing getBoolEnv helper function from utils.h for this purpose, which correctly checks for the value '1'.

```cpp
bool tuning_kernel = getBoolEnv("SGL_TUNE_DEVICE_KERNEL");
```

```cpp
problem_sizes,
expert_offsets,
workspace);
if (at::cuda::getCurrentDeviceProperties()->multiProcessorCount == 78 && a.size(1) > 128) {
```

medium

For better maintainability and to avoid redundant device property lookups, consider reusing the is_h20 variable which is already computed. This avoids another call to at::cuda::getCurrentDeviceProperties() and makes the condition more readable.

```cpp
if (is_h20 && a.size(1) > 128) {
```

@zhyncs zhyncs merged commit 0fc54b9 into sgl-project:main Aug 17, 2025
65 of 76 checks passed
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
…ce (sgl-project#9272)

Co-authored-by: wanghanpei <wanghanpei@bytedance.com>
@kousakawang (Contributor, Author) commented:

Added cutlass benchmark results for groupGemm:

before vs. after, one row per benchmark configuration:

| expected_m_per_group | n | k | num_groups | before: cutlass (us) | after: cutlass (us) |
| --- | --- | --- | --- | --- | --- |
| 128 | 512 | 7168 | 256 | 968.05 | 971.93 |
| 256 | 512 | 7168 | 256 | 1867.89 | 1868.09 |
| 256 | 256 | 7168 | 256 | 1040.22 | 965.87 |
| 512 | 256 | 7168 | 256 | 1863.43 | 1866.14 |
| 1 | 512 | 7168 | 256 | 960.68 | 516.99 |
| 2 | 256 | 7168 | 256 | 515.74 | 281.11 |
| 256 | 4096 | 7168 | 32 | 1866.24 | 1868.78 |
| 512 | 4096 | 7168 | 16 | 1865.26 | 1866.48 |
| 4 | 4096 | 7168 | 32 | 959.65 | 517.05 |
| 8 | 4096 | 7168 | 16 | 515.00 | 275.83 |
| 1024 | 768 | 4096 | 128 | 3160.90 | 3190.52 |
| 1024 | 4096 | 384 | 128 | 1621.95 | 1634.26 |
| 16 | 768 | 4096 | 128 | 427.31 | 232.89 |
| 16 | 4096 | 384 | 128 | 263.93 | 232.81 |

@HydraQYH (Collaborator) commented:

@kousakawang It seems that using TMA Multicast for operand A has a performance advantage in all cases. I think it would be a good idea to make this optimization the default option instead of enabling it through an environment variable.

@kousakawang (Contributor, Author) commented:

> @kousakawang It seems that using TMA Multicast for operand A has a performance advantage in all cases. I think it would be a good idea to make this optimization the default option instead of enabling it through an environment variable.

OK
