Conversation

@kousakawang (Contributor) commented Aug 17, 2025

Motivation

  1. Fix the cutlass MoE unit test.
  2. Add an optimized cutlass groupGemm instance for H20 (set SGL_TUNE_DEVICE_KERNEL=1 to enable the optimization).

Modifications

Accuracy Tests

Benchmarking and Profiling

Deepseek-tp8 on H20

| Batch Size | origin_h20 Cutlass fused_experts Time (ms) | opt_h20 Cutlass fused_experts Time (ms) |
| --- | --- | --- |
| 1 | 0.110 | 0.099 |
| 4 | 0.279 | 0.171 |
| 8 | 0.457 | 0.279 |
| 16 | 0.709 | 0.408 |
| 32 | 1.078 | 0.601 |
| 64 | 1.434 | 0.779 |
| 128 | 1.619 | 0.857 |
| 256 | 1.684 | 0.920 |
| 512 | 1.720 | 0.951 |
| 1024 | 1.798 | 1.022 |

Checklist

@gemini-code-assist bot left a comment

Summary of Changes

Hello @kousakawang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses two key areas: resolving a unit test failure in the Cutlass Mixture of Experts (MoE) module and significantly enhancing the performance of Cutlass group General Matrix Multiply (GEMM) operations on NVIDIA H20 GPUs. The unit test fix involves a minor adjustment to how topk_weights and topk_ids are passed, while the performance improvement for H20 is achieved by introducing a specifically tuned kernel configuration that can be activated via an environment variable. These changes aim to improve the robustness of the test suite and provide substantial speedups for relevant computations on H20 hardware.

Highlights

  • Cutlass MoE Unit Test Fixes and Enhancements: The pull request fixes an issue in the Cutlass Mixture of Experts (MoE) unit tests by adjusting the input format for topk_weights and topk_ids to a tuple, ensuring correct function calls within the test suite. Additionally, the test's default batch sizes were expanded to include 1024.
  • Optimized Cutlass groupGemm for H20 GPUs: This PR introduces significant performance optimizations for Cutlass group GEMM operations specifically targeting NVIDIA H20 GPUs. It adds a new, tuned configuration (sm90_fp8_pp_config_64_128_128_1_2_1) and modifies the kernel dispatch logic to apply this optimization when the SGL_TUNE_DEVICE_KERNEL environment variable is enabled and an H20 device is detected. Benchmarking results show substantial time reductions for fused_experts on H20 across various batch sizes.
  • New Device Type Detection Utility: A new utility function, isDeviceType, has been added to sgl-kernel/include/utils.h. This function allows for runtime detection of the CUDA device type by name, which is crucial for enabling device-specific optimizations like the H20 group GEMM tuning.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces performance optimizations for the CUTLASS MoE group GEMM kernel on NVIDIA H20 GPUs, along with a fix for the corresponding unit test. The optimizations are enabled via an environment variable and demonstrate significant performance improvements. The changes are well-structured, introducing a new utility for device type detection and using macros for kernel configuration. My review includes suggestions to improve the implementation's correctness and maintainability by refining the environment variable check and reusing computed values to avoid redundancy.

```cpp
if (at::cuda::getCurrentDeviceProperties()->multiProcessorCount == 78 && a.size(1) > 128) {
  // For H20 with K > 128, use Pingpong Schedule
  run_get_group_gemm_starts<MmaConfig0::LayoutSFA, MmaConfig0::LayoutSFB, MmaConfig0::ScaleConfig>(
```

```cpp
bool tuning_kernel = getenv("SGL_TUNE_DEVICE_KERNEL") ? true : false;
```

high

The current check for the SGL_TUNE_DEVICE_KERNEL environment variable only verifies its existence, not its value. According to the pull request description, the optimization should be enabled when SGL_TUNE_DEVICE_KERNEL=1. However, this code will enable it for any non-empty string value (e.g., SGL_TUNE_DEVICE_KERNEL=0 would also enable it).

To ensure the intended behavior, you should check for the specific value "1". You could use the existing getBoolEnv helper function from utils.h for this purpose, which correctly checks for the value '1'.

```cpp
bool tuning_kernel = getBoolEnv("SGL_TUNE_DEVICE_KERNEL");
```

```cpp
problem_sizes,
expert_offsets,
workspace);
if (at::cuda::getCurrentDeviceProperties()->multiProcessorCount == 78 && a.size(1) > 128) {
```

medium

For better maintainability and to avoid redundant device property lookups, consider reusing the is_h20 variable which is already computed. This avoids another call to at::cuda::getCurrentDeviceProperties() and makes the condition more readable.

```cpp
if (is_h20 && a.size(1) > 128) {
```

@zhyncs zhyncs merged commit 0fc54b9 into sgl-project:main Aug 17, 2025
65 of 76 checks passed
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
…ce (sgl-project#9272)

Co-authored-by: wanghanpei <wanghanpei@bytedance.com>
@kousakawang (Contributor, Author) commented:

Added cutlass benchmark results for groupGemm:

before vs. after, one row per benchmark configuration:

| expected_m_per_group | n | k | num_groups | before: cutlass (us) | after: cutlass (us) |
| --- | --- | --- | --- | --- | --- |
| 128 | 512 | 7168 | 256 | 968.05 | 971.93 |
| 256 | 512 | 7168 | 256 | 1867.89 | 1868.09 |
| 256 | 256 | 7168 | 256 | 1040.22 | 965.87 |
| 512 | 256 | 7168 | 256 | 1863.43 | 1866.14 |
| 1 | 512 | 7168 | 256 | 960.68 | 516.99 |
| 2 | 256 | 7168 | 256 | 515.74 | 281.11 |
| 256 | 4096 | 7168 | 32 | 1866.24 | 1868.78 |
| 512 | 4096 | 7168 | 16 | 1865.26 | 1866.48 |
| 4 | 4096 | 7168 | 32 | 959.65 | 517.05 |
| 8 | 4096 | 7168 | 16 | 515.00 | 275.83 |
| 1024 | 768 | 4096 | 128 | 3160.90 | 3190.52 |
| 1024 | 4096 | 384 | 128 | 1621.95 | 1634.26 |
| 16 | 768 | 4096 | 128 | 427.31 | 232.89 |
| 16 | 4096 | 384 | 128 | 263.93 | 232.81 |

@HydraQYH (Collaborator) commented:

@kousakawang It seems that using TMA Multicast for operand A has a performance advantage in all cases. I think it would be a good idea to make this optimization the default option instead of enabling it through an environment variable.

@kousakawang (Contributor, Author) commented:

> @kousakawang It seems that using TMA Multicast for operand A has a performance advantage in all cases. I think it would be a good idea to make this optimization the default option instead of enabling it through an environment variable.

OK
