[fix]: fix cutlass moe ut and opt H20 cutlass groupGemm performance #9272
Conversation
Summary of Changes
Hello @kousakawang, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses two key areas: resolving a unit test failure in the Cutlass Mixture of Experts (MoE) module and significantly enhancing the performance of Cutlass group General Matrix Multiply (GEMM) operations on NVIDIA H20 GPUs. The unit test fix involves a minor adjustment to how topk_weights and topk_ids are passed, while the performance improvement for H20 is achieved by introducing a specifically tuned kernel configuration that can be activated via an environment variable. These changes aim to improve the robustness of the test suite and provide substantial speedups for relevant computations on H20 hardware.
Highlights
- Cutlass MoE Unit Test Fixes and Enhancements: The pull request fixes an issue in the Cutlass Mixture of Experts (MoE) unit tests by adjusting the input format for topk_weights and topk_ids to a tuple, ensuring correct function calls within the test suite. Additionally, the test's default batch sizes were expanded to include 1024.
- Optimized Cutlass groupGemm for H20 GPUs: This PR introduces significant performance optimizations for Cutlass group GEMM operations specifically targeting NVIDIA H20 GPUs. It adds a new, tuned configuration (sm90_fp8_pp_config_64_128_128_1_2_1) and modifies the kernel dispatch logic to apply this optimization when the SGL_TUNE_DEVICE_KERNEL environment variable is enabled and an H20 device is detected. Benchmarking results show substantial time reductions for fused_experts on H20 across various batch sizes.
- New Device Type Detection Utility: A new utility function, isDeviceType, has been added to sgl-kernel/include/utils.h. This function allows for runtime detection of the CUDA device type by name, which is crucial for enabling device-specific optimizations like the H20 group GEMM tuning.
Code Review
This pull request introduces performance optimizations for the CUTLASS MoE group GEMM kernel on NVIDIA H20 GPUs, along with a fix for the corresponding unit test. The optimizations are enabled via an environment variable and demonstrate significant performance improvements. The changes are well-structured, introducing a new utility for device type detection and using macros for kernel configuration. My review includes suggestions to improve the implementation's correctness and maintainability by refining the environment variable check and reusing computed values to avoid redundancy.
if (at::cuda::getCurrentDeviceProperties()->multiProcessorCount == 78 && a.size(1) > 128) {
  // For H20 with K > 128, use Pingpong Schedule
  run_get_group_gemm_starts<MmaConfig0::LayoutSFA, MmaConfig0::LayoutSFB, MmaConfig0::ScaleConfig>(

bool tuning_kernel = getenv("SGL_TUNE_DEVICE_KERNEL") ? true : false;
The current check for the SGL_TUNE_DEVICE_KERNEL environment variable only verifies its existence, not its value. According to the pull request description, the optimization should be enabled when SGL_TUNE_DEVICE_KERNEL=1. However, this code will enable it for any non-empty value (e.g., SGL_TUNE_DEVICE_KERNEL=0 would also enable it).
To ensure the intended behavior, check for the specific value "1". You could use the existing getBoolEnv helper function from utils.h for this purpose, which correctly checks for the value '1'.
bool tuning_kernel = getBoolEnv("SGL_TUNE_DEVICE_KERNEL");
    problem_sizes,
    expert_offsets,
    workspace);

if (at::cuda::getCurrentDeviceProperties()->multiProcessorCount == 78 && a.size(1) > 128) {
…ce (sgl-project#9272) Co-authored-by: wanghanpei <wanghanpei@bytedance.com>
add cutlass benchmark result for groupGemm:

before:
Benchmark: expected_m_per_group=256, n=512, k=7168, num_groups=256
Benchmark: expected_m_per_group=256, n=256, k=7168, num_groups=256
Benchmark: expected_m_per_group=512, n=256, k=7168, num_groups=256
Benchmark: expected_m_per_group=1, n=512, k=7168, num_groups=256
Benchmark: expected_m_per_group=2, n=256, k=7168, num_groups=256
Benchmark: expected_m_per_group=256, n=4096, k=7168, num_groups=32
Benchmark: expected_m_per_group=512, n=4096, k=7168, num_groups=16
Benchmark: expected_m_per_group=4, n=4096, k=7168, num_groups=32
Benchmark: expected_m_per_group=8, n=4096, k=7168, num_groups=16
Benchmark: expected_m_per_group=1024, n=768, k=4096, num_groups=128
Benchmark: expected_m_per_group=1024, n=4096, k=384, num_groups=128
Benchmark: expected_m_per_group=16, n=768, k=4096, num_groups=128
Benchmark: expected_m_per_group=16, n=4096, k=384, num_groups=128

after:
Benchmark: expected_m_per_group=128, n=512, k=7168, num_groups=256
Benchmark: expected_m_per_group=256, n=512, k=7168, num_groups=256
Benchmark: expected_m_per_group=256, n=256, k=7168, num_groups=256
Benchmark: expected_m_per_group=512, n=256, k=7168, num_groups=256
Benchmark: expected_m_per_group=1, n=512, k=7168, num_groups=256
Benchmark: expected_m_per_group=2, n=256, k=7168, num_groups=256
Benchmark: expected_m_per_group=256, n=4096, k=7168, num_groups=32
Benchmark: expected_m_per_group=512, n=4096, k=7168, num_groups=16
Benchmark: expected_m_per_group=4, n=4096, k=7168, num_groups=32
Benchmark: expected_m_per_group=8, n=4096, k=7168, num_groups=16
Benchmark: expected_m_per_group=1024, n=768, k=4096, num_groups=128
Benchmark: expected_m_per_group=1024, n=4096, k=384, num_groups=128
Benchmark: expected_m_per_group=16, n=768, k=4096, num_groups=128
Benchmark: expected_m_per_group=16, n=4096, k=384, num_groups=128
@kousakawang It seems that using TMA Multicast for operand A has a performance advantage in all cases. I think it would be a good idea to make this optimization the default option instead of enabling it through an environment variable.
OK |
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Deepseek-tp8 on H20
Checklist