
[AMD] Remove the deprecated C10_WARP_SIZE #9356


Merged: 2 commits merged into sgl-project:main on Aug 22, 2025

Conversation

hubertlu-tw (Collaborator)

Motivation

C10_WARP_SIZE is deprecated in newer PyTorch versions.

Modifications

Use the GFX architecture to determine the warp size (WARP_SIZE) for ROCm; a minimal before/after sketch follows.
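
For context, here is a minimal before/after sketch of what the change amounts to. The "before" form is assumed from the PR description and the removed c10/macros/Macros.h include; the "after" block is the one quoted verbatim in the review thread below.

// Before (assumed form; WARP_SIZE came from PyTorch's c10 macros):
//   #include <c10/macros/Macros.h>
//   #define WARP_SIZE C10_WARP_SIZE
//
// After (compile-time switch on the GFX architecture, no c10 dependency):
#if defined(__GFX9__) || !defined(__HIP_DEVICE_COMPILE__)
#define WARP_SIZE 64  // GFX9 (CDNA, e.g., MI200/MI300) wavefronts are 64 lanes wide
#else
#define WARP_SIZE 32  // other device targets (e.g., RDNA wave32)
#endif

The !defined(__HIP_DEVICE_COMPILE__) branch means the host-side compilation pass always sees the 64-lane value.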

Accuracy Tests

Benchmarking and Profiling

Checklist

CC: @sogalin @HaiShaw @zhyncs

@gemini-code-assist (bot, Contributor) left a comment:

Summary of Changes

Hello @hubertlu-tw, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the WARP_SIZE definition within sgl-kernel to remove the deprecated C10_WARP_SIZE macro from PyTorch. The change keeps the kernels compatible with newer PyTorch versions by selecting warp_size for ROCm at compile time based on the graphics architecture: 64 when __GFX9__ is defined (or during the host compilation pass), 32 otherwise.

Highlights

  • Deprecation Removal: Replaced the deprecated C10_WARP_SIZE macro, which was previously used for determining WARP_SIZE in ROCm environments.
  • ROCm Warp Size Logic Update: Introduced conditional compilation to determine WARP_SIZE for ROCm. If __GFX9__ is defined or __HIP_DEVICE_COMPILE__ is not defined, WARP_SIZE is set to 64; otherwise, it defaults to 32.
  • Header Clean-up: Removed unnecessary includes for ATen/cuda/CUDAContext.h and c10/macros/Macros.h as they are no longer needed for WARP_SIZE determination.

@gemini-code-assist (bot, Contributor) left a review:

Code Review

This pull request removes the deprecated C10_WARP_SIZE and updates the logic for determining WARP_SIZE for ROCm builds. While removing the deprecated macro is a good improvement, the new logic sets WARP_SIZE to 64 for GFX9 architectures, which is not compatible with the existing warpReduceMax function. This function is hardcoded for a 32-wide warp, and using it with a 64-wide warp will lead to incorrect calculations. I've identified this as a critical issue and provided a suggestion for fixing the reduction function.

Comment on lines +334 to +338
#if defined(__GFX9__) || !defined(__HIP_DEVICE_COMPILE__)
#define WARP_SIZE 64
#else
#define WARP_SIZE 32
#endif
@gemini-code-assist (bot) commented (severity: critical):

This change correctly removes the deprecated C10_WARP_SIZE. However, by defining WARP_SIZE as 64 for GFX9+ architectures, it introduces a bug in the warpReduceMax function (lines 400-407), which is hardcoded for a warp size of 32.

With WARP_SIZE=64, warpReduceMax will only perform a partial reduction, leading to incorrect results. The blockReduceMax function (line 409) will also be affected as it relies on both warpReduceMax and WARP_SIZE.

To fix this, warpReduceMax must be updated to handle a 64-wide warp. Here is a suggested implementation:

__device__ __forceinline__ float warpReduceMax(float value) {
#if WARP_SIZE == 64
  // 64-lane wavefronts (GFX9): first fold the upper 32 lanes into the lower 32.
  value = fmaxf(value, __shfl_xor_sync(FULL_MASK, value, 32));
#endif
  value = fmaxf(value, __shfl_xor_sync(FULL_MASK, value, 16));
  value = fmaxf(value, __shfl_xor_sync(FULL_MASK, value, 8));
  value = fmaxf(value, __shfl_xor_sync(FULL_MASK, value, 4));
  value = fmaxf(value, __shfl_xor_sync(FULL_MASK, value, 2));
  value = fmaxf(value, __shfl_xor_sync(FULL_MASK, value, 1));
  return value;
}
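
The review also notes that blockReduceMax depends on warpReduceMax and WARP_SIZE. As a point of reference only, a warp-size-agnostic blockReduceMax typically looks roughly like the sketch below; the shared-memory sizing, the -FLT_MAX sentinel, and the exact signature are assumptions, not the code in this file.

#include <cfloat>  // for FLT_MAX (assumed available in the translation unit)

__device__ __forceinline__ float blockReduceMax(float value) {
  // One slot per warp; 32 slots cover up to 1024 threads even when WARP_SIZE == 64.
  __shared__ float shared[32];
  const int lane = threadIdx.x % WARP_SIZE;
  const int wid = threadIdx.x / WARP_SIZE;

  value = warpReduceMax(value);        // reduce within each warp/wavefront
  if (lane == 0) shared[wid] = value;  // lane 0 publishes its warp's partial max
  __syncthreads();

  // The first warp reduces the per-warp partial maxima.
  const int num_warps = (blockDim.x + WARP_SIZE - 1) / WARP_SIZE;
  value = (threadIdx.x < num_warps) ? shared[threadIdx.x] : -FLT_MAX;
  if (wid == 0) value = warpReduceMax(value);
  return value;
}

With warpReduceMax fixed as suggested above, a structure like this stays correct for both 32- and 64-lane wavefronts because every warp-dependent quantity (lane, wid, num_warps) is derived from the same WARP_SIZE macro.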

@hubertlu-tw changed the title from "Remove the deprecated C10_WARP_SIZE" to "[AMD] Remove the deprecated C10_WARP_SIZE" on Aug 19, 2025
@zhyncs self-assigned this on Aug 20, 2025
@zhyncs merged commit 704ced1 into sgl-project:main on Aug 22, 2025
91 of 93 checks passed
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request on Aug 22, 2025
xjpang pushed a commit to xjpang/sglang that referenced this pull request on Aug 22, 2025