update sgl-kernel build scripts for CUDA compatibility #5071

fecet · 2025-04-04T20:02:51Z

Motivation

This PR addresses several critical issues in the current build scripts and configuration files for the sgl-kernel:

Ineffective Environment Variable Handling:
Relying on environment variables to set CMake options is ineffective because these settings may not be correctly passed to CMake. Instead, using the -D flag ensures that options are explicitly defined and recognized by the build system.
Overly Permissive Feature Enabling:
The current logic enables SGL_KERNEL_ENABLE_SM100A and SGL_KERNEL_ENABLE_SM90A based on an OR condition, which can inadvertently activate these features even when the CUDA version does not fully support them. By switching to an AND condition, the features are enabled only when the CUDA version is compatible and they are explicitly requested, leading to a more predictable and robust build process.
Premature CUDA Version Selection:
The original scripts default to CUDA 12.8, which is currently unsupported by the latest PyTorch releases. This mismatch causes the scripts to fail at runtime. Adjusting the default CUDA version to 12.6 aligns with the supported versions and meets the requirements for integrating Flash Attention.

Modifications

Workflow Update:
- Modified the GitHub Actions workflow to set the CUDA version to 12.6 instead of 12.8. This change ensures that the build environment is compatible with the supported PyTorch version and prevents runtime failures.
CMakeLists.txt Changes:
- Revised the conditional checks by replacing logical OR (OR) with logical AND (AND). This ensures that features such as SM100A and SM90A are only enabled when both the CUDA version is sufficiently high and the feature is explicitly enabled.
build.sh Adjustments:
- Updated the script to properly set the build flags (ENABLE_SM90A and BUILD_FA3) based on the CUDA version. For CUDA versions below 12, these features are explicitly disabled to prevent build errors.
- Enhanced the handling of configuration options by passing them via CMake’s -D parameters instead of relying solely on environment variables.
New Script Addition:
- Introduced the replace_ptxas.sh script to manage the replacement of the ptxas binary for specific CUDA versions. This addition ensures that the correct version of ptxas is used, thereby maintaining compatibility and consistency across different build environments. This adjustment follows suggestions from [Dao-AILab/flash-attention issue #1453](Main branch compilation on nvcc 12.6 Dao-AILab/flash-attention#1453).

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

FlamingoPg · 2025-04-05T03:57:22Z

nice work～

FlamingoPg · 2025-04-05T04:01:35Z

We need cu118 cu124 cu128 (for b200),can we support cu128?

.github/workflows/release-whl-kernel-cu126.yml

sgl-kernel/build.sh

sgl-kernel/replace_ptxas.sh

fecet · 2025-04-05T06:57:35Z

It currently works on cu118 and cu124 (and 126). For cu128, it seems more work is still needed.

fecet · 2025-04-05T07:10:37Z

Supporting cuda12.8 is a bit beyond the scope of this pull request. The goal of this PR is to fix the original CMakeLists.txt and build scripts.
Can this be merged first? I will study how to support cu128 in a new branch.

.github/workflows/release-whl-kernel-cu124.yml

zhyncs · 2025-04-05T07:57:10Z

@fecet cu118 and cu124 should not be merged. cu124 is for pypi and cu118 is for https://github.com/sgl-project/whl

zhyncs · 2025-04-05T08:00:13Z

sgl-kernel/CMakeLists.txt


-if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.8" OR SGL_KERNEL_ENABLE_SM100A)
+if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.8" AND SGL_KERNEL_ENABLE_SM100A)


I think the OR logic is correct. When building, we only need to consider the CUDA version. If it is equal to or greater than 12.8, we should include the sm100a feature. Alternatively, if the user specifies sm100a, we should also enable this feature as intended.

The building machine can be a GPU or CPU node. Please refer to this link: #5062 (comment)

zhyncs · 2025-04-05T08:00:25Z

sgl-kernel/CMakeLists.txt

@@ -124,8 +123,7 @@ else()
    )
 endif()

-if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.4" OR SGL_KERNEL_ENABLE_SM90A)
-    set(BUILD_FA3 ON)
+if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.4" AND SGL_KERNEL_ENABLE_SM90A)


same as above

If we assume that CMakeLists.txt is only used by build.sh, then this logic would be acceptable. However, users might want to compile and package sglang according to their own requirements (I use conda to package sglang). In that case, the current logic is a bit odd. For example, even if the user explicitly disables the option, it will still be enabled if the CUDA version is high enough; or if the CUDA version is insufficient but the user enables the option, the build will fail.

I use conda to package sglang

ah it's interesting

https://prefix.dev/channels/meta-forge/packages/sgl-kernel :)

fecet · 2025-04-05T08:02:30Z

The current script can actually work on cu128. There were some minor issues in my previous test.
log.txt

zhyncs · 2025-04-05T08:03:29Z

The current script can actually work on cu128. There were some minor issues in my previous test. log.txt

I think latest main can also work for cu128

zhyncs · 2025-04-05T08:04:38Z

ref #5074

zhyncs · 2025-04-05T08:15:30Z

sgl-kernel/CMakeLists.txt

@@ -190,7 +188,7 @@ set(SOURCES

 # set flash-attention sources file
 # BF16 source files
-if (BUILD_FA3)
+if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.3" AND BUILD_FA3)


This is good @yinfan98

zhyncs · 2025-04-05T08:46:56Z

pytorch/manylinux-builder:cuda118
pytorch/manylinux-builder:cuda124
pytorch/manylinux2_28-builder:cuda128

FYI Currently we use these Docker images as base image to build wheel and currently latest main works well.

zhyncs · 2025-04-05T08:48:05Z

I think that you can keep if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.3" AND BUILD_FA3), other parts I need to consider for now.

fecet · 2025-04-05T09:07:46Z

I have reverted the logic to SGL_KERNEL_ENABLE_SM90A and SGL_KERNEL_ENABLE_SM100A, but this indeed undermines the significance of this PR. The current main branch relies on CMakeLists.txt to adjust the options, which leads to coupling between CMakeLists.txt and build.sh, and may confuse users regarding the build behavior.
(for example, the variable ENABLE_SM90A in the script is not actually used, and it is not set to ON or OFF, but rather is a constant 0).

zhyncs · 2025-04-05T09:16:10Z

sgl-kernel/replace_ptxas.sh

+
+CUDA_VERSION=$1
+
+if awk "BEGIN {exit !($CUDA_VERSION >= 12.6 && $CUDA_VERSION < 12.8)}"; then


@hebiao064 @qingquansong May you help verify on your cu126 env?

just saw it, working on it now

if it works on lmsys's image i think it should be fine, let's not blocking this PR

Hi @hebiao064, did you test on bare metal or in docker? Can you tell me your test command?

I think that we don't need this anymore. May you try the latest main with this fix #5245

I didn't see the relationship between them. #5245 should be to solve the error of importing after compilation, while the code here is to solve the problem of unable to compile under cu126. According to Dao-AILab/flash-attention#1453, this should be necessary.

I tried the latest main brach and get

2025-04-11 01:17:35 (10.4 MB/s) - ‘cmake-3.31.1-linux-x86_64.tar.gz’ saved [54989177/54989177] PATH: /opt/cmake/bin:/usr/local/cuda/bin:/opt/rh/devtoolset-9/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin /opt/cmake/bin/cmake cmake version 3.31.1 CMake suite maintained and supported by Kitware (kitware.com/cmake). Looking in indexes: https://download.pytorch.org/whl/cu126 ERROR: Could not find a version that satisfies the requirement torch==2.5.1 (from versions: none) ERROR: No matching distribution found for torch==2.5.1

for build.sh 3.9 12.6

I don’t think so. For cu126, you should also use manylinux 2.28

I should also use torch==2.6.0, here is my result
log.txt

'[29/82] Building CUDA object CMakeFiles/flash_ops.dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o FAILED: CMakeFiles/flash_ops.dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_SM8x -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DU SE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/sgl-kernel/include -I/sgl-kernel/csrc -I/sgl-kernel/build/_deps/repo-cutlass-src/include -I/sgl-kernel/build/_deps/repo-cutlass-src/tools/util/include -I/sgl-kernel/build/_deps/repo-flashinfer-src/include -I/sgl-kernel/build/_deps/repo-flashinfer-src/cs rc -I/sgl-kernel/build/_deps/repo-flash-attention-src/hopper -isystem /opt/python/cp39-cp39/lib/python3.9/site-packages/torch/include -isystem /opt/python/cp39-cp39/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/python/cp39-cp39/include/python3.9 -isystem /usr/local/c uda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexpor t_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA =1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT CMakeFiles/flash_ops. dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o.d -x cu -c /sgl-kernel/build/_deps/repo-flash-attention-src/hopper/ instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference) nvcc error : 'ptxas' core dumped

sgl-kernel/build.sh

sgl-kernel/CMakeLists.txt

zhyncs · 2025-04-05T09:24:40Z

Overall great work! Thanks for your contribution. Once cu126 has been verified by @hebiao064 and @qingquansong. I'll merge it. Thanks!

FlamingoPg · 2025-04-07T06:55:24Z

Others LGTM

FlamingoPg · 2025-04-07T06:56:37Z

Overall great work! Thanks for your contribution. Once cu126 has been verified by @hebiao064 and @qingquansong. I'll merge it. Thanks!

I will try cu126 now

ltm920716 · 2025-04-09T06:49:16Z

The current script can actually work on cu128. There were some minor issues in my previous test. log.txt

I think latest main can also work for cu128

hi @zhyncs
I run lmsysorg/sglang:latest (6af4cd9b0f09) on B200 with driver 570、cuda 12.8, the contaier gets exception：

WARNING: Detected NVIDIA B200 GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA B200 GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA B200 GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA B200 GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA B200 GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA B200 GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA B200 GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA B200 GPU, which is not yet supported in this version of the container
ERROR: No supported GPU(s) detected to run this container

INFO 04-09 06:42:11 __init__.py:190] Automatically detected platform cuda.
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:235: UserWarning:
NVIDIA B200 with CUDA capability sm_100 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
If you want to use the NVIDIA B200 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

so do I need to build from source? and what files should change? thanks

fecet requested review from zhyncs, ispobock, HandH1998, BBuf, yizhang2077, merrymercy and FlamingoPg as code owners April 4, 2025 20:02

fecet force-pushed the ci/cu12-build branch from c567ccf to cad32d6 Compare April 4, 2025 20:03

fecet changed the title ~~Update sgl-kernel Build Scripts for CUDA Compatibility~~ update sgl-kernel build scripts for CUDA compatibility Apr 4, 2025

fecet force-pushed the ci/cu12-build branch 4 times, most recently from 25efcbf to 8c0206e Compare April 5, 2025 03:55

fecet force-pushed the ci/cu12-build branch from 8c0206e to dfd70b5 Compare April 5, 2025 03:59