# update sgl-kernel build scripts for CUDA compatibility #5071
base: main
## Conversation
Force-pushed 25efcbf to 8c0206e.
nice work~

We need cu118, cu124, and cu128 (for B200). Can we support cu128?
Force-pushed 9bacb8d to 3faf0f6.
It currently works on cu118 and cu124 (and cu126). For cu128, it seems more work is still needed.

Supporting cuda12.8 is a bit beyond the scope of this pull request. The goal of this PR is to fix the original CMakeLists.txt and build scripts.

@fecet cu118 and cu124 should not be merged: cu124 is for PyPI and cu118 is for https://github.com/sgl-project/whl
sgl-kernel/CMakeLists.txt (Outdated)

```diff
-if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.8" OR SGL_KERNEL_ENABLE_SM100A)
+if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.8" AND SGL_KERNEL_ENABLE_SM100A)
```
I think the OR logic is correct. When building, we only need to consider the CUDA version: if it is equal to or greater than 12.8, we should include the sm100a feature. Alternatively, if the user explicitly specifies sm100a, we should also enable the feature as intended.
The building machine can be a GPU or CPU node. Please refer to this link: #5062 (comment)
sgl-kernel/CMakeLists.txt (Outdated)

```diff
@@ -124,8 +123,7 @@ else()
   )
 endif()

-if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.4" OR SGL_KERNEL_ENABLE_SM90A)
+if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.4" AND SGL_KERNEL_ENABLE_SM90A)
   set(BUILD_FA3 ON)
```
same as above
If we assume that CMakeLists.txt is only used by build.sh, then this logic would be acceptable. However, users might want to compile and package sglang according to their own requirements (I use conda to package sglang). In that case, the current logic is a bit odd: even if the user explicitly disables the option, it will still be enabled if the CUDA version is high enough; and if the CUDA version is insufficient but the user enables the option, the build will fail.
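The concern can be sketched in shell (a hypothetical `decide` helper, not code from this PR) by emulating how `OR` and `AND` treat an explicit user choice:

```shell
# Hypothetical sketch: how OR vs AND treats an explicit user flag when the
# CUDA version check passes or fails. Not code from this PR.
decide() {
  ver_ok=$1   # 1 if the CUDA version is high enough, else 0
  flag=$2     # user's explicit choice: ON or OFF
  mode=$3     # OR or AND, mirroring the CMake condition
  flag_on=0
  [ "$flag" = "ON" ] && flag_on=1
  if [ "$mode" = "OR" ]; then
    if [ "$ver_ok" -eq 1 ] || [ "$flag_on" -eq 1 ]; then echo enabled; else echo disabled; fi
  else
    if [ "$ver_ok" -eq 1 ] && [ "$flag_on" -eq 1 ]; then echo enabled; else echo disabled; fi
  fi
}
decide 1 OFF OR    # prints "enabled": OR ignores the user's explicit OFF
decide 1 OFF AND   # prints "disabled": AND respects it
decide 0 ON  OR    # prints "enabled": OR enables despite an unsupported CUDA version
```

With `OR`, the first and third cases are exactly the failure modes described above; with `AND`, the user's flag is authoritative.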
> I use conda to package sglang

ah, it's interesting
The current script can actually work on cu128; there were some minor issues in my previous test.

I think the latest main can also work for cu128.

ref #5074
```diff
@@ -190,7 +188,7 @@ set(SOURCES

 # set flash-attention sources file
 # BF16 source files
-if (BUILD_FA3)
+if ("${CUDA_VERSION}" VERSION_GREATER_EQUAL "12.3" AND BUILD_FA3)
```
This is good @yinfan98
I see
FYI: currently we use these Docker images as the base image to build wheels, and the latest main works well.

I think that you can keep

I have reverted the logic to
```shell
CUDA_VERSION=$1

if awk "BEGIN {exit !($CUDA_VERSION >= 12.6 && $CUDA_VERSION < 12.8)}"; then
```
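The `awk` idiom above is worth unpacking: `awk` exits with status `!(cond)`, i.e. 0 (success) when the version lies in [12.6, 12.8), so the shell `if` takes its branch exactly in that range. A small self-contained check (the `in_range` wrapper is hypothetical, added here only for illustration):

```shell
# Wraps the awk range test from the script above so it can be exercised
# with sample versions. awk's exit status (0 = in range) drives the if.
in_range() {
  CUDA_VERSION=$1
  if awk "BEGIN {exit !($CUDA_VERSION >= 12.6 && $CUDA_VERSION < 12.8)}"; then
    echo yes
  else
    echo no
  fi
}
in_range 12.6   # yes: lower bound is inclusive
in_range 12.4   # no: below the range
in_range 12.8   # no: upper bound is exclusive
```

Note that awk compares the versions as floating-point numbers, which works for `major.minor` strings like `12.6` but would silently truncate a full version such as `12.6.77` to `12.6`.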
@hebiao064 @qingquansong Could you help verify on your cu126 env?
just saw it, working on it now
still the same issue from our machine: https://www.diffchecker.com/ROw0JSCD/
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:18:05_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0
Tue Apr 8 21:01:57 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H200 On | 00000000:04:00.0 Off | 0 |
| N/A 31C P0 77W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H200 On | 00000000:23:00.0 Off | 0 |
| N/A 30C P0 76W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H200 On | 00000000:43:00.0 Off | 0 |
| N/A 32C P0 78W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H200 On | 00000000:64:00.0 Off | 0 |
| N/A 27C P0 74W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H200 On | 00000000:84:00.0 Off | 0 |
| N/A 31C P0 76W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H200 On | 00000000:A3:00.0 Off | 0 |
| N/A 28C P0 75W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H200 On | 00000000:C3:00.0 Off | 0 |
| N/A 30C P0 76W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H200 On | 00000000:E4:00.0 Off | 0 |
| N/A 28C P0 76W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```
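Note the mismatch in the output above: the toolkit (nvcc) is 12.6 while the driver reports CUDA 12.4. Build scripts should key off the toolkit version; a sketch of parsing it (the sample line is taken from the log above, and the variable names are illustrative):

```shell
# Extract the toolkit release from nvcc --version output. The sample line is
# copied from the log above; on a real machine feed it `nvcc --version`.
nvcc_line='Cuda compilation tools, release 12.6, V12.6.77'
toolkit_ver=$(printf '%s\n' "$nvcc_line" | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
echo "$toolkit_ver"   # 12.6
```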
If it works on lmsys's image, I think it should be fine; let's not block this PR.
Hi @hebiao064, did you test on bare metal or in Docker? Can you tell me your test command?
I think that we don't need this anymore. Could you try the latest main with this fix #5245?
I don't see the relationship between them. #5245 solves an import error after compilation, while the code here solves the problem of being unable to compile under cu126. According to Dao-AILab/flash-attention#1453, this should be necessary.
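The PR's `replace_ptxas.sh` itself is not shown in this thread; as a rough, hypothetical sketch (the function name and paths are illustrative, not the actual script), swapping `ptxas` while keeping a backup might look like:

```shell
# Hypothetical sketch of a ptxas swap: back up the toolkit's ptxas, then
# substitute a newer binary, so the change is reversible.
replace_ptxas() {
  cuda_home=$1    # e.g. /usr/local/cuda (illustrative)
  new_ptxas=$2    # path to the replacement ptxas binary
  old="$cuda_home/bin/ptxas"
  # Keep the original around once, so repeated runs don't clobber the backup.
  [ -f "$old.orig" ] || cp "$old" "$old.orig"
  cp "$new_ptxas" "$old"
  chmod +x "$old"
}
```

The motivation, per Dao-AILab/flash-attention#1453, is that some ptxas builds crash during compilation (as in the signal-11 log later in this thread) and a newer ptxas avoids the crash.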
I tried the latest main branch and got:
```
2025-04-11 01:17:35 (10.4 MB/s) - ‘cmake-3.31.1-linux-x86_64.tar.gz’ saved [54989177/54989177]
PATH: /opt/cmake/bin:/usr/local/cuda/bin:/opt/rh/devtoolset-9/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
/opt/cmake/bin/cmake
cmake version 3.31.1
CMake suite maintained and supported by Kitware (kitware.com/cmake).
Looking in indexes: https://download.pytorch.org/whl/cu126
ERROR: Could not find a version that satisfies the requirement torch==2.5.1 (from versions: none)
ERROR: No matching distribution found for torch==2.5.1
```
for `build.sh 3.9 12.6`
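The failure above is an index mismatch: the cu126 index at download.pytorch.org carries no torch==2.5.1 wheel. The `cuXXX` tag is just the CUDA version with the dot dropped; a trivial helper (the name `cu_tag` is hypothetical) for deriving it:

```shell
# Derive the cuXXX wheel-index tag (as in https://download.pytorch.org/whl/cu126)
# from a major.minor CUDA version string.
cu_tag() {
  printf 'cu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}
cu_tag 12.6   # cu126
cu_tag 11.8   # cu118
```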
I don’t think so. For cu126, you should also use manylinux 2.28
I should also use torch==2.6.0; here is my result:
log.txt
```
[29/82] Building CUDA object CMakeFiles/flash_ops.dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o
FAILED: CMakeFiles/flash_ops.dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_SM8x -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/sgl-kernel/include -I/sgl-kernel/csrc -I/sgl-kernel/build/_deps/repo-cutlass-src/include -I/sgl-kernel/build/_deps/repo-cutlass-src/tools/util/include -I/sgl-kernel/build/_deps/repo-flashinfer-src/include -I/sgl-kernel/build/_deps/repo-flashinfer-src/csrc -I/sgl-kernel/build/_deps/repo-flash-attention-src/hopper -isystem /opt/python/cp39-cp39/lib/python3.9/site-packages/torch/include -isystem /opt/python/cp39-cp39/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/python/cp39-cp39/include/python3.9 -isystem /usr/local/cuda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT CMakeFiles/flash_ops.dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o.d -x cu -c /sgl-kernel/build/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/_deps/repo-flash-attention-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_softcap_sm90.cu.o
nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference)
nvcc error : 'ptxas' core dumped
```
Overall great work! Thanks for your contribution. Once cu126 has been verified by @hebiao064 and @qingquansong, I'll merge it. Thanks!

Others LGTM

I will try cu126 now

hi @zhyncs

So do I need to build from source, and what files should change? Thanks
## Motivation

This PR addresses several critical issues in the current build scripts and configuration files for the sgl-kernel:

**Ineffective Environment Variable Handling:** Relying on environment variables to set CMake options is ineffective because these settings may not be correctly passed to CMake. Using the `-D` flag instead ensures that options are explicitly defined and recognized by the build system.

**Overly Permissive Feature Enabling:** The current logic enables `SGL_KERNEL_ENABLE_SM100A` and `SGL_KERNEL_ENABLE_SM90A` based on an OR condition, which can inadvertently activate these features even when the CUDA version does not fully support them. By switching to an AND condition, the features are enabled only when the CUDA version is compatible and they are explicitly requested, leading to a more predictable and robust build process.

**Premature CUDA Version Selection:** The original scripts default to CUDA 12.8, which is currently unsupported by the latest PyTorch releases. This mismatch causes the scripts to fail at runtime. Adjusting the default CUDA version to 12.6 aligns with the supported versions and meets the requirements for integrating Flash Attention.
## Modifications

**Workflow Update:**

**CMakeLists.txt Changes:** Replaced logical OR (`OR`) with logical AND (`AND`). This ensures that features such as `SM100A` and `SM90A` are only enabled when both the CUDA version is sufficiently high and the feature is explicitly enabled.

**build.sh Adjustments:** Sets the feature flags (`ENABLE_SM90A` and `BUILD_FA3`) based on the CUDA version; for CUDA versions below 12, these features are explicitly disabled to prevent build errors. Passes options to CMake via `-D` parameters instead of relying solely on environment variables.

**New Script Addition:** Added a `replace_ptxas.sh` script to manage the replacement of the `ptxas` binary for specific CUDA versions. This addition ensures that the correct version of `ptxas` is used, thereby maintaining compatibility and consistency across different build environments. This adjustment follows suggestions from Dao-AILab/flash-attention#1453 (main branch compilation on nvcc 12.6).

## Checklist