
Conversation

qingquansong
Collaborator

@qingquansong qingquansong commented Mar 11, 2025

Motivation

#2965

Modifications

  1. Cherry-picked the current vLLM MoE top-k softmax kernel template (with a fix for a naming typo in token_expert_indices)
  2. Polished the util functions warpReduceMax / blockReduceMax to handle the AMD use case as well.

Tests

Unit tests + benchmarking aligned with the vLLM counterpart.

Checklist

@qingquansong qingquansong marked this pull request as draft March 11, 2025 08:26
@hebiao064 hebiao064 mentioned this pull request Mar 11, 2025
@qingquansong qingquansong force-pushed the qsong/moe-topk-softmax branch 11 times, most recently from b70667a to 6200f4a Compare March 12, 2025 07:11
@qingquansong qingquansong force-pushed the qsong/moe-topk-softmax branch 2 times, most recently from 2786ad3 to acd4fb7 Compare March 13, 2025 04:34
@qingquansong qingquansong force-pushed the qsong/moe-topk-softmax branch from acd4fb7 to 97f4eb0 Compare March 13, 2025 04:35
@qingquansong qingquansong changed the title add moe topk softmax templated from vllm to improve Add moe topk softmax templated from vllm Mar 13, 2025
@qingquansong qingquansong marked this pull request as ready for review March 13, 2025 21:59
@qingquansong qingquansong force-pushed the qsong/moe-topk-softmax branch 2 times, most recently from d1b7bb2 to 6b28b88 Compare March 14, 2025 03:28
@qingquansong qingquansong force-pushed the qsong/moe-topk-softmax branch from 6b28b88 to 2fa0db7 Compare March 14, 2025 03:29
@qingquansong qingquansong force-pushed the qsong/moe-topk-softmax branch from c77f195 to c15a211 Compare March 14, 2025 18:37
@qingquansong qingquansong requested review from BBuf and hebiao064 March 14, 2025 18:41
@zhyncs zhyncs merged commit 61e4433 into sgl-project:main Mar 14, 2025
8 of 9 checks passed
@yiakwy-xpu-ml-framework-team
Contributor

Hi @qingquansong

Once this PR #4432 is merged,

#ifndef USE_ROCM
  #define VLLM_SHFL_XOR_SYNC(var, lane_mask) \
    __shfl_xor_sync(uint32_t(-1), var, lane_mask)
  #define VLLM_SHFL_XOR_SYNC_WIDTH(var, lane_mask, width) \
    __shfl_xor_sync(uint32_t(-1), var, lane_mask, width)
#else
  #define VLLM_SHFL_XOR_SYNC(var, lane_mask) __shfl_xor(var, lane_mask)
  #define VLLM_SHFL_XOR_SYNC_WIDTH(var, lane_mask, width) \
    __shfl_xor(var, lane_mask, width)
#endif

you can use __shfl_xor_sync directly, so there is no need for these lines.

Does that sound good to you?

max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 4));
max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 2));
max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 1));
max_value = fmaxf(max_value, SGLANG_SHFL_XOR_SYNC(0xffffffff, max_value, 16));


The modifications to these functions will be reverted in #4432.

cc @zhyncs

#else
#define SGLANG_SHFL_XOR_SYNC(mask, var, lane_mask) __shfl_xor((var), (lane_mask))
#define SGLANG_SHFL_XOR_SYNC_WIDTH(mask, var, lane_mask, width) __shfl_xor((var), (lane_mask), (width))
#endif
Contributor

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Mar 15, 2025


I will keep these lines since you use the ROCm-specific macro in many places (the plain CUDA operation is no longer safe if we employ this approach). But #4432 is merged, so the macro is no longer needed.

Collaborator Author


Thank you! I'll change those and remove the definition.


const int thread_row_offset = blockIdx.x * num_cols;

cub::Sum sum;


hipCUB is experimental; we can try it, but it introduces new dependencies.

Could we just use a simple reduction kernel?

Collaborator Author

@qingquansong qingquansong Mar 15, 2025


Definitely. We can switch to customized reductions for both max and sum. I'll do it together with the macro change in a follow-up PR. How about the following? I can test correctness on CUDA, and may need your help with testing on an AMD machine.

__device__ __forceinline__ float warpReduceSum(float sum_value) {
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 16);
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 8);
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 4);
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 2);
  sum_value += __shfl_xor_sync(0xffffffff, sum_value, 1);
  return sum_value;
}

__device__ __forceinline__ float blockReduceSum(float sum_value) {
  static __shared__ float warpLevelSums[WARP_SIZE];
  const int laneId = threadIdx.x % WARP_SIZE;
  const int warpId = threadIdx.x / WARP_SIZE;

  sum_value = warpReduceSum(sum_value);

  if (laneId == 0) warpLevelSums[warpId] = sum_value;
  __syncthreads();

  sum_value = (threadIdx.x < blockDim.x / WARP_SIZE) ? warpLevelSums[laneId] : 0;
  if (warpId == 0) sum_value = warpReduceSum(sum_value);

  return sum_value;
}


Contributor

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Mar 15, 2025


Resolved in #4448.

Contributor

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Mar 15, 2025


I also recommend the shfl_xor-based implementation. The old solution from FasterTransformer (later incorporated into TRT-LLM) relies heavily on shared memory for the reduction:

const float maxElem = BlockReduce(tmpStorage).Reduce(threadData, cub::Max());

With a shfl_xor-based implementation you can get better results.
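For concreteness, the cub::Max() BlockReduce path could be replaced by a shfl_xor-based max reduction mirroring the warpReduceSum sketch above. This is illustrative only: the mask constants assume a 32-lane warp, so on AMD's 64-lane wavefront an extra step (or a WARP_SIZE-driven loop) would be needed.

```cuda
// Sketch only: warp-wide max via XOR butterfly, analogous to warpReduceSum.
// Assumes a 32-lane warp; add a mask-32 step for a 64-lane AMD wavefront.
__device__ __forceinline__ float warpReduceMax(float max_value) {
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 16));
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 8));
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 4));
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 2));
  max_value = fmaxf(max_value, __shfl_xor_sync(0xffffffff, max_value, 1));
  return max_value;
}
```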

@qingquansong

Collaborator Author

qingquansong commented Mar 15, 2025

Hi @qingquansong

Once this PR #4432 is merged,

#ifndef USE_ROCM
  #define VLLM_SHFL_XOR_SYNC(var, lane_mask) \
    __shfl_xor_sync(uint32_t(-1), var, lane_mask)
  #define VLLM_SHFL_XOR_SYNC_WIDTH(var, lane_mask, width) \
    __shfl_xor_sync(uint32_t(-1), var, lane_mask, width)
#else
  #define VLLM_SHFL_XOR_SYNC(var, lane_mask) __shfl_xor(var, lane_mask)
  #define VLLM_SHFL_XOR_SYNC_WIDTH(var, lane_mask, width) \
    __shfl_xor(var, lane_mask, width)
#endif

you can use __shfl_xor_sync directly, so there is no need for these lines.

Does that sound good to you?

Sounds great! I'll remove the macro definition and change back to using __shfl_xor_sync directly.
