Closed
Labels
good first issue, help wanted, high priority
Description
Checklist
- 1. If the issue you are raising is a question rather than a feature request, please open a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 2. Please use English; otherwise, the issue will be closed.
Motivation
- Support `silu_and_mul` and `gelu_and_mul` in AMD, removing the current dependencies on vllm `ops.silu_and_mul` and `ops.gelu_and_mul`. Used in `fused_moe_triton.py`. [ROCm] Enable silu_and_mul, gelu_and_mul, gelu_tanh_and_mul in amd platform #4150 @yiakwy-xpu-ml-framework-team
- Remove `from vllm.model_executor.layers.activation import GeluAndMul, SiluAndMul` in `sglang/python/sglang/srt/layers/activation.py`.
- Support `GemmaRMSNorm` and `RMSNorm` in AMD.
- Remove `from vllm.model_executor.layers.layernorm import GemmaRMSNorm, RMSNorm` in `sglang/python/sglang/srt/layers/layernorm.py`.
- Support the `rotary_embedding` kernel in AMD.
- Support `ops.moe_sum` in AMD, removing the dependency on vllm `ops.moe_sum`. Used in `fused_moe_triton.py`.
- Benchmark vllm `ops.moe_align_block_size`, `moe_align_block_size_triton`, and `sgl_moe_align_block_size`, and remove the `num_experts=256` limitation in `sgl_moe_align_block_size`. After this, select the kernel directly from `moe_align_block_size_triton` and `sgl_moe_align_block_size`, and remove the dependency on vllm `ops.moe_align_block_size`. Used in `fused_moe_triton.py`. remove moe_align vllm dep #4249 & refine sgl_moe_align_block_size_benchmark #4327
- Implement `scaled_int8_quant` in sgl-kernel and remove the current dependency on vllm `ops.scaled_int8_quant`. Used in `fused_moe_triton.py`. @zcnrex
- Implement `per_token_group_quant_int8` in CUDA, replacing the current Triton `per_token_group_quant_int8` implementation. Used in `fused_moe_triton.py`. @zcnrex
- Support `sglang_per_token_group_quant_fp8` in AMD. Used in `fused_moe_triton.py`. [tools] add fp8 max/min constant in utils #3959, [ROCm] Enable per token group quant fp8 in amd #3702 @yiakwy-xpu-ml-framework-team
- Implement the `scaled_fp8_quant` kernel and remove the current dependency on vllm `ops.scaled_fp8_quant`. (In progress, 50% complete: see [quant kernel] sgl-kernel support per_tensor_quant fp8 #3786 for per-tensor support; @hebiao064 is working on per-token support. vllm `ops.scaled_fp8_quant` supports both per-tensor and per-token.) Used in `fused_moe_triton.py`, `layer.py`, and `fp8.py`. @BBuf @hebiao064 Add sgl_per_token_quant_fp8 #4089, [Refactor] Reducing code duplication across FP8 CUDA quantization kernels #4163, https://github.com/sgl-project/sglang/pull/4231, https://github.com/sgl-project/sglang/pull/4215
- Support the `apply_rope_with_cos_sin_cache_inplace` kernel in AMD, removing the current dependency on vllm `ops.rotary_embedding`. Used in `rotary_embedding.py`.
- Implement the `topk_softmax` kernel and remove the current dependency on vllm `ops.topk_softmax`. Used in `topk.py`. Add moe topk softmax templated from vllm #4302
- Support `topk_softmax` in AMD. [ROCm] enable moe topk softmax in amd #4448
- Remove vllm `ops.topk_softmax` in `python/sglang/srt/layers/moe/topk.py`. remove vllm ops.topk_softmax dependency #4498
- Implement the `awq_dequantize` kernel and remove the current dependency on vllm `ops.awq_dequantize`. Used in `deepseek_nextn.py` and `deepseek_v2.py`. Add awq dequantize kernel to sgl with 1x to 3x speedup #4104 @zcnrex
Related resources
No response