
Commit 8ef8859

Authored by: msinnha1, zhyncs, DefTruth, fzyzcjy, guoyuhong
Rebase_4_6_0_post_1 to master_next (sgl-project#31)
* fix: update pr-test-sgl-kernel (sgl-project#5399)
* kernel: support slightly faster merge_state_v2 cuda kernel (sgl-project#5381)
* chore: bump sgl-kernel 0.0.9 (sgl-project#5400)
* chore: upgrade sgl-kernel 0.0.9 (sgl-project#5401)
* Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (sgl-project#5406)
* Fix bench_serving with random-ids (sgl-project#5214)
* [misc] fix ci flaky case (sgl-project#5352)
* [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (sgl-project#5412)
* Support dynamic connection and TP 16 (sgl-project#5351) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
* Fix broadcast use cuda device lead to memory capacity unbalanced (sgl-project#5416)
* [PD] Fix dynamic port support and MLA buffer for Mooncake (sgl-project#5415) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Co-authored-by: ybyang <ybyang7@iflytek.com>
* Distinguish bootstrap key only in decode server (sgl-project#5422)
* [PD] Remove unused bootstrap param and fix port table type (sgl-project#5423)
* [minor] cleanup cmakelists.txt (sgl-project#5420)
* bugfix: fix merge_state_v2 cuda graph (sgl-project#5419)
* chore: bump sgl-kernel v0.0.9.post1 (sgl-project#5430)
* fix: solve release issue (sgl-project#5434)
* Blackwell cutlass mla: Add check for bad page size/block num combinations (sgl-project#5431)
* feat: update model_specific_adjustment (sgl-project#5344) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
* chore: upgrade sgl-kernel 0.0.9.post1 (sgl-project#5436)
* Fix ignore_eos parameter when loading a chat template (sgl-project#5264)
* add attention backend supporting matrix in the doc (sgl-project#5211) Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
* Support BNB quantization for llama/mllama (sgl-project#5038) Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
* [Docs] Update start/install.md (sgl-project#5398)
* [Minor] Move torch.compile patch to a better place (sgl-project#5397)
* [Bug fix] need record start time in pd mode (sgl-project#5425)
* Support MHA with chunked prefix cache for DeepSeek chunked prefill (sgl-project#5113)
* chore: bump v0.4.5.post1 (sgl-project#5445)
* Fix several minor issues in PD disaggregation (sgl-project#5444)
* [doc] Update benchmark_and_profiling.md (sgl-project#5449)
* Update cutlass dependency. (sgl-project#5447)
* add multi-lora feature in README.md (sgl-project#5463)
* Clean up imports (sgl-project#5467)
* [verl] Modify the update_weights func to align with verl's resharding (sgl-project#5345) Co-authored-by: Chayenne <zhaochen20@outlook.com>
* [Model Support] unsloth/Phi-4-mini bnb model (sgl-project#4982) Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>
* Update attention_backend.md: plural form (sgl-project#5489)
* Add test for flash_attn_varlen_func kernel (sgl-project#5484)
* Deprecate disable-mla (sgl-project#5481)
* Deprecate enable-flashinfer-mla and enable-flashmla (sgl-project#5480)
* Feat/support encoder model (like bert) (sgl-project#4887)
* Enable local attention during decode (sgl-project#5479)
* Refactor DeepSeek decoder layer branches (sgl-project#5205)
* Fix a link in sgl-kernel/README.md (sgl-project#5493)
* [Bug fix] use correct func path in deepseek (sgl-project#5496) Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
* Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the instability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B (sgl-project#5503)
* [Feat] Update sgl-kernel flashinfer to latest main version (sgl-project#5500) Co-authored-by: zhyncs <me@zhyncs.com>
* Fix: Incorrect parameters passed to forward_batch_generation (sgl-project#5506) (sgl-project#5511)
* Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … (sgl-project#5426) Co-authored-by: ocss884 <ocss.lin@gmail.com>
* [docs] Fix several consistency issues in sampling_params.md (sgl-project#5373) Signed-off-by: windsonsea <haifeng.yao@daocloud.io> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
* Configuration qwen2_moe.py - qkv_bias now in transformers (sgl-project#5512)
* Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (sgl-project#4836)
* Sgl kernel fused_moe_gate support n_shared_experts (sgl-project#5440)
* chore: bump sgl-kernel 0.0.9.post2 (sgl-project#5518)
* use sglang_per_token_group_quant_fp8 from sgl-kernel instead of Triton kernel (sgl-project#5473) Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
* fix kimi vl running bug after rebase main (sgl-project#5461)
* fix bug of VLLM_AVAILABLE not defined (sgl-project#5497)
* Avoid computing lse in Ragged Prefill when there's no prefix. (sgl-project#5476) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
* [Model] Adding Qwen3 and Qwen3MoE (sgl-project#4693)
* fix util import (sgl-project#5542)
* Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (sgl-project#5544)
* chore: upgrade sgl-kernel 0.0.9.post2 (sgl-project#5540)
* Fix DeepGEMM masked cannot be run on groups not being multiple of 4 (sgl-project#5340)
* Make profiler output file names consistent (sgl-project#5548)
* [PD] Tiny fix timeout error when generate (sgl-project#5545)
* [PD] Fix no cache connect for receiver (sgl-project#5534)
* feat: use flashinfer jit package (sgl-project#5547)
* [PD] Remove the requirement of config file for mooncake backend (sgl-project#5460)
* restruct compressed_tensors_w8a8_fp8 (sgl-project#5475)
* simplify the control logic for using shared experts fusion (sgl-project#5504)
* Remove one kernel in per_tensor_quant_mla_fp8 (sgl-project#5549)
* Fix sampler nan check when calling top_k_top_p_sampling_from_probs (sgl-project#5546)
* [PD] Support page size > 1 (sgl-project#5561)
* fix hicache write back (sgl-project#5543)
* Minor update for ROCm variable style (sgl-project#5562)
* Fix bench_one_batch producing unnatural results for expert parallel (sgl-project#5149)
* [perf] introduce deep gemm group_gemm_masked as bmm (sgl-project#5432)
* [PD] Fix DeepSeek cannot be run on latest master (sgl-project#5568)
* Fix BumpAllocator error when no input_ids (sgl-project#5564)
* enable DeepSeek V3 shared_experts_fusion in sm90 (sgl-project#5571)
* [Fix] fix outlines and xgrammar (sgl-project#4947)
* [Doc] Add instruction for profiling with bench_one_batch (sgl-project#5581)
* Release v0.4.5.post2 (sgl-project#5582)
* Fix bench_serving fail when zero warmup requests (sgl-project#5574)
* Fix DeepEP cannot run on latest master (sgl-project#5567)
* Fix torch memory saver not enabled in DP scenario (sgl-project#5560)
* Super tiny fix typo (sgl-project#5559)
* Add document for LoRA serving (sgl-project#5521)
* Tiny improve error message (sgl-project#5526)
* [PD] Fix server crash when using batch requests (sgl-project#5531)
* [Feat] upgrade pytorch2.6 (sgl-project#5417)
* Fix enable chunked prefill for Llama4 (sgl-project#5575)
* fix: use fa3 for gemma2 (sgl-project#5586)
* Fix ChatCompletionMessageGenericParam to allow for None content (sgl-project#5452)
* [PD] Fix large page size + chunk prefill (sgl-project#5588)
* Add test config yamls for Deepseek v3 (sgl-project#5433)
* [Feature] Prefill assistant response - add continue_final_message parameter (sgl-project#4226) Co-authored-by: Chayenne <zhaochen20@outlook.com>
* add function call parser for DeepSeek V3 (sgl-project#5224)
* smaller and non gated models for docs (sgl-project#5378)
* Feat: Implement JSON Mode (response_format.type="json_object") (sgl-project#4733) Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net>
* check marlin format before attempting conversion (sgl-project#4675)
* compressed_tensors: port w8a16 fp8 from vllm (sgl-project#4852)
* Fix one more issue reported by torchfix (sgl-project#4859)
* Add sanity check for max_running_requests (sgl-project#5016)
* Correct grafana heatmap. (sgl-project#5019)
* Perform Batch Tokenization. (sgl-project#5141)
* Speedup shared expert weight construction by avoid cloning (sgl-project#5188)
* Tiny add Engine.flush_cache API (sgl-project#5241)
* [misc] remove is_cuda_available (sgl-project#5319)
* Fix flush cache (sgl-project#5590)
* Add Speculative Decoding Eagle3 topk > 1 (sgl-project#5318) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
* upstream hicache fixes (sgl-project#5570)
* Tiny add warning when cannot recognize bool env var (sgl-project#5348)
* Modify metrics service endpoint (sgl-project#3443)
* Update protocol.py to fix sgl-project#4589 (sgl-project#4590)
* [Feat.] Enable grafana to show metrics (sgl-project#4718) Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
* [Fix] Enhance DP Attention for IPv6 Compatibility (sgl-project#4937)
* Support o1 model on Azure (sgl-project#4980) Co-authored-by: Shan Yu <shanyu1@g.ucla.edu>
* Tiny remove duplicated code (sgl-project#5021)
* Tiny update error hint (sgl-project#5037)
* Support PD bootstrap fields on /v1/chat/completions endpoint (sgl-project#5488)
* [PD] Fix generate endpoint of min_lb for PD (sgl-project#5598) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* [PD] Fix edge case and simplify large page size + chunked prefill (sgl-project#5589)
* [PD] Add NIXL transfer backend (sgl-project#5477)
* [PD] Support decode overlap schedule (sgl-project#5608)
* [PD] Support prefill overlap + Ensure no race condition (sgl-project#5609)
* Enhance GPU memory settings (sgl-project#5604)
* [feature] enable pre compile jit deep_gemm (sgl-project#5580)
* Clean up mem settings (sgl-project#5610)
* Support aiter RMSNorm in AMD (sgl-project#5510) Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
* chore: bump v0.4.5.post3 (sgl-project#5611)
* Remove extra copy in deepseek forward absorb (sgl-project#5578) Co-authored-by: saienduri <saimanas.enduri@amd.com>
* [Doc] Fix a 404 link to llama-405b (sgl-project#5615) Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* [fix] force use deepgemm in compile_deep_gemm (sgl-project#5618)
* [fix] fix compile_deep_gemm missing kv_b_proj (sgl-project#5620)
* fix: gemma 3 not use softcap (sgl-project#5622)
* Fix FA3 DeepSeek prefill performance regression (sgl-project#5624) Co-authored-by: ispobock <ispobaoke@gmail.com>
* [NFC] Remove duplicate `compressed-tensors` (sgl-project#5640)
* Fix shared experts fusion error without quantization (sgl-project#5632)
* [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 (sgl-project#5641) Co-authored-by: yuethe <yuethe@tencent.com>
* fix flashmla bug (sgl-project#5272)
* [fix] reduce dp capture bs (sgl-project#5634) Co-authored-by: alcanerian <alcanerian@gmail.com>
* Remove q concat in FA3 backend for DeepSeek decode (sgl-project#5638)
* Revert "Support aiter RMSNorm in AMD" (sgl-project#5646)
* fix: update bench_speculative (sgl-project#5649)
* Turn on DeepGemm By Default and Update Doc (sgl-project#5628)
* Fuse q_a_proj and kv_a_proj (sgl-project#5619)
* Remove unnecessary `torch.full` in DeepSeek (sgl-project#5601)
* [1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell (sgl-project#5281)
* fix sgl-kernel unit tests (sgl-project#5666)
* fix awq_dequantize import (sgl-project#5669)
* Integrating PD disaggregation with DP attention and DeepEP (sgl-project#5435) Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
* fix gemma3 unit test (sgl-project#5670)
* fix torchvision::nms not exist (sgl-project#5671)
* [PD] Add support for dp attention with mooncake (sgl-project#5530) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py (sgl-project#5677)
* [Doc] Fix two 404 links caused by sglang typo (sgl-project#5667) Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* fix: update truss bench_serving (sgl-project#5683)
* fix: only compile ApplyTokenBitmaskInplace cu124+ (sgl-project#5686)
* chore: bump sgl-kernel 0.1.0 (sgl-project#5688)
* vlm: enable radix cache for qwen-vl models (sgl-project#5349) Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
* [BugFix] Fix combination of MTP and `--n-share-experts-fusion` with R1 (sgl-project#5707)
* Fix weight loading bug for Deepseek v3+nextn (sgl-project#5684)
* Add example to use sgl engine with fastapi (sgl-project#5648) Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
* [Doc] Fix a link to Weilin Zhao (sgl-project#5706) Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* Add MMMU benchmark results (sgl-project#4491) Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
* [Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct) (sgl-project#5078) Co-authored-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
* [PD] Better logs (sgl-project#5715)
* [PD] Add kvargs table and thread pool for kvcache sender of mooncake (sgl-project#5738) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* [PD]: Support Multi Prefill in one node (sgl-project#5704) Co-authored-by: shuaills <shishuaiuoe@gmail.com>
* Fix: deepseek forward absorb (sgl-project#5723) Co-authored-by: ispobock <ispobaoke@163.com>
* Pin torch audio to 2.6.0 (sgl-project#5750)
* Revert "[Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct)" (sgl-project#5754)
* Disable flaky eagle tests (sgl-project#5753)
* update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. (sgl-project#5740)
* [Docs] Update runtime/engine/readme.md (sgl-project#5737) Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* Reorder loop in shared expert weight loading (sgl-project#5719)
* fix: fix one more bug from merging mm_inputs (sgl-project#5718) Co-authored-by: Xinyuan Tong <justinning0323@outlook.com> Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>
* [Fix]: support deepseek-vl2-tiny model (sgl-project#5552) Co-authored-by: bppps <zouyu.zzx@alibaba-inc.com>
* Bugfix for minicpmo vision test (sgl-project#5760)
* [Minor] fix documentations (sgl-project#5756)
* Add an assertion to enhance the robustness of the operator (sgl-project#5736)
* fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 (sgl-project#5733)
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)
* [fix] fix potential bumpy throughput with deepgemm (sgl-project#5722)
* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)
* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)
* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)
* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)
* Fuse MLA set kv cache kernel (sgl-project#5748)
* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)
* [feature] support for roberta embedding models (sgl-project#5730)
* [fix] fix bench_one_batch_server (sgl-project#5607)
* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)
* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)
* Add Llama 4 to FA3 test (sgl-project#5509)
* [misc] more decode step log for batch_one_batch (sgl-project#5565)
* Handle JSONDecodeError while processing request data (sgl-project#5599)
* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)
* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)
* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)
* Add memory_saver check (sgl-project#4986) Signed-off-by: Kebe <mail@kebe7jun.com>
* add switch to disable open api doc (sgl-project#3744) Signed-off-by: congcongke <zhanweidu@163.com>
* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)
* Fix eagle test case (sgl-project#5776)
* Split local attention test from fa3 test (sgl-project#5774)
* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)
* Simplify FA3 tests (sgl-project#5779)
* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)
* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)
* [CI] Tune threshold (sgl-project#5787)
* [CI] fix port conflicts (sgl-project#5789)
* [CI] Fix ci tests (sgl-project#5769)
* [PD] Reduce kv transfer threads (sgl-project#5791)
* [CI] Fix test case (sgl-project#5790)
* Add 8-GPU Test for Deepseek-V3 (sgl-project#5691) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
* Release v0.4.6 (sgl-project#5795)
* Update nightly-test.yml (sgl-project#5797)
* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)
* [Docs] update grafana setup guide in production metrics (sgl-project#5643) Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>
* [Misc] add structure logging, write to file and log tracing for SGL Router
* Improve overlap scheduling (sgl-project#5788)
* Add Cutlass MLA attention backend (sgl-project#5390)
* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)
* Dockerfile.dev pip scikit_build_core (sgl-project#5807)
* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)
* Turn on overlap scheduler for multimodal models (sgl-project#5771)
* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)
* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)
* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)
* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551) Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>
* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)
* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)
* fused moe triton tuning script support qwen3 (sgl-project#5842)
* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)
* [PD] support pd fake transfer for warmup (sgl-project#5726)
* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)
* [Doc] Recover history of server_arguments.md (sgl-project#5851)
* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)
* [CI] test chunked prefill more (sgl-project#5798)
* ROCm: update AITER (sgl-project#5816)
* [Feat] QWen-1M context support [1/2]: Update block sparse attention backend utils kernel (sgl-project#5847) Co-authored-by: sighingnow <sighingnow@gmail.com>
* [Fix] Missing bootstrap_port field (sgl-project#5823)
* feat: update is_fa3_default_architecture (sgl-project#5854)
* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)
* chore: bump v0.4.6.post1 (sgl-project#5845)
* fix for hpu backend in model runner and server args Signed-off-by: Mohit Sinha <msinha@habana.ai>
* rebase formatting issue Signed-off-by: Mohit Sinha <msinha@habana.ai>
* [SW-228218]: Fix device mismatch in frequency penalty. Ensure tensors in BatchedFrequencyPenalizer are on the same device by moving output_ids and frequency_penalties to the device of cumulated_frequency_penalties. This resolves a RuntimeError caused by tensors on cpu and hpu:0 during logits subtraction.

---------

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: congcongke <zhanweidu@163.com>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Zhaoyang Hao <77828610+Muuuchen@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: lambert0312 <lambert80.ios@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: mRSun15 <3150105645@zju.edu.cn>
Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com>
Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: BearBiscuit <55008898+BearBiscuit05@users.noreply.github.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Didier Durand <durand.didier@gmail.com>
Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com>
Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com>
Co-authored-by: PGFLMG <1106310035@qq.com>
Co-authored-by: u4lr451 <u4lr451@gmail.com>
Co-authored-by: ocss884 <ocss.lin@gmail.com>
Co-authored-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com>
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Co-authored-by: liwenju0 <like4hub@gmail.com>
Co-authored-by: Wenxuan Tan <wtan45@wisc.edu>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com>
Co-authored-by: AmadeusW <41280211+Amadeus-Winarto@users.noreply.github.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: Yi Zhou <zhouyi920521@gmail.com>
Co-authored-by: simveit <69345428+simveit@users.noreply.github.com>
Co-authored-by: kyle-pena-kuzco <kyle.pena@kuzco.xyz>
Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net>
Co-authored-by: Enrique Shockwave <33002121+qeternity@users.noreply.github.com>
Co-authored-by: Juwan Yoo <ryan@tmfi.us>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: mac0ne <mac0ne@users.noreply.github.com>
Co-authored-by: Sundara Raman Ramachandran <sundar24295@gmail.com>
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: moontidef <53668275+relic-yuexi@users.noreply.github.com>
Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com>
Co-authored-by: Lucius <souzou@foxmail.com>
Co-authored-by: Chuyue Sun <33578456+ChuyueSun@users.noreply.github.com>
Co-authored-by: Shan Yu <shanyu1@g.ucla.edu>
Co-authored-by: Yongtong Wu <914554688@qq.com>
Co-authored-by: michael-amd <Michael.Zhang@amd.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Connector Switch <c8ef@outlook.com>
Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com>
Co-authored-by: yuethe <yuethe@tencent.com>
Co-authored-by: alcanerian <alcanerian@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Ravi Theja <ravi03071991@gmail.com>
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
Co-authored-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
Co-authored-by: IAN <50618241+hcyz33@users.noreply.github.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: ZXN <44322223+bppps@users.noreply.github.com>
Co-authored-by: bppps <zouyu.zzx@alibaba-inc.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: vzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: DavidBao <121073073+DavidBao03@users.noreply.github.com>
Co-authored-by: Frankey_8080 <32973306+Frank-Jie@users.noreply.github.com>
Co-authored-by: yan97ao <580776+yan97ao@users.noreply.github.com>
Co-authored-by: aoshen524 <aoshen524@gmail.com>
Co-authored-by: Michał Moskal <michal@moskal.me>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: zhanweidu <zhanweidu@163.com>
Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
Co-authored-by: JiLi <leege233@gmail.com>
Co-authored-by: sighingnow <sighingnow@gmail.com>
Co-authored-by: XTY <xutianyi1999@live.com>
Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>
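The [SW-228218] entry above describes the fix pattern in words: move `output_ids` and `frequency_penalties` onto the device of `cumulated_frequency_penalties` before the accumulate-and-subtract step. The sketch below illustrates that pattern only; the function name and tensor shapes are assumptions for illustration, not the actual `BatchedFrequencyPenalizer` code.

```python
import torch

def apply_frequency_penalty(logits, output_ids, frequency_penalties,
                            cumulated_frequency_penalties):
    """Illustrative sketch of the device-alignment fix (hypothetical signature).

    logits / cumulated_frequency_penalties: (batch, vocab)
    output_ids: (batch, 1) token ids sampled this step
    frequency_penalties: (batch,) per-request penalty values
    """
    # The fix: align every input to the accumulator's device (e.g. hpu:0)
    # so mixed cpu/accelerator tensors cannot raise a RuntimeError below.
    device = cumulated_frequency_penalties.device
    output_ids = output_ids.to(device)
    frequency_penalties = frequency_penalties.to(device)

    # Accumulate the penalty at each newly generated token id.
    cumulated_frequency_penalties.scatter_add_(
        dim=1,
        index=output_ids,
        src=frequency_penalties.unsqueeze(1).expand_as(output_ids),
    )
    # The logits subtraction that previously failed on mismatched devices.
    return logits - cumulated_frequency_penalties
```

The key design point is to pick one canonical device (the accumulator's) and normalize everything else to it, rather than guessing which side of the transfer is cheaper per call site.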
1 parent: d10c29d · commit: 8ef8859

File tree

293 files changed: +17016 −3935 lines


.github/workflows/nightly-test.yml

Lines changed: 0 additions & 1 deletion
```diff
@@ -25,7 +25,6 @@ jobs:
       - name: Install dependencies
         run: |
           bash scripts/ci_install_dependency.sh
-          pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
 
       - name: Run test
         timeout-minutes: 120
```

.github/workflows/pr-test-amd.yml

Lines changed: 7 additions & 7 deletions
```diff
@@ -38,12 +38,12 @@ jobs:
           else
             DEVICE_FLAG="--device /dev/dri"
           fi
-          docker pull lmsysorg/sglang:v0.4.5-rocm630
+          docker pull ghcr.io/saienduri/sglang-aiter-v0.1.1:428
           docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
             -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
             --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
             -w /sglang-checkout --name ci_sglang \
-            lmsysorg/sglang:v0.4.5-rocm630
+            ghcr.io/saienduri/sglang-aiter-v0.1.1:428
 
       - name: Install dependencies
         run: |
@@ -82,12 +82,12 @@ jobs:
           else
             DEVICE_FLAG="--device /dev/dri"
           fi
-          docker pull lmsysorg/sglang:v0.4.5-rocm630
+          docker pull ghcr.io/saienduri/sglang-aiter-v0.1.1:428
           docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
             -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
             --cap-add=SYS_PTRACE -e HF_TOKEN=${{ secrets.AMD_HF_TOKEN }} --security-opt seccomp=unconfined \
             -w /sglang-checkout --name ci_sglang \
-            lmsysorg/sglang:v0.4.5-rocm630
+            ghcr.io/saienduri/sglang-aiter-v0.1.1:428
 
       - name: Install dependencies
         run: |
@@ -120,12 +120,12 @@ jobs:
           else
             DEVICE_FLAG="--device /dev/dri"
           fi
-          docker pull lmsysorg/sglang:v0.4.5-rocm630
+          docker pull ghcr.io/saienduri/sglang-aiter-v0.1.1:428
           docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
             -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
             --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
             -w /sglang-checkout --name ci_sglang \
-            lmsysorg/sglang:v0.4.5-rocm630
+            ghcr.io/saienduri/sglang-aiter-v0.1.1:428
 
       - name: Install dependencies
         run: |
@@ -149,7 +149,7 @@ jobs:
   finish:
     if: always()
     needs: [
-      accuracy-test-1-gpu-amd, mla-test-1-gpu-amd
+      accuracy-test-1-gpu-amd, mla-test-1-gpu-amd, bench-test-2-gpu-amd
     ]
     runs-on: ubuntu-latest
     steps:
```

.github/workflows/pr-test-sgl-kernel.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -88,7 +88,7 @@ jobs:
       - name: Install
         run: |
           bash scripts/ci_install_dependency.sh
-          pip3 install torch==2.5.1 && pip3 install pytest
+          pip3 install torch==2.6.0 torchvision && pip3 install pytest
           pip3 uninstall sgl-kernel -y || true
           pip3 install sgl-kernel/dist/*whl --force-reinstall --no-deps
           pip3 list | grep sgl-kernel
```

.github/workflows/pr-test.yml

Lines changed: 25 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -38,8 +38,6 @@ jobs:
3838
uses: actions/checkout@v4
3939

4040
- name: Install dependencies
41-
env:
42-
FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer-python' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python' }}
4341
run: |
4442
bash scripts/ci_install_dependency.sh
4543
@@ -56,22 +54,20 @@ jobs:
5654
strategy:
5755
fail-fast: false
5856
matrix:
59-
part: [0, 1, 2, 3, 4, 5, 6]
57+
part: [0, 1, 2, 3, 4, 5, 6, 7]
6058
steps:
6159
- name: Checkout code
6260
uses: actions/checkout@v4
6361

6462
- name: Install dependencies
65-
env:
66-
FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer-python' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python' }}
6763
run: |
6864
bash scripts/ci_install_dependency.sh
6965
7066
- name: Run test
71-
timeout-minutes: 40
67+
timeout-minutes: 30
7268
run: |
7369
cd test/srt
74-
python3 run_suite.py --suite per-commit --auto-partition-id ${{ matrix.part }} --auto-partition-size 7
70+
python3 run_suite.py --suite per-commit --auto-partition-id ${{ matrix.part }} --auto-partition-size 8
7571
7672
unit-test-backend-2-gpu:
7773
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
@@ -82,8 +78,6 @@ jobs:
8278
uses: actions/checkout@v4
8379

8480
- name: Install dependencies
85-
env:
86-
FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer-python' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python' }}
8781
run: |
8882
bash scripts/ci_install_dependency.sh
8983
@@ -93,10 +87,10 @@ jobs:
         cd test/srt
         python3 run_suite.py --suite per-commit-2-gpu

-  performance-test-1-gpu-part-1:
+  unit-test-backend-8-gpu:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
       github.event.pull_request.draft == false
-    runs-on: 1-gpu-runner
+    runs-on: 8-gpu-runner
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -107,11 +101,30 @@ jobs:
         run: |
           bash scripts/ci_install_dependency.sh

+      - name: Run test
+        timeout-minutes: 40
+        run: |
+          cd test/srt
+          python3 run_suite.py --suite per-commit-8-gpu
+
+  performance-test-1-gpu-part-1:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
+      github.event.pull_request.draft == false
+    runs-on: 1-gpu-runner
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Install dependencies
+        run: |
+          bash scripts/ci_install_dependency.sh
+
       - name: Benchmark single latency
         timeout-minutes: 10
         run: |
           cd test/srt
-          python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_bs1
+          python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_bs1_small
+          python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_bs1_default

       - name: Benchmark online latency
         timeout-minutes: 10
@@ -146,8 +159,6 @@ jobs:
         uses: actions/checkout@v4

       - name: Install dependencies
-        env:
-          FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer-python' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python' }}
         run: |
           bash scripts/ci_install_dependency.sh

@@ -178,8 +189,6 @@ jobs:
         uses: actions/checkout@v4

       - name: Install dependencies
-        env:
-          FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer-python' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python' }}
         run: |
           bash scripts/ci_install_dependency.sh

@@ -216,8 +225,6 @@ jobs:
         uses: actions/checkout@v4

       - name: Install dependencies
-        env:
-          FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer-python' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python' }}
         run: |
           bash scripts/ci_install_dependency.sh
           git clone https://github.com/merrymercy/human-eval.git
@@ -239,8 +246,6 @@ jobs:
         uses: actions/checkout@v4

       - name: Install dependencies
-        env:
-          FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer-python' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python' }}
         run: |
           bash scripts/ci_install_dependency.sh
           git clone https://github.com/merrymercy/human-eval.git
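The workflow above shards the per-commit suite across matrix jobs with `--auto-partition-id ${{ matrix.part }} --auto-partition-size N` (raised from 7 to 8 in this commit). A minimal sketch of how such index-based partitioning can work — hypothetical, since `run_suite.py`'s real implementation is not shown in this diff:

```python
# Hypothetical sketch of CI test auto-partitioning (not run_suite.py's
# actual code). Each of the N matrix jobs takes every N-th test, so the
# shards are disjoint and together cover every test exactly once.
def auto_partition(tests, partition_id, partition_size):
    """Return the subset of tests assigned to one CI shard."""
    assert 0 <= partition_id < partition_size
    return tests[partition_id::partition_size]

tests = [f"test_{i}.py" for i in range(20)]
# Analogous to --auto-partition-id 0 --auto-partition-size 8:
shard0 = auto_partition(tests, 0, 8)
```

Growing the partition size (7 to 8) shrinks each shard, which is consistent with the per-shard timeout dropping from 40 to 30 minutes in the same hunk.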

.github/workflows/vllm-dependency-test.yml

Lines changed: 0 additions & 2 deletions
@@ -28,8 +28,6 @@ jobs:
         uses: actions/checkout@v4

       - name: Install dependencies
-        env:
-          FLASHINFER_REPO: 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python'
         run: |
           bash scripts/ci_install_dependency.sh
           pip install "vllm>=0.6.4.post1,<=0.7.2"

3rdparty/amd/tuning/benchmark_moe_rocm.py

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
     get_config_file_name,
 )

-padding_size = 128 if bool(int(os.getenv("MOE_PADDING", "0"))) else 0
+padding_size = 128 if bool(int(os.getenv("SGLANG_MOE_PADDING", "0"))) else 0


 def main(model, tp_size, dtype: str, batches):
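The rename only namespaces the flag (`MOE_PADDING` to `SGLANG_MOE_PADDING`); the parsing idiom is unchanged. A standalone illustration of that `bool(int(os.getenv(...)))` pattern — the helper function here is hypothetical, not part of the benchmark script:

```python
import os

# The flag is read as an integer string: "0" -> off, anything nonzero -> on.
# An unset variable falls back to the "0" default. Note that non-integer
# values such as "true" would raise ValueError, since int() rejects them.
def moe_padding_size(env=os.environ):
    """Return the MoE padding size selected by the env flag."""
    return 128 if bool(int(env.get("SGLANG_MOE_PADDING", "0"))) else 0
```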

Makefile

Lines changed: 2 additions & 1 deletion
@@ -20,7 +20,8 @@ FILES_TO_UPDATE = docker/Dockerfile.rocm \
 		python/pyproject.toml \
 		python/sglang/version.py \
 		docs/developer/setup_github_runner.md \
-		docs/start/install.md
+		docs/start/install.md \
+		benchmark/deepseek_v3/README.md

 update: ## Update version numbers across project files. Usage: make update <new_version>
 	@if [ -z "$(filter-out $@,$(MAKECMDGOALS))" ]; then \
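The `update` target rewrites the version string in every file listed in `FILES_TO_UPDATE`; this commit adds `benchmark/deepseek_v3/README.md` to that list. A hypothetical Python sketch of the same bulk-substitution idea — file names and version values below are made up for illustration, not the repo's actual rule:

```python
import tempfile
from pathlib import Path

def bump_version(paths, old, new):
    """Replace the old version string with the new one in each file;
    return the names of files that actually changed."""
    changed = []
    for path in paths:
        text = path.read_text()
        updated = text.replace(old, new)
        if updated != text:
            path.write_text(updated)
            changed.append(path.name)
    return changed

# Demo on a throwaway file standing in for an entry of FILES_TO_UPDATE.
demo = Path(tempfile.mkdtemp()) / "version.py"
demo.write_text('__version__ = "0.4.5"\n')
changed = bump_version([demo], "0.4.5", "0.4.6")
```

Returning only the files that changed mirrors why the list must be kept current: a file missing from `FILES_TO_UPDATE` is silently left on the old version.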

README.md

Lines changed: 3 additions & 3 deletions
@@ -43,7 +43,7 @@ SGLang is a fast serving framework for large language models and vision language
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
 The core features include:

-- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, and quantization (FP8/INT4/AWQ/GPTQ).
+- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
 - **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
 - **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, QWen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
 - **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
@@ -71,5 +71,5 @@ It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor

 For enterprises interested in adopting or deploying SGLang at scale, including technical consulting, sponsorship opportunities, or partnership inquiries, please contact us at contact@sglang.ai.

-## Acknowledgment and Citation
-We learned the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql). Please cite the paper, [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104), if you find the project useful.
+## Acknowledgment
+We learned the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
