@vhain vhain commented Mar 27, 2025

Motivation

The current main branch has broken dependencies. The following packages are not declared as dependencies (in pyproject.toml) but are required to run srt (via sglang.launch_server):

Although this PR does not tackle all the broken dependencies, it fixes two easy ones.
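
To check which of the modules named in this PR are missing from a given environment, something like the following can help. It is only a sketch: the module list is copied from this description, and `find_missing` is a hypothetical helper, not part of sglang.

```python
import importlib.util

# Module names taken from this PR description; adjust as needed.
CANDIDATES = [
    "torchvision",
    "gguf",
    "compressed_tensors",
    "openai",
    "partial_json_parser",
    "einops",
]


def find_missing(modules):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing:", find_missing(CANDIDATES))
```

Running it in a fresh virtualenv with only the declared srt dependencies installed should list whichever modules are still undeclared.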

Modifications

This PR updates the code that uses torchvision and gguf so that both are imported lazily.
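
For readers unfamiliar with the pattern, a lazy import defers the `import` from module load time to the point of first use. The snippet below is a minimal sketch of that idea, not the actual change in this PR; `lazy_import` and its error message are hypothetical.

```python
import importlib


def lazy_import(module_name: str, purpose: str):
    """Import `module_name` only when called, with a clearer error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Hypothetical message: tell the user which feature needs the package.
        raise ImportError(
            f"{module_name} is required for {purpose}; please install it to use this feature."
        ) from exc


def needs_optional_dependency():
    # The import happens here, at call time, so merely importing this file
    # does not fail when the optional package is not installed.
    gguf = lazy_import("gguf", "loading GGUF checkpoints")
    return gguf
```

With this shape, launching the server works without the optional package, and the error only surfaces when the code path that needs it is actually reached.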

Next Steps

Something needs to be done about the remaining ones:

  • compressed_tensors - I saw an ongoing discussion in [Fix] Add compressed_tensors as deps #4819 (comment)
  • openai - I think we could define our own BadRequestError class (or equivalent) and stop using this type, since srt has only a single usage of openai:
  • partial_json_parser - I think we could include this as an srt dependency, as it is used in the base detector class for tool (function) calling.
  • einops - I think this could be included as an srt dependency, as it is widely used. Or we could make it optional (lazy import), since it seems to be required only for vision models?
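
For the openai item, a local exception could look roughly like this. It is a sketch under assumptions: the attribute names (`status_code`, `message`, `param`) are illustrative and do not mirror the openai type or what srt ends up using.

```python
from typing import Optional


class BadRequestError(Exception):
    """Hypothetical local stand-in for an HTTP 400 error, with no openai import."""

    status_code = 400

    def __init__(self, message: str, *, param: Optional[str] = None):
        super().__init__(message)
        self.message = message
        self.param = param  # name of the offending request field, if known


def validate_max_tokens(max_tokens: int) -> int:
    # Hypothetical call site showing how the error would be raised.
    if max_tokens <= 0:
        raise BadRequestError("max_tokens must be positive", param="max_tokens")
    return max_tokens
```

A request handler can then catch `BadRequestError` and map `status_code` to the HTTP response without pulling in openai as a runtime dependency.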

Checklist

@vhain vhain changed the title from "deps: lazy load optional dependencies gguf and torchvision" to "deps: lazy import optional dependencies gguf and torchvision" Mar 27, 2025

zhyncs commented Mar 27, 2025

partial_json_parser - I think we could include this as an srt dependency, as it is used in the base detector class for tool (function) calling.

einops - I think this could be included as an srt dependency, as it is widely used. Or we could make it optional (lazy import), since it seems to be required only for vision models?

agree @vhain

@zhyncs zhyncs mentioned this pull request Mar 27, 2025
6 tasks

zhyncs commented Mar 27, 2025

@vhain please merge the latest main

@vhain vhain force-pushed the ryan/optional-deps/gguf_torchvision branch from acd58ba to 8c4bee4 Compare March 27, 2025 20:31

zhyncs commented Mar 27, 2025

srt git:(ryan/optional-deps/gguf_torchvision) python3 -m unittest test_vision_openai_server.TestGemma3itServer
command=python3 -m sglang.launch_server --model-path google/gemma-3-4b-it --trust-remote-code --chat-template gemma-it --host 127.0.0.1 --port 2157
[2025-03-27 21:33:10] server_args=ServerArgs(model_path='google/gemma-3-4b-it', tokenizer_path='google/gemma-3-4b-it', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='google/gemma-3-4b-it', chat_template='gemma-it', completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=2157, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=781419702, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, 
disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
[2025-03-27 21:33:10] For Gemma 3, we downcast float32 to bfloat16 instead of float16 by default. Please specify `dtype` if you want to use float16.
[2025-03-27 21:33:10] Downcasting torch.float32 to torch.bfloat16.
[2025-03-27 21:33:10] The following error message 'operation scheduled before its operands' can be ignored.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
[2025-03-27 21:33:18] Use chat template for the OpenAI-compatible API server: gemma-it
[2025-03-27 21:33:20 TP0] For Gemma 3, we downcast float32 to bfloat16 instead of float16 by default. Please specify `dtype` if you want to use float16.
[2025-03-27 21:33:20 TP0] Downcasting torch.float32 to torch.bfloat16.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
[2025-03-27 21:33:23 TP0] Overlap scheduler is disabled for multimodal models.
[2025-03-27 21:33:23 TP0] For Gemma 3, we downcast float32 to bfloat16 instead of float16 by default. Please specify `dtype` if you want to use float16.
[2025-03-27 21:33:23 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-03-27 21:33:23 TP0] Automatically reduce --mem-fraction-static to 0.836 because this is a multimodal model.
[2025-03-27 21:33:23 TP0] Init torch distributed begin.
[2025-03-27 21:33:26 TP0] Failed to import pynvml with ModuleNotFoundError("No module named 'pynvml'")
[2025-03-27 21:33:26 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-03-27 21:33:26 TP0] Load weight begin. avail mem=139.21 GB
[2025-03-27 21:33:26 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-27 21:33:28 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.25s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.21s/it]

[2025-03-27 21:33:30 TP0] Load weight end. type=Gemma3ForConditionalGeneration, dtype=torch.bfloat16, avail mem=131.08 GB, mem usage=8.13 GB.
[2025-03-27 21:33:31 TP0] KV Cache is allocated. #tokens: 834615, K size: 54.12 GB, V size: 54.12 GB
[2025-03-27 21:33:31 TP0] Memory pool end. avail mem=14.61 GB
2025-03-27 21:33:31,320 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-03-27 21:33:31 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=13.95 GB
Capturing batches (avail_mem=12.55 GB):   0%|                                                                                                                       | 0/23 [00:00<?, ?it/s]2025-03-27 21:33:31,890 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_256_head_dim_vo_256_posenc_0_use_swa_False_use_logits_cap_False
2025-03-27 21:33:31,913 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_256_head_dim_vo_256_posenc_0_use_swa_False_use_logits_cap_False
Capturing batches (avail_mem=9.72 GB): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:08<00:00,  2.59it/s]
[2025-03-27 21:33:40 TP0] Capture cuda graph end. Time elapsed: 8.91 s. avail mem=9.66 GB. mem usage=4.29 GB.
/sgl-workspace/sglang/python/sglang/srt/utils.py:823: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
  tensor_data = torch.ByteTensor(
[2025-03-27 21:33:44 TP0] max_total_num_tokens=834615, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=1048576
[2025-03-27 21:33:45] INFO:     Started server process [43750]
[2025-03-27 21:33:45] INFO:     Waiting for application startup.
[2025-03-27 21:33:45] INFO:     Application startup complete.
[2025-03-27 21:33:45] INFO:     Uvicorn running on http://127.0.0.1:2157 (Press CTRL+C to quit)
[2025-03-27 21:33:46] INFO:     127.0.0.1:46746 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-03-27 21:33:46 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
2025-03-27 21:33:46,416 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_256_head_dim_vo_256_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-03-27 21:33:46,442 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_256_head_dim_vo_256_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-03-27 21:33:46] INFO:     127.0.0.1:46748 - "POST /generate HTTP/1.1" 200 OK
[2025-03-27 21:33:46] The server is fired up and ready to roll!
[2025-03-27 21:33:53 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-27 21:33:54] INFO:     127.0.0.1:47404 - "GET /health_generate HTTP/1.1" 200 OK
.[2025-03-27 21:33:54 TP0] Prefill batch. #new-seq: 1, #new-token: 23, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-27 21:33:54 TP0] Prefill batch. #new-seq: 1, #new-token: 283, #cached-token: 1, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-03-27 21:33:55 TP0] Prefill batch. #new-seq: 2, #new-token: 273, #cached-token: 295, token usage: 0.00, #running-req: 2, #queue-req: 0,
[2025-03-27 21:33:55 TP0] Number of images does not match number of special image tokens in the input text. Got 256 image tokens in the text but 512 tokens from image embeddings.
[2025-03-27 21:33:55] INFO:     127.0.0.1:47450 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:55] INFO:     127.0.0.1:47416 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:55 TP0] Decode batch. #running-req: 2, #token: 345, token usage: 0.00, gen throughput (token/s): 9.21, #queue-req: 0,
[2025-03-27 21:33:55 TP0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 306, token usage: 0.00, #running-req: 2, #queue-req: 0,
[2025-03-27 21:33:55] INFO:     127.0.0.1:47420 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:55] INFO:     127.0.0.1:47434 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:55 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 283, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-03-27 21:33:55] INFO:     127.0.0.1:47452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:55] INFO:     127.0.0.1:47458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:55 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 283, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-03-27 21:33:55 TP0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 306, token usage: 0.00, #running-req: 2, #queue-req: 0,
[2025-03-27 21:33:56] INFO:     127.0.0.1:47480 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:56 TP0] Decode batch. #running-req: 3, #token: 346, token usage: 0.00, gen throughput (token/s): 179.04, #queue-req: 0,
[2025-03-27 21:33:56] INFO:     127.0.0.1:47482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:56] INFO:     127.0.0.1:47472 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:56] INFO:     127.0.0.1:47496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:56 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 283, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-27 21:33:56 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 23, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-03-27 21:33:56 TP0] Decode batch. #running-req: 2, #token: 328, token usage: 0.00, gen throughput (token/s): 129.39, #queue-req: 0,
[2025-03-27 21:33:56] INFO:     127.0.0.1:47508 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-27 21:33:56] INFO:     127.0.0.1:47512 - "POST /v1/chat/completions HTTP/1.1" 200 OK
.[2025-03-27 21:33:57 TP0] Prefill batch. #new-seq: 1, #new-token: 555, #cached-token: 12, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-27 21:33:57 TP0] Decode batch. #running-req: 1, #token: 596, token usage: 0.00, gen throughput (token/s): 53.99, #queue-req: 0,
[2025-03-27 21:33:57 TP0] Decode batch. #running-req: 1, #token: 636, token usage: 0.00, gen throughput (token/s): 100.52, #queue-req: 0,
[2025-03-27 21:33:57] INFO:     127.0.0.1:47526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
------------------------------
Multi images response:
Here's a description of each image:

**Image 1:** A man is ironing clothes on a portable ironing board while standing on the roof of a yellow taxi in a busy New York City street.

**Image 2:** The logo for "SGL" features a stylized branching tree icon connected to a small square containing code symbols.
------------------------------
.[2025-03-27 21:33:58 TP0] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 284, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-27 21:33:58] INFO:     127.0.0.1:47532 - "POST /v1/chat/completions HTTP/1.1" 200 OK
..[2025-03-27 21:33:58 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 283, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-27 21:33:58 TP0] Decode batch. #running-req: 1, #token: 308, token usage: 0.00, gen throughput (token/s): 58.87, #queue-req: 0,
[2025-03-27 21:33:58] INFO:     127.0.0.1:47540 - "POST /v1/chat/completions HTTP/1.1" 200 OK
..
----------------------------------------------------------------------
Ran 7 tests in 55.491s

OK

It works well for me locally.

@zhyncs zhyncs merged commit 188105a into sgl-project:main Mar 27, 2025
25 of 36 checks passed
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025