[metrics] Add in queue metrics #4444

hebiao064 · 2025-03-15T01:11:49Z

Motivation

Same as #4412, which was closed because of some GitHub issues.

When serving LLMs at scale, understanding where time is spent during request processing is crucial for optimization. The current metrics don't provide enough granularity to identify specific bottlenecks in the request lifecycle.

Note about performance concern

We only emit these metrics when --enable-metrics is specified.

Future Work

If this approach is well-received, I plan to implement additional latency breakdowns for:

prefix_cache_lookup time

Metrics Result

# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:num_used_tokens The number of used tokens.
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:gen_throughput The generation throughput (token/s).
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:cache_hit_rate The prefix cache hit rate.
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:avg_request_queue_latency The average request queue latency.
# TYPE sglang:avg_request_queue_latency gauge
sglang:avg_request_queue_latency{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 2.9325485229492188e-05
# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.6962933540344238
sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.3",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.5",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.7",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="0.9",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="2.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="4.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="6.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="8.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="40.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="60.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="80.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="120.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="160.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_count{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.6962814331054688
sglang:e2e_request_latency_seconds_bucket{le="0.1",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.2",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.4",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="5.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="80.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="100.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="150.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="200.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="250.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="300.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="350.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="500.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1000.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_count{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
# HELP sglang:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE sglang:time_per_output_token_seconds histogram
sglang:time_per_output_token_seconds_sum{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.034814071655273435
sglang:time_per_output_token_seconds_bucket{le="0.002",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.005",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.01",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.02",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.03",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.04",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.05",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.06",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.07",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.08",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.09",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.1",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.15",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.2",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.3",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.4",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.6",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.8",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="1.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="2.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="+Inf",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_count{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 91.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 20.0
# HELP sglang:cached_tokens_total Number of cached prompt tokens.
# TYPE sglang:cached_tokens_total counter
sglang:cached_tokens_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:num_requests_total Number of requests processed.
# TYPE sglang:num_requests_total counter
sglang:num_requests_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
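For readers who want a single number per histogram, the `_sum` and `_count` series in the dump above are enough to recover a mean. The following is a minimal sketch in plain Python, not part of this PR: the helper name `histogram_means` is hypothetical, the `model_name` label is shortened to `"m"` for readability, and the parser assumes one label set per metric (it ignores labels entirely).

```python
from typing import Dict

# A few lines copied from the metrics output above; the model_name label
# is abbreviated to "m" here purely for readability (an assumption).
SAMPLE = """\
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="m"} 0.6962933540344238
sglang:time_to_first_token_seconds_count{model_name="m"} 1.0
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="m"} 0.6962814331054688
sglang:e2e_request_latency_seconds_count{model_name="m"} 1.0
"""


def histogram_means(text: str) -> Dict[str, float]:
    """Return mean = _sum / _count for every histogram found in the text."""
    sums: Dict[str, float] = {}
    counts: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)  # the sample value is the last field
        name = series.split("{", 1)[0]       # drop the label set
        if name.endswith("_sum"):
            sums[name[: -len("_sum")]] = float(value)
        elif name.endswith("_count"):
            counts[name[: -len("_count")]] = float(value)
    # Only report histograms that have observed at least one sample.
    return {n: sums[n] / counts[n] for n in sums if counts.get(n)}


means = histogram_means(SAMPLE)
for metric, mean in means.items():
    print(f"{metric}: mean {mean:.4f} s")
```

With a single request recorded, each mean simply equals the corresponding `_sum` value.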

Benchmark Result

python -m sglang.bench_one_batch --model-path /path/to/Llama-3.2-3B-Instruct --batch 1 --input-len 32768 --output-len 1 --quantization w8a8_fp8

# Before
Benchmark ...
Prefill. latency: 0.28452 s, throughput: 115169.06 token/s
Total. latency:  0.285 s, throughput: 115172.57 token/s

# After
Benchmark ...
Prefill. latency: 0.28434 s, throughput: 115242.84 token/s
Total. latency:  0.284 s, throughput: 115246.35 token/s

As expected, the instrumentation adds negligible overhead (within normal benchmark fluctuation). This confirms that the metrics collection doesn't impact performance while providing valuable insights.
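To put a number on "negligible", the relative change between the two runs can be computed directly from the prefill figures reported above. This is only a small arithmetic sketch; the helper name `relative_change` is hypothetical and the inputs are the values printed by the benchmark.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change of `after` with respect to `before`."""
    return (after - before) / before


# Prefill numbers taken from the Before / After benchmark output above.
prefill_latency = relative_change(before=0.28452, after=0.28434)
prefill_throughput = relative_change(before=115169.06, after=115242.84)

print(f"prefill latency change:    {prefill_latency:+.3%}")
print(f"prefill throughput change: {prefill_throughput:+.3%}")
```

Both changes are below 0.1% in magnitude, and the "After" run is marginally faster, which is consistent with run-to-run noise rather than a measurable cost.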

Modifications

This PR takes a minimalist approach by focusing only on queue latency as a first step. We're setting queue_time_start when requests enter the queue and queue_time_end when they're selected for processing, then calculating the average latency across all requests in a batch.
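The idea described above can be sketched as follows. This is an illustrative approximation, not the code in this PR: the `Req` dataclass, the `enqueue`, `select_batch` and `avg_queue_latency` helpers, the `enable_metrics` argument, and the use of `time.perf_counter()` as the clock are all assumptions made for the example. Only the `queue_time_start` and `queue_time_end` field names come from the description.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Req:
    """Hypothetical stand-in for a scheduler request; only the timestamps matter."""
    rid: str
    queue_time_start: Optional[float] = None
    queue_time_end: Optional[float] = None


def enqueue(waiting_queue: List[Req], req: Req, enable_metrics: bool) -> None:
    """Stamp the request when it enters the waiting queue (only if metrics are on)."""
    if enable_metrics:
        req.queue_time_start = time.perf_counter()
    waiting_queue.append(req)


def select_batch(waiting_queue: List[Req], max_batch: int, enable_metrics: bool) -> List[Req]:
    """Pop up to `max_batch` requests and stamp the moment they leave the queue."""
    batch = waiting_queue[:max_batch]
    del waiting_queue[:max_batch]
    if enable_metrics:
        now = time.perf_counter()
        for req in batch:
            req.queue_time_end = now
    return batch


def avg_queue_latency(batch: List[Req]) -> float:
    """Average time spent in the queue across the batch, in seconds."""
    latencies = [
        r.queue_time_end - r.queue_time_start
        for r in batch
        if r.queue_time_start is not None and r.queue_time_end is not None
    ]
    return sum(latencies) / len(latencies) if latencies else 0.0


queue: List[Req] = []
enqueue(queue, Req("a"), enable_metrics=True)
enqueue(queue, Req("b"), enable_metrics=True)
batch = select_batch(queue, max_batch=2, enable_metrics=True)
print(f"avg queue latency: {avg_queue_latency(batch):.6f} s")
```

When metrics are disabled, no timestamps are taken and the average falls back to 0.0, so the extra work is limited to runs that opt in.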

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

hebiao064 · 2025-03-16T04:57:29Z

@merrymercy @zhyncs friendly ping on review

hebiao064 · 2025-03-17T20:29:29Z

@zhyncs @merrymercy friendly ping, or pls let me know if this PR is not good, thanks!

This PR passed all tests in one trial, so it should be good.

hebiao064 · 2025-03-30T06:56:45Z

discard it since no review

xiezhq-hermann · 2025-03-30T20:06:36Z

@hebiao064 really sorry for the long delay! I think this is a great addition. Would you mind writing a test for the new metrics and providing some brief logs with this enabled?

hebiao064 · 2025-03-31T00:58:49Z

@xiezhq-hermann thanks, I've added the logs in the PR Description and I don't see any test about metrics...

xiezhq-hermann · 2025-04-04T00:36:37Z

Hi @hebiao064, would you mind resolving my comments on the code so we can get this merged soon?


Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn> Co-authored-by: Ravi Theja <ravi03071991@gmail.com> Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: Tommy Yang <tommyyang0524@gmail.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: tianhaoyu <thy@mail.ecust.edu.cn> Co-authored-by: DarkSharpness <2040703891@qq.com> Co-authored-by: Yun Dai <yundai424@gmail.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com> Co-authored-by: wunhuang <wunhuang@amd.com> Co-authored-by: root <root@dell300x-pla-t10-17.pla.dcgpu> Co-authored-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: saienduri <saimanas.enduri@amd.com> Co-authored-by: DangKai <dangkai4u@outlook.com> Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com> Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: Ma Mingfei <mingfei.ma@intel.com> Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com> Co-authored-by: YanbingJiang <yanbing.jiang@intel.com> Co-authored-by: blzheng <beilei.zheng@intel.com> Co-authored-by: Byron Hsu <byronhsu1230@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: grimoire <streetyao@live.com> Co-authored-by: HandH1998 <1335248067@qq.com> 
Co-authored-by: Zhaoyang Hao <77828610+Muuuchen@users.noreply.github.com> Co-authored-by: Teng Ma <805522925@qq.com> Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com> Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com> Co-authored-by: Richard Zou <zou3519@users.noreply.github.com> Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com> Co-authored-by: Michael Yao <haifeng.yao@daocloud.io> Co-authored-by: Yusong Gao <yusong.gao@icloud.com> Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com> Co-authored-by: lambert0312 <lambert80.ios@gmail.com> Co-authored-by: tianlian yi <91449279+yitianlian@users.noreply.github.com> Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com> Co-authored-by: yulei <yuulei12@gmail.com> Co-authored-by: Yongtong Wu <914554688@qq.com> Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com> Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com> Co-authored-by: Yangcheng Li <bluebluelitchi@hotmail.com> Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: ybyang <ybyang7@iflytek.com> Co-authored-by: mRSun15 <3150105645@zju.edu.cn> Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com> Co-authored-by: Yuhao Yang <yyh073@foxmail.com>

hebiao064 added 3 commits March 14, 2025 18:21

Add In Queue Latency

df9a0c8

Add In queue Latency

cf53ad8

Fix Comment

c982f0d

hebiao064 requested review from merrymercy, Ying1123, hnyls2002, zhyncs, ispobock and ByronHsu as code owners March 15, 2025 01:11

hebiao064 added 3 commits March 14, 2025 18:13

Merge branch 'main' into add_in_queue_metrics

213493f

Merge branch 'main' into add_in_queue_metrics

fe2907a

Merge branch 'main' into add_in_queue_metrics

fd8f58a

Merge branch 'main' into add_in_queue_metrics

a9f8a6e

hebiao064 requested a review from xiezhq-hermann as a code owner March 17, 2025 06:01

Merge branch 'main' into add_in_queue_metrics

7ed72dd

hebiao064 added 3 commits March 17, 2025 22:51

Merge branch 'main' into add_in_queue_metrics

3e99215

Merge branch 'main' into add_in_queue_metrics

1070d4d

Merge branch 'main' into add_in_queue_metrics

3745b14

hebiao064 closed this Mar 30, 2025

xiezhq-hermann reopened this Mar 30, 2025

Merge branch 'main' into add_in_queue_metrics

3126e28

Merge branch 'main' into add_in_queue_metrics

5e50267

hebiao064 commented Mar 31, 2025

python/sglang/srt/managers/scheduler.py

xiezhq-hermann mentioned this pull request Mar 31, 2025

[metrics] Add req metrics #4946

Closed

6 tasks

Merge branch 'main' into add_in_queue_metrics

c3c2d1e

Merge branch 'main' into add_in_queue_metrics

c729e6c

xiezhq-hermann reviewed Apr 6, 2025

python/sglang/srt/managers/scheduler.py (outdated)

python/sglang/srt/managers/scheduler.py

python/sglang/srt/metrics/collector.py (outdated)

hebiao064 added 4 commits April 5, 2025 19:18

Merge branch 'main' into add_in_queue_metrics

cfd7abb

Merge branch 'main' into add_in_queue_metrics

ff573b2

address comment

86fbd5a

address comment

c0f8bc5

xiezhq-hermann self-requested a review April 9, 2025 04:57

xiezhq-hermann reviewed Apr 9, 2025

python/sglang/srt/managers/scheduler.py (outdated)

hebiao064 and others added 2 commits April 9, 2025 05:14

fix

c8d6cc7

Merge branch 'main' into add_in_queue_metrics

31960af

xiezhq-hermann approved these changes Apr 9, 2025

xiezhq-hermann reviewed Apr 9, 2025

python/sglang/srt/managers/schedule_policy.py (outdated)

hebiao064 added 5 commits April 9, 2025 22:52

fix

5a92a3d

Merge branch 'add_in_queue_metrics' of https://github.com/hebiao064/sglang into add_in_queue_metrics

50b0ac4

Merge branch 'main' into add_in_queue_metrics

4f0e21a

fix

3a17e9e

Merge branch 'add_in_queue_metrics' of https://github.com/hebiao064/sglang into add_in_queue_metrics

2de6f05

zhyncs merged commit 5db37c8 into sgl-project:main Apr 10, 2025
18 of 23 checks passed

finger92 pushed a commit to protagolabs/sglang that referenced this pull request Apr 10, 2025

[metrics] Add in queue metrics (sgl-project#4444)

db9d215

zhyncs mentioned this pull request Apr 10, 2025

Update deps for mllama4 #5215

Merged

thyecust pushed a commit to thyecust/sglang that referenced this pull request Apr 11, 2025

[metrics] Add in queue metrics (sgl-project#4444)

414a840

jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025

[metrics] Add in queue metrics (sgl-project#4444)

864e984
