
Register Fp4 allgather with NCCL symmetric memory #9358


Draft · wants to merge 2 commits into main

Conversation

nvcastet (Collaborator)

Motivation

~5% end-to-end performance gain on the serving benchmark below.

Depends on #8934

After this PR:

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 --enable-flashinfer-cutlass-moe --enable-ep-moe --ep-size 8 --dp 8 --enable-dp-attention --chunked-prefill-size 16384 --mem-fraction-static 0.85 --max-running-requests 4096 --stream-interval 5 --enable-dp-lm-head --attention-backend trtllm_mla --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 --disable-radix-cache  --enable-symm-mem

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 1024 --random-input 1024 --random-output 2048 --random-range-ratio 1 --warmup-request 1024 --max-concurrency 1024
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  99.23
Total input tokens:                      1048576
Total generated tokens:                  2097152
Total generated tokens (retokenized):    2088173
Request throughput (req/s):              10.32
Input token throughput (tok/s):          10566.84
Output token throughput (tok/s):         21133.69
Total token throughput (tok/s):          31700.53
Concurrency:                             1022.32
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   99069.51
Median E2E Latency (ms):                 99077.61
---------------Time to First Token----------------
Mean TTFT (ms):                          8812.57
Median TTFT (ms):                        8655.88
P99 TTFT (ms):                           15737.66
---------------Inter-Token Latency----------------
Mean ITL (ms):                           44.09
Median ITL (ms):                         40.47
P95 ITL (ms):                            44.81
P99 ITL (ms):                            46.93
Max ITL (ms):                            2788.35
==================================================

Before:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  104.55
Total input tokens:                      1048576
Total generated tokens:                  2097152
Total generated tokens (retokenized):    2090674
Request throughput (req/s):              9.79
Input token throughput (tok/s):          10029.54
Output token throughput (tok/s):         20059.08
Total token throughput (tok/s):          30088.62
Concurrency:                             1022.47
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   104392.08
Median E2E Latency (ms):                 104404.71
---------------Time to First Token----------------
Mean TTFT (ms):                          8784.95
Median TTFT (ms):                        8615.13
P99 TTFT (ms):                           15843.36
---------------Inter-Token Latency----------------
Mean ITL (ms):                           46.71
Median ITL (ms):                         42.99
P95 ITL (ms):                            48.36
P99 ITL (ms):                            51.30
Max ITL (ms):                            2754.55
==================================================

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@@ -1306,7 +1320,15 @@ def apply(
            tune_max_num_tokens=next_power_of_2(x.shape[0]),
        )[0]
        if moe_runner_config.routed_scaling_factor is not None:
            output *= moe_runner_config.routed_scaling_factor
        with use_symmetric_memory(
trevor-m (Collaborator) · Aug 20, 2025:


Can we allocate a symm buffer and pass it as the output argument to flashinfer_cutlass_fused_moe() instead?
This multiply will be removed in #8690.
Also, *= is done in-place, so the symmetric buffer will still be used until then.
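
A minimal sketch of what that suggestion could look like, assuming the fused-MoE entry point accepts a preallocated output buffer. The output= keyword, fused_moe_kwargs, and the function wrapper are illustrative only (not the confirmed flashinfer API); use_symmetric_memory, get_tp_group, and is_max_padding are the names used in the diff above.

import torch

def fused_moe_into_symm_buffer(x, hidden_size, output_dtype,
                               moe_runner_config, fused_moe_kwargs):
    # Sketch only: allocate the MoE output directly in NCCL symmetric memory
    # so the later FP4 allgather can take the registered path, instead of
    # copying into a symmetric buffer after the kernel has run.
    with use_symmetric_memory(get_tp_group(), disabled=not is_max_padding()) as sm:
        symm_output = torch.empty(
            (x.shape[0], hidden_size), dtype=output_dtype, device=x.device
        )
    # Assumption: the kernel can write into a caller-provided output buffer.
    output = flashinfer_cutlass_fused_moe(output=symm_output, **fused_moe_kwargs)[0]
    if moe_runner_config.routed_scaling_factor is not None:
        # In-place multiply keeps `output` pointing at the symmetric buffer
        # (per the review note; the multiply is removed by #8690 anyway).
        output *= moe_runner_config.routed_scaling_factor
    return output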

nvcastet (Collaborator, Author):

Yes thanks for catching that, let me do that.

        with use_symmetric_memory(
            get_tp_group(), disabled=not is_max_padding()
        ) as sm:
            symm_output = torch.empty_like(x)
Collaborator:

When x is quantized, this will have the wrong dtype and shape (since the hidden size will be halved). We will want to use output_dtype and maybe store the original x_col.
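
For illustration, a hedged sketch of the fix being described; x_col and output_dtype follow the wording of the comment, and the actual change lives in #8934 and may differ.

# Capture the logical activation shape/dtype before FP4 quantization:
# after quantization, x's last dim is packed (roughly half the hidden size)
# and its dtype is the packed storage type, so torch.empty_like(x) is wrong.
x_col = x.shape[1]        # original hidden size (captured pre-quantization)
output_dtype = x.dtype    # original activation dtype, e.g. torch.bfloat16

# ... x is quantized to FP4 here (x now has the packed shape/dtype) ...

with use_symmetric_memory(get_tp_group(), disabled=not is_max_padding()) as sm:
    symm_output = torch.empty(
        (x.shape[0], x_col), dtype=output_dtype, device=x.device
    )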

nvcastet (Collaborator, Author):

Yes, I fixed that in #8934. Thanks.
