
Register Fp4 allgather with NCCL symmetric memory #9358


Draft · wants to merge 2 commits into main

Conversation

nvcastet (Collaborator)

Motivation

~5% end-to-end performance gain on the serving benchmark below.

Depends on #8934

After this PR:

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 --enable-flashinfer-cutlass-moe --enable-ep-moe --ep-size 8 --dp 8 --enable-dp-attention --chunked-prefill-size 16384 --mem-fraction-static 0.85 --max-running-requests 4096 --stream-interval 5 --enable-dp-lm-head --attention-backend trtllm_mla --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 --disable-radix-cache  --enable-symm-mem

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 1024 --random-input 1024 --random-output 2048 --random-range-ratio 1 --warmup-request 1024 --max-concurrency 1024
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  99.23
Total input tokens:                      1048576
Total generated tokens:                  2097152
Total generated tokens (retokenized):    2088173
Request throughput (req/s):              10.32
Input token throughput (tok/s):          10566.84
Output token throughput (tok/s):         21133.69
Total token throughput (tok/s):          31700.53
Concurrency:                             1022.32
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   99069.51
Median E2E Latency (ms):                 99077.61
---------------Time to First Token----------------
Mean TTFT (ms):                          8812.57
Median TTFT (ms):                        8655.88
P99 TTFT (ms):                           15737.66
---------------Inter-Token Latency----------------
Mean ITL (ms):                           44.09
Median ITL (ms):                         40.47
P95 ITL (ms):                            44.81
P99 ITL (ms):                            46.93
Max ITL (ms):                            2788.35
==================================================

Before:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  104.55
Total input tokens:                      1048576
Total generated tokens:                  2097152
Total generated tokens (retokenized):    2090674
Request throughput (req/s):              9.79
Input token throughput (tok/s):          10029.54
Output token throughput (tok/s):         20059.08
Total token throughput (tok/s):          30088.62
Concurrency:                             1022.47
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   104392.08
Median E2E Latency (ms):                 104404.71
---------------Time to First Token----------------
Mean TTFT (ms):                          8784.95
Median TTFT (ms):                        8615.13
P99 TTFT (ms):                           15843.36
---------------Inter-Token Latency----------------
Mean ITL (ms):                           46.71
Median ITL (ms):                         42.99
P95 ITL (ms):                            48.36
P99 ITL (ms):                            51.30
Max ITL (ms):                            2754.55
==================================================

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@@ -1306,7 +1320,15 @@ def apply(
            tune_max_num_tokens=next_power_of_2(x.shape[0]),
        )[0]
        if moe_runner_config.routed_scaling_factor is not None:
            output *= moe_runner_config.routed_scaling_factor
        with use_symmetric_memory(
trevor-m (Collaborator) · Aug 20, 2025:


Can we allocate a symm buffer and pass it as the output argument to flashinfer_cutlass_fused_moe() instead?
This multiply will be removed in #8690.
Also, *= is done in-place, so the symmetric buffer will still be used until then.
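
A minimal sketch of what that suggestion could look like, assuming the fused-MoE entry point accepts a preallocated output buffer. The output= keyword, fused_moe_kwargs, and the function wrapper are illustrative only (not the confirmed flashinfer API); use_symmetric_memory, get_tp_group, and is_max_padding are the names used in the diff above.

import torch

def fused_moe_into_symm_buffer(x, hidden_size, output_dtype,
                               moe_runner_config, fused_moe_kwargs):
    # Sketch only: allocate the MoE output directly in NCCL symmetric memory
    # so the later FP4 allgather can take the registered path, instead of
    # copying into a symmetric buffer after the kernel has run.
    with use_symmetric_memory(get_tp_group(), disabled=not is_max_padding()) as sm:
        symm_output = torch.empty(
            (x.shape[0], hidden_size), dtype=output_dtype, device=x.device
        )
    # Assumption: the kernel can write into a caller-provided output buffer.
    output = flashinfer_cutlass_fused_moe(output=symm_output, **fused_moe_kwargs)[0]
    if moe_runner_config.routed_scaling_factor is not None:
        # In-place multiply keeps `output` pointing at the symmetric buffer
        # (per the review note; the multiply is removed by #8690 anyway).
        output *= moe_runner_config.routed_scaling_factor
    return output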

nvcastet (Collaborator, Author):

Yes thanks for catching that, let me do that.

        with use_symmetric_memory(
            get_tp_group(), disabled=not is_max_padding()
        ) as sm:
            symm_output = torch.empty_like(x)
Collaborator:

When x is quantized, this will have the wrong dtype and shape (since the hidden size will be halved). We will want to use output_dtype and maybe store the original x_col.
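
For illustration, a hedged sketch of the fix being described; x_col and output_dtype follow the wording of the comment, and the actual change lives in #8934 and may differ.

# Capture the logical activation shape/dtype before FP4 quantization:
# after quantization, x's last dim is packed (roughly half the hidden size)
# and its dtype is the packed storage type, so torch.empty_like(x) is wrong.
x_col = x.shape[1]        # original hidden size (captured pre-quantization)
output_dtype = x.dtype    # original activation dtype, e.g. torch.bfloat16

# ... x is quantized to FP4 here (x now has the packed shape/dtype) ...

with use_symmetric_memory(get_tp_group(), disabled=not is_max_padding()) as sm:
    symm_output = torch.empty(
        (x.shape[0], x_col), dtype=output_dtype, device=x.device
    )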

nvcastet (Collaborator, Author):

Yes, I fixed that in #8934. Thanks.
