Description
Test Environment:
SGLang version: 0.4.4.post1
Flashinfer version: 0.2.3+cu124torch2.5
Hardware: 2 nodes of H20 (8 × H20 96 GiB each)
Model: DeepSeek-R1
Model Max Length: 3200 (modified in both the base model's and NextN's tokenizer_config.json)
CUDA Version: 12.4
Operating System: Ubuntu SMP Fri Mar 18 12:42:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Test bench: jmeter
Avg input length = 912 tokens
Avg output length = 2174 tokens
Engine arguments:

```shell
python -m sglang.launch_server \
  --model-path <YOUR_MODEL_DIR> \
  --tp 16 \
  --dist-init-addr <YOUR_ADDR> \
  --nnodes 2 \
  --node-rank <YOUR_NODE_RANK> \
  --trust-remote-code \
  --max-running-requests 1024 \
  --speculative-algorithm NEXTN \
  --speculative-draft <YOUR_NEXTN_MODEL_DIR> \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-torch-compile \
  --enable-flashinfer-mla \
  --mem-fraction-static 0.7 \
  --host <YOUR_HOST_IP> \
  --port <YOUR_HOST_PORT> \
  --schedule-conservativeness 0.01
```
Result:
Per-request output throughput: 13.98 tokens/s (client concurrency = 10)

This is slower than the result reported in #4616. What can I do to increase the speed?
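For context, the per-request output throughput above is just generated tokens divided by request wall-clock time, so the reported averages imply a fairly long per-request latency. A minimal sketch (the function name is my own, and the numbers are the averages from this report, not new measurements):

```python
# Hypothetical helper: per-request output throughput is the number of
# generated tokens divided by the request's wall-clock latency.
def per_request_throughput(output_tokens: int, latency_s: float) -> float:
    return output_tokens / latency_s

# With an average of 2174 output tokens at 13.98 tok/s, the implied
# average request latency is about 2174 / 13.98 ≈ 155.5 seconds.
implied_latency_s = 2174 / 13.98
print(round(implied_latency_s, 1))  # ≈ 155.5
```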