
Use the same Test Environment and same Engine arguments as #4616, but the speed is slow #4649

@mengli0

Description


Test Environment:

SGLang version: 0.4.4.post1
Flashinfer version: 0.2.3+cu124torch2.5
Hardware: 2 nodes of H20 (8 × H20 96 GiB each)
Model: DeepSeek-R1
Model Max Length: 3200 (modified in both the model's and NextN's tokenizer_config.json)
CUDA Version: 12.4
Operating System: Ubuntu SMP Fri Mar 18 12:42:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Test bench: jmeter
Avg input length = 912 tokens
Avg output length = 2174 tokens

Engine arguments:
python -m sglang.launch_server --model-path <YOUR_MODEL_DIR> --tp 16 \
  --dist-init-addr <YOUR_ADDR> --nnodes 2 --node-rank <YOUR_NODE_RANK> \
  --trust-remote-code --max-running-requests 1024 \
  --speculative-algorithm NEXTN --speculative-draft <YOUR_NEXTN_MODEL_DIR> \
  --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --enable-torch-compile --enable-flashinfer-mla --mem-fraction-static 0.7 \
  --host <YOUR_HOST_IP> --port <YOUR_HOST_PORT> --schedule-conservativeness 0.01
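For context on the speculative settings above (`--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`): with a chain draft (top-k = 1), the decode speedup depends mostly on the draft acceptance rate. A rough sketch of the expected tokens emitted per verification step, assuming each draft token is accepted independently with rate `alpha` (an illustrative model, not SGLang internals):

```python
def expected_tokens_per_step(alpha: float, num_draft_tokens: int = 4) -> float:
    """Expected tokens emitted per verification step in chain speculative
    decoding: the target model always emits at least one token, and each
    successive draft token survives with probability alpha (assumed i.i.d.).
    E = 1 + alpha + alpha^2 + ... + alpha^(num_draft_tokens - 1)."""
    return sum(alpha ** i for i in range(num_draft_tokens))

# With num_draft_tokens=4: perfect acceptance gives 4x tokens per step,
# zero acceptance degenerates to plain autoregressive decoding (1 token).
print(expected_tokens_per_step(1.0))  # 4.0
print(expected_tokens_per_step(0.0))  # 1.0
```

If the NextN draft model's acceptance rate is low for your workload, the extra draft/verify work can erase the speedup, which is one common reason two setups with identical engine arguments see different throughput.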

Result:
Per-request Output Throughput (token/s): 13.98 tokens/s (client concurrency: 10)

This result is slower than the one reported in #4616. What can I do to increase the speed?
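For reproducibility, "per-request output throughput" here is output tokens divided by wall-clock generation time for each request, averaged across requests. A minimal sketch of the metric (function and sample numbers are illustrative, not jmeter output):

```python
def per_request_throughput(requests: list[tuple[int, float]]) -> float:
    """Average per-request output throughput in tokens/s.

    Each element of `requests` is (output_tokens, latency_seconds);
    the rate is computed per request, then averaged."""
    rates = [tokens / latency for tokens, latency in requests]
    return sum(rates) / len(rates)

# Example: 2174 output tokens in ~155.5 s of generation time
print(round(per_request_throughput([(2174, 155.5)]), 2))  # 13.98
```

Posting per-request latencies alongside the averaged figure would make it easier to compare against the #4616 numbers directly.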
