[Performance] Dynamic Batch Tokenizer #9382


Open · wants to merge 3 commits into main
Conversation

@sundar24295s (Contributor) commented Aug 20, 2025

Motivation

  • This PR introduces an AsyncDynamicBatchTokenizer that enables batching of tokenization requests to improve throughput and reduce latency for SGLang's tokenizer manager.

Performance Impact:

  • For Qwen3-Embedding-0.6B with 500 input tokens and 1 prompt per request at RPS 500, P99 latency improved from 4583 ms to 464 ms (~10× faster) with this PR.
  • In production systems, P99 latency is a critical metric as it represents the worst-case experience for users. This improvement allows systems to extract maximum throughput while maintaining acceptable P99 latency thresholds for SLA compliance.

Context

This PR builds upon the tokenization batching infrastructure introduced in PR #5141, which added enable_tokenizer_batch_encode for batching multiple texts within a single request.

Tokenization Batching Options

| Feature | Use Case | When to Use |
|---------|----------|-------------|
| `enable_tokenizer_batch_encode` (PR #5141) | Client sends batched inputs in a single request | When your client batches multiple texts into one API call |
| `enable_dynamic_batch_tokenizer` (this PR) | Client sends single prompts across multiple requests | When your client application sends individual requests |

Example scenarios:

  • Use enable_tokenizer_batch_encode: Client sends {"input": ["text1", "text2", "text3"]} in one request
  • Use enable_dynamic_batch_tokenizer: Client sends multiple concurrent requests: {"input": "text1"}, {"input": "text2"}, {"input": "text3"}
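For concreteness, here is a hypothetical client for the second pattern, which is the case dynamic batching targets: many near-simultaneous requests, each carrying a single prompt. The endpoint and payload follow the OpenAI-compatible `/v1/embeddings` shape; the URL and model name are placeholders.

```python
import asyncio
import aiohttp


async def embed_one(session: aiohttp.ClientSession, text: str) -> dict:
    # One prompt per request -- the shape the dynamic batch tokenizer is designed for.
    async with session.post(
        "http://localhost:30000/v1/embeddings",               # placeholder URL
        json={"model": "Qwen3-Embedding-0.6B", "input": text},
    ) as resp:
        return await resp.json()


async def main():
    async with aiohttp.ClientSession() as session:
        # Three concurrent single-prompt requests; because they arrive within
        # the batch window, the server can batch their tokenization.
        results = await asyncio.gather(
            embed_one(session, "text1"),
            embed_one(session, "text2"),
            embed_one(session, "text3"),
        )
    print(len(results), "embeddings received")


asyncio.run(main())
```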

Modifications

  • 🚀 Dynamic Batching

    • Automatically batches multiple concurrent tokenization requests for efficiency
    • Processes single requests immediately when no other requests are pending
    • Collects additional requests up to max_batch_size or batch_wait_timeout_s when the queue has pending items (see the sketch after this list)
  • ⚙️ Server Args

    • max_batch_size (default: 32): Maximum number of requests to batch together
    • batch_wait_timeout_s (default: 0.002s): Maximum time to wait for additional requests
    • enable_dynamic_batch_tokenizer: Feature flag to enable/disable the functionality
    • Usage
      --enable-dynamic-batch-tokenizer \
      --dynamic-batch-tokenizer-batch-size 32 \
      --dynamic-batch-tokenizer-batch-timeout 0.002
      
  • 🔄 Async Processing

    • Non-blocking tokenization using asyncio and a ThreadPoolExecutor
    • Maintains event loop responsiveness while handling blocking tokenizer calls
    • Scales efficiently with concurrent requests
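The minimal sketch below combines the two mechanisms described above. It is not the PR's actual code: the class and method names mirror the description but all details are illustrative. A background batcher drains the queue up to max_batch_size or until batch_wait_timeout_s elapses, and the blocking tokenizer call runs on a ThreadPoolExecutor so the event loop stays responsive.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class AsyncDynamicBatchTokenizer:
    """Illustrative sketch of dynamic batching for tokenization requests."""

    def __init__(self, tokenizer, max_batch_size=32, batch_wait_timeout_s=0.002):
        self._tokenizer = tokenizer              # e.g. a HuggingFace tokenizer
        self._max_batch_size = max_batch_size
        self._batch_wait_timeout_s = batch_wait_timeout_s
        self._queue: asyncio.Queue = asyncio.Queue()
        # One worker thread keeps blocking tokenizer calls off the event loop.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._batcher = None

    async def encode(self, text: str) -> dict:
        if self._batcher is None:  # lazily start the batcher on the running loop
            self._batcher = asyncio.create_task(self._batch_loop())
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((text, fut))
        return await fut

    async def _batch_loop(self):
        while True:
            batch = [await self._queue.get()]  # block until a request arrives
            if self._queue.empty():
                # No other pending requests: tokenize this one immediately.
                await self._run_batch(batch)
                continue
            # Queue has pending items: collect more until max_batch_size or
            # the wait timeout, whichever comes first.
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self._batch_wait_timeout_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            await self._run_batch(batch)

    async def _run_batch(self, batch):
        texts = [text for text, _ in batch]
        try:
            # One batched tokenizer call for the whole group, run in a thread.
            encoded = await asyncio.get_running_loop().run_in_executor(
                self._executor, lambda: self._tokenizer(texts)
            )
            for i, (_, fut) in enumerate(batch):
                fut.set_result({k: v[i] for k, v in encoded.items()})
        except Exception as exc:
            for _, fut in batch:
                if not fut.done():
                    fut.set_exception(exc)
```

The immediate-dispatch path keeps the common low-load case fast (no added wait), while the timeout-bounded collection path is what amortizes tokenizer overhead at high RPS.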

Benchmarking and Profiling

  • Model = Qwen3-Embedding-0.6B
  • Input Token Length = 500
  • GPU Type = H100
  • Traffic distribution = Poisson

Baseline Results

(li-sglang) jobuser [ /shared/user/repos/li-sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-Embedding-0.6B --port 30000 --host 0.0.0.0 --enable-metrics --disable-radix-cache --disable-cuda-graph  --is-embedding

| test_duration_secs | minute_interval | target_rps | item_count | server_type | distribution | unique_requests | total_requests | successful_requests | failed_requests | send_duration_secs | total_duration_secs | avg_response_time_ms | p50_response_time_ms | p90_response_time_ms | p99_response_time_ms |
|--------------------|-----------------|------------|------------|-------------|--------------|-----------------|----------------|---------------------|-----------------|--------------------|---------------------|----------------------|----------------------|----------------------|----------------------|
| 60                 | 1               | 300        | 1          | HTTP        | POISSON      | 100             | 15145          | 15145               | 0               | 71.19              | 71.44               | 29.73                | 28.18                | 35.59                | 52.94                |
| 60                 | 1               | 400        | 1          | HTTP        | POISSON      | 100             | 19019          | 19019               | 0               | 75.75              | 76.04               | 53.17                | 34.21                | 58.68                | 436.09               |
| 60                 | 1               | 500        | 1          | HTTP        | POISSON      | 100             | 20350          | 20350               | 0               | 81.05              | 185.33              | 2306.36              | 2345.45              | 4142.11              | 4583.08              |
| 60                 | 1               | 600        | 1          | HTTP        | POISSON      | 100             | 20216          | 20216               | 0               | 86.23              | 107.08              | 5939.38              | 5780.45              | 10926.77             | 11914.25             |

Batch Tokenizer Results

(li-sglang) jobuser [ /shared/user/repos/li-sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-Embedding-0.6B --port 30000 --host 0.0.0.0 --enable-metrics --disable-radix-cache --disable-cuda-graph  --is-embedding --enable-dynamic-batch-tokenizer


| test_duration_secs | minute_interval | target_rps | item_count | server_type | distribution | unique_requests | total_requests | successful_requests | failed_requests | send_duration_secs | total_duration_secs | avg_response_time_ms | p50_response_time_ms | p90_response_time_ms | p99_response_time_ms |
|--------------------|-----------------|------------|------------|-------------|--------------|-----------------|----------------|---------------------|-----------------|--------------------|---------------------|----------------------|----------------------|----------------------|----------------------|
| 60                 | 1               | 300        | 1          | HTTP        | POISSON      | 100             | 15079          | 15079               | 0               | 71.68              | 71.88               | 31.40                | 28.92                | 44.66                | 68.06                |
| 60                 | 1               | 400        | 1          | HTTP        | POISSON      | 100             | 18965          | 18965               | 0               | 75.95              | 76.20               | 70.45                | 52.30                | 98.21                | 424.97               |
| 60                 | 1               | 500        | 1          | HTTP        | POISSON      | 100             | 21972          | 21972               | 0               | 81.29              | 81.75               | 125.62               | 89.51                | 287.98               | 464.68               |
| 60                 | 1               | 600        | 1          | HTTP        | POISSON      | 100             | 24053          | 24053               | 0               | 88.15              | 118.13              | 560.86               | 560.43               | 751.65               | 908.02               |


@sundar24295s marked this pull request as ready for review August 20, 2025 07:44
@hebiao064 (Collaborator) commented:

trigger ci, will review it by today or tmr
