[Performance] Dynamic Batch Tokenizer #9382


Open · wants to merge 3 commits into main
Conversation

@sundar24295s (Contributor) commented Aug 20, 2025

Motivation

  • This PR introduces an AsyncDynamicBatchTokenizer that enables batching of tokenization requests to improve throughput and reduce latency for SGLang's tokenizer manager.

Performance Impact:

  • For Qwen3-Embedding-0.6B with 500 input tokens and 1 prompt per request at RPS 500, P99 latency improved from 4583 ms to 464 ms (~10× faster) with this PR.
  • In production systems, P99 latency is a critical metric as it represents the worst-case experience for users. This improvement allows systems to extract maximum throughput while maintaining acceptable P99 latency thresholds for SLA compliance.

Context

This PR builds upon the tokenization batching infrastructure introduced in PR #5141, which added enable_tokenizer_batch_encode for batching multiple texts within a single request.

Tokenization Batching Options

| Feature | Use Case | When to Use |
|---------|----------|-------------|
| `enable_tokenizer_batch_encode` (PR #5141) | Client sends batched inputs in a single request | When your client batches multiple texts into one API call |
| `enable_dynamic_batch_tokenizer` (this PR) | Client sends single prompts across multiple requests | When your client application sends individual requests |

Example scenarios:

  • Use enable_tokenizer_batch_encode: Client sends {"input": ["text1", "text2", "text3"]} in one request
  • Use enable_dynamic_batch_tokenizer: Client sends multiple concurrent requests: {"input": "text1"}, {"input": "text2"}, {"input": "text3"}
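For concreteness, here is a hypothetical client for the second pattern, which is the case dynamic batching targets: many near-simultaneous requests, each carrying a single prompt. The endpoint and payload follow the OpenAI-compatible `/v1/embeddings` shape; the URL and model name are placeholders.

```python
import asyncio
import aiohttp


async def embed_one(session: aiohttp.ClientSession, text: str) -> dict:
    # One prompt per request -- the shape the dynamic batch tokenizer is designed for.
    async with session.post(
        "http://localhost:30000/v1/embeddings",               # placeholder URL
        json={"model": "Qwen3-Embedding-0.6B", "input": text},
    ) as resp:
        return await resp.json()


async def main():
    async with aiohttp.ClientSession() as session:
        # Three concurrent single-prompt requests; because they arrive within
        # the batch window, the server can batch their tokenization.
        results = await asyncio.gather(
            embed_one(session, "text1"),
            embed_one(session, "text2"),
            embed_one(session, "text3"),
        )
    print(len(results), "embeddings received")


asyncio.run(main())
```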

Modifications

  • 🚀 Dynamic Batching

    • Automatically batches multiple concurrent tokenization requests for efficiency
    • Processes single requests immediately when no other requests are pending
    • Collects additional requests up to max_batch_size or batch_wait_timeout_s when the queue has pending items (see the sketch after this list)
  • ⚙️ Server Args

    • max_batch_size (default: 32): Maximum number of requests to batch together
    • batch_wait_timeout_s (default: 0.002s): Maximum time to wait for additional requests
    • enable_dynamic_batch_tokenizer: Feature flag to enable/disable the functionality
    • Usage
      --enable-dynamic-batch-tokenizer \
      --dynamic-batch-tokenizer-batch-size 32 \
      --dynamic-batch-tokenizer-batch-timeout 0.002
      
  • 🔄 Async Processing

    • Non-blocking tokenization using asyncio and a ThreadPoolExecutor
    • Maintains event loop responsiveness while handling blocking tokenizer calls
    • Scales efficiently with concurrent requests
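The minimal sketch below combines the two mechanisms described above. It is not the PR's actual code: the class and method names mirror the description but all details are illustrative. A background batcher drains the queue up to max_batch_size or until batch_wait_timeout_s elapses, and the blocking tokenizer call runs on a ThreadPoolExecutor so the event loop stays responsive.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class AsyncDynamicBatchTokenizer:
    """Illustrative sketch of dynamic batching for tokenization requests."""

    def __init__(self, tokenizer, max_batch_size=32, batch_wait_timeout_s=0.002):
        self._tokenizer = tokenizer              # e.g. a HuggingFace tokenizer
        self._max_batch_size = max_batch_size
        self._batch_wait_timeout_s = batch_wait_timeout_s
        self._queue: asyncio.Queue = asyncio.Queue()
        # One worker thread keeps blocking tokenizer calls off the event loop.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._batcher = None

    async def encode(self, text: str) -> dict:
        if self._batcher is None:  # lazily start the batcher on the running loop
            self._batcher = asyncio.create_task(self._batch_loop())
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((text, fut))
        return await fut

    async def _batch_loop(self):
        while True:
            batch = [await self._queue.get()]  # block until a request arrives
            if self._queue.empty():
                # No other pending requests: tokenize this one immediately.
                await self._run_batch(batch)
                continue
            # Queue has pending items: collect more until max_batch_size or
            # the wait timeout, whichever comes first.
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self._batch_wait_timeout_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            await self._run_batch(batch)

    async def _run_batch(self, batch):
        texts = [text for text, _ in batch]
        try:
            # One batched tokenizer call for the whole group, run in a thread.
            encoded = await asyncio.get_running_loop().run_in_executor(
                self._executor, lambda: self._tokenizer(texts)
            )
            for i, (_, fut) in enumerate(batch):
                fut.set_result({k: v[i] for k, v in encoded.items()})
        except Exception as exc:
            for _, fut in batch:
                if not fut.done():
                    fut.set_exception(exc)
```

The immediate-dispatch path keeps the common low-load case fast (no added wait), while the timeout-bounded collection path is what amortizes tokenizer overhead at high RPS.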

Benchmarking and Profiling

  • Model = Qwen3-Embedding-0.6B
  • Input Token Length = 500
  • GPU Type = H100
  • Traffic distribution = Poisson

Baseline Results

(li-sglang) jobuser [ /shared/user/repos/li-sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-Embedding-0.6B --port 30000 --host 0.0.0.0 --enable-metrics --disable-radix-cache --disable-cuda-graph  --is-embedding

| test_duration_secs | minute_interval | target_rps | item_count | server_type | distribution | unique_requests | total_requests | successful_requests | failed_requests | send_duration_secs | total_duration_secs | avg_response_time_ms | p50_response_time_ms | p90_response_time_ms | p99_response_time_ms |
|--------------------|-----------------|------------|------------|-------------|--------------|-----------------|----------------|---------------------|-----------------|--------------------|---------------------|----------------------|----------------------|----------------------|----------------------|
| 60                 | 1               | 300        | 1          | HTTP        | POISSON      | 100             | 15145          | 15145               | 0               | 71.19              | 71.44               | 29.73                | 28.18                | 35.59                | 52.94                |
| 60                 | 1               | 400        | 1          | HTTP        | POISSON      | 100             | 19019          | 19019               | 0               | 75.75              | 76.04               | 53.17                | 34.21                | 58.68                | 436.09               |
| 60                 | 1               | 500        | 1          | HTTP        | POISSON      | 100             | 20350          | 20350               | 0               | 81.05              | 185.33              | 2306.36              | 2345.45              | 4142.11              | 4583.08              |
| 60                 | 1               | 600        | 1          | HTTP        | POISSON      | 100             | 20216          | 20216               | 0               | 86.23              | 107.08              | 5939.38              | 5780.45              | 10926.77             | 11914.25             |

Batch Tokenizer Results

(li-sglang) jobuser [ /shared/user/repos/li-sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-Embedding-0.6B --port 30000 --host 0.0.0.0 --enable-metrics --disable-radix-cache --disable-cuda-graph  --is-embedding --enable-dynamic-batch-tokenizer


| test_duration_secs | minute_interval | target_rps | item_count | server_type | distribution | unique_requests | total_requests | successful_requests | failed_requests | send_duration_secs | total_duration_secs | avg_response_time_ms | p50_response_time_ms | p90_response_time_ms | p99_response_time_ms |
|--------------------|-----------------|------------|------------|-------------|--------------|-----------------|----------------|---------------------|-----------------|--------------------|---------------------|----------------------|----------------------|----------------------|----------------------|
| 60                 | 1               | 300        | 1          | HTTP        | POISSON      | 100             | 15079          | 15079               | 0               | 71.68              | 71.88               | 31.40                | 28.92                | 44.66                | 68.06                |
| 60                 | 1               | 400        | 1          | HTTP        | POISSON      | 100             | 18965          | 18965               | 0               | 75.95              | 76.20               | 70.45                | 52.30                | 98.21                | 424.97               |
| 60                 | 1               | 500        | 1          | HTTP        | POISSON      | 100             | 21972          | 21972               | 0               | 81.29              | 81.75               | 125.62               | 89.51                | 287.98               | 464.68               |
| 60                 | 1               | 600        | 1          | HTTP        | POISSON      | 100             | 24053          | 24053               | 0               | 88.15              | 118.13              | 560.86               | 560.43               | 751.65               | 908.02               |


@sundar24295s marked this pull request as ready for review August 20, 2025 07:44
@hebiao064 (Collaborator) commented:

trigger ci, will review it by today or tmr
