Support server based rollout in Verlengine #4848

yitianlian · 2025-03-28T05:19:43Z

Motivation

Currently, Verlengine only supports engine-based rollout. However, server-based rollout not only supports all the features of engine-based rollout but also enables API call functionality, which makes continuous batching easier. Therefore, we present a server-based rollout solution.

Modifications

My modifications focus on three main aspects:

Support for update_weights_from_tensor in the HTTP server:
To maintain compatibility with existing Verlengine features, I added support for the update_weights_from_tensor method in the HTTP server. This function receives metadata instead of the full tensor, allowing for fast and efficient weight updates.
Implementation of HttpServerEngineAdapter:
I introduced the HttpServerEngineAdapter class, which mirrors all the methods used by the engine in the Verlengine class. This allows it to function as a server-based rollout engine with full feature compatibility.
Test file for validation:
A dedicated test file was added to ensure the correctness of the implementation and verify that all functionalities work as expected.
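
To make the adapter idea concrete, here is a minimal sketch of an engine-like class that forwards method calls to the HTTP server. It is not the code in this PR: the class name SketchHttpServerAdapter, the injectable post callable, and the /generate request fields ("text", "sampling_params") are assumptions made for illustration. The /update_weights_from_tensor endpoint and its payload keys are taken from the request shown later in this thread.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List, Optional

PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def _http_post(url: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON payload and decode the JSON response (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


class SketchHttpServerAdapter:
    """Hypothetical engine-like wrapper; each method maps to one HTTP endpoint."""

    def __init__(self, base_url: str, post: PostFn = _http_post) -> None:
        self.base_url = base_url.rstrip("/")
        # The transport is injectable so the class can be exercised offline.
        self._post = post

    def update_weights_from_tensor(
        self,
        serialized_named_tensors: List[str],
        load_format: Optional[str] = None,
        flush_cache: bool = False,
    ) -> Dict[str, Any]:
        # Payload keys follow the request shown later in this thread.
        payload = {
            "serialized_named_tensors": serialized_named_tensors,
            "load_format": load_format,
            "flush_cache": flush_cache,
        }
        return self._post(f"{self.base_url}/update_weights_from_tensor", payload)

    def generate(
        self, text: str, sampling_params: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        # Assumed request shape for /generate; adjust to the server's real schema.
        payload = {"text": text, "sampling_params": sampling_params or {}}
        return self._post(f"{self.base_url}/generate", payload)
```

Because the caller only sees ordinary methods, training code written against the engine could in principle switch to the server by swapping the object it holds, which is the compatibility property described above.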

@jhinpan @zhaochenyang20

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

zhaochenyang20 · 2025-03-28T16:12:48Z

Great work, Chengxing. I will review all the design docs, dev logs, and code carefully.

yangky11 · 2025-04-03T04:52:54Z

I'm trying out this PR but got the error below:

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct &
[1] 3188749
(wei) kaiyuy@learnfair6036:~/WEI/wei-agent$ INFO 04-03 04:46:18 __init__.py:190] Automatically detected platform cuda.
[2025-04-03 04:46:21] server_args=ServerArgs(model_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=827599081, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, 
enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode=None, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
INFO 04-03 04:46:28 __init__.py:190] Automatically detected platform cuda.
INFO 04-03 04:46:28 __init__.py:190] Automatically detected platform cuda.
[2025-04-03 04:46:31 TP0] Init torch distributed begin.
[2025-04-03 04:46:31 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-03 04:46:31 TP0] Load weight begin. avail mem=78.73 GB
[2025-04-03 04:46:32 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.14it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.09it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.25it/s]

[2025-04-03 04:46:35 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.63 GB, mem usage=15.10 GB.
[2025-04-03 04:46:36 TP0] KV Cache is allocated. #tokens: 443865, K size: 27.09 GB, V size: 27.09 GB
[2025-04-03 04:46:36 TP0] Memory pool end. avail mem=8.34 GB
[2025-04-03 04:46:36 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=7.85 GB
Capturing batches (avail_mem=6.56 GB): 100%|███████████████████████████████████████████████████████████████████| 23/23 [00:06<00:00,  3.57it/s]
[2025-04-03 04:46:42 TP0] Capture cuda graph end. Time elapsed: 6.44 s. avail mem=6.54 GB. mem usage=1.30 GB.
[2025-04-03 04:46:43 TP0] max_total_num_tokens=443865, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-04-03 04:46:43] INFO:     Started server process [3188749]
[2025-04-03 04:46:43] INFO:     Waiting for application startup.
[2025-04-03 04:46:43] INFO:     Application startup complete.
[2025-04-03 04:46:43] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-04-03 04:46:44] INFO:     127.0.0.1:39848 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-04-03 04:46:44 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-03 04:46:47] INFO:     127.0.0.1:39864 - "POST /generate HTTP/1.1" 200 OK
[2025-04-03 04:46:47] The server is fired up and ready to roll!

(wei) kaiyuy@learnfair6036:~/WEI/wei-agent$ ipython
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.30.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pickle

In [2]: import requests

In [3]: import base64

In [4]: from sglang.srt.utils import MultiprocessingSerializer

In [5]: from transformers import AutoModel

In [6]: model = AutoModel.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.15s/it]

In [7]: named_tensors = {name: tensor for name, tensor in model.named_parameters()}

In [8]: serialized_named_tensors = [base64.b64encode(pickle.dumps(MultiprocessingSerializer.serialize(named_tensors))).decode("utf-8")]

In [9]: requests.get("http://127.0.0.1:30000/health_generate")
[2025-04-03 04:50:13 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-03 04:50:14] INFO:     127.0.0.1:57600 - "GET /health_generate HTTP/1.1" 200 OK
Out[9]: <Response [200]>

In [10]: requests.post("http://127.0.0.1:30000/update_weights_from_tensor", json={"serialized_named_tensors": serialized_named_tensors, "load_format": None, "flush_cache": False})
[2025-04-03 04:50:25 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 2011, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 614, in event_loop_overlap
    self.process_input_requests(recv_reqs)
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 778, in process_input_requests
    output = self._request_dispatcher(recv_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/utils.py", line 471, in __call__
    return fn(obj)
           ^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in update_weights_from_tensor
    success, message = self.tp_worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 232, in update_weights_from_tensor
    success, message = self.worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/tp_worker.py", line 219, in update_weights_from_tensor
    named_tensors=MultiprocessingSerializer.deserialize(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/utils.py", line 1531, in deserialize
    return ForkingPickler.loads(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 525, in Client
    answer_challenge(c, authkey)
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 964, in answer_challenge
    raise AuthenticationError('digest sent was rejected')
multiprocessing.context.AuthenticationError: digest sent was rejected

[2025-04-03 04:50:25] Received sigquit from a child process. It usually means the child failed.
---------------------------------------------------------------------------
AuthenticationError                       Traceback (most recent call last)
File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py:138, in _ResourceSharer._serve(self)
    136 while 1:
    137     try:
--> 138         with self._listener.accept() as conn:
    139             msg = conn.recv()
    140             if msg is None:

File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py:482, in Listener.accept(self)
    480 c = self._listener.accept()
    481 if self._authkey is not None:
--> 482     deliver_challenge(c, self._authkey)
    483     answer_challenge(c, self._authkey)
    484 return c

File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py:941, in deliver_challenge(connection, authkey, digest_name)
    939 response = connection.recv_bytes(256)        # reject large message
    940 try:
--> 941     _verify_challenge(authkey, message, response)
    942 except AuthenticationError:
    943     connection.send_bytes(_FAILURE)

File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py:925, in _verify_challenge(authkey, message, response)
    921     raise AuthenticationError(
    922             f'expected {response_digest!r} of length {len(expected)} '
    923             f'got {len(response_mac)}')
    924 if not hmac.compare_digest(expected, response_mac):
--> 925     raise AuthenticationError('digest received was wrong')

AuthenticationError: digest received was wrong
---------------------------------------------------------------------------
RemoteDisconnected                        Traceback (most recent call last)
File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    786 # Make the request on the HTTPConnection object
--> 787 response = self._make_request(
    788     conn,
    789     method,
    790     url,
    791     timeout=timeout_obj,
    792     body=body,
    793     headers=headers,
    794     chunked=chunked,
    795     retries=retries,
    796     response_conn=response_conn,
    797     preload_content=preload_content,
    798     decode_content=decode_content,
    799     **response_kw,
    800 )
    802 # Everything went great!

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:534, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    533 try:
--> 534     response = conn.getresponse()
    535 except (BaseSSLError, OSError) as e:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connection.py:516, in HTTPConnection.getresponse(self)
    515 # Get the response from http.client.HTTPConnection
--> 516 httplib_response = super().getresponse()
    518 try:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:1430, in HTTPConnection.getresponse(self)
   1429 try:
-> 1430     response.begin()
   1431 except ConnectionError:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:331, in HTTPResponse.begin(self)
    330 while True:
--> 331     version, status, reason = self._read_status()
    332     if status != CONTINUE:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:300, in HTTPResponse._read_status(self)
    297 if not line:
    298     # Presumably, the server closed the connection before
    299     # sending a valid response.
--> 300     raise RemoteDisconnected("Remote end closed connection without"
    301                              " response")
    302 try:

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/adapters.py:667, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    666 try:
--> 667     resp = conn.urlopen(
    668         method=request.method,
    669         url=url,
    670         body=request.body,
    671         headers=request.headers,
    672         redirect=False,
    673         assert_same_host=False,
    674         preload_content=False,
    675         decode_content=False,
    676         retries=self.max_retries,
    677         timeout=timeout,
    678         chunked=chunked,
    679     )
    681 except (ProtocolError, OSError) as err:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:841, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    839     new_e = ProtocolError("Connection aborted.", new_e)
--> 841 retries = retries.increment(
    842     method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2]
    843 )
    844 retries.sleep()

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/util/retry.py:474, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    473 if read is False or method is None or not self._is_method_retryable(method):
--> 474     raise reraise(type(error), error, _stacktrace)
    475 elif read is not None:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/util/util.py:38, in reraise(tp, value, tb)
     37 if value.__traceback__ is not tb:
---> 38     raise value.with_traceback(tb)
     39 raise value

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    786 # Make the request on the HTTPConnection object
--> 787 response = self._make_request(
    788     conn,
    789     method,
    790     url,
    791     timeout=timeout_obj,
    792     body=body,
    793     headers=headers,
    794     chunked=chunked,
    795     retries=retries,
    796     response_conn=response_conn,
    797     preload_content=preload_content,
    798     decode_content=decode_content,
    799     **response_kw,
    800 )
    802 # Everything went great!

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:534, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    533 try:
--> 534     response = conn.getresponse()
    535 except (BaseSSLError, OSError) as e:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connection.py:516, in HTTPConnection.getresponse(self)
    515 # Get the response from http.client.HTTPConnection
--> 516 httplib_response = super().getresponse()
    518 try:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:1430, in HTTPConnection.getresponse(self)
   1429 try:
-> 1430     response.begin()
   1431 except ConnectionError:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:331, in HTTPResponse.begin(self)
    330 while True:
--> 331     version, status, reason = self._read_status()
    332     if status != CONTINUE:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:300, in HTTPResponse._read_status(self)
    297 if not line:
    298     # Presumably, the server closed the connection before
    299     # sending a valid response.
--> 300     raise RemoteDisconnected("Remote end closed connection without"
    301                              " response")
    302 try:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
Cell In[10], line 1
----> 1 requests.post("http://127.0.0.1:30000/update_weights_from_tensor", json={"serialized_named_tensors": serialized_named_tensors, "load_format": None, "flush_cache": False})

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/api.py:115, in post(url, data, json, **kwargs)
    103 def post(url, data=None, json=None, **kwargs):
    104     r"""Sends a POST request.
    105 
    106     :param url: URL for the new :class:`Request` object.
   (...)
    112     :rtype: requests.Response
    113     """
--> 115     return request("post", url, data=data, json=json, **kwargs)

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/api.py:59, in request(method, url, **kwargs)
     55 # By using the 'with' statement we are sure the session is closed, thus we
     56 # avoid leaving sockets open which can trigger a ResourceWarning in some
     57 # cases, and look like a memory leak in others.
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    584 send_kwargs = {
    585     "timeout": timeout,
    586     "allow_redirects": allow_redirects,
    587 }
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
    700 start = preferred_clock()
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)
    706 elapsed = preferred_clock() - start

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/adapters.py:682, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    667     resp = conn.urlopen(
    668         method=request.method,
    669         url=url,
   (...)
    678         chunked=chunked,
    679     )
    681 except (ProtocolError, OSError) as err:
--> 682     raise ConnectionError(err, request=request)
    684 except MaxRetryError as e:
    685     if isinstance(e.reason, ConnectTimeoutError):
    686         # TODO: Remove this in 3.0.0: see #2811

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

zhaochenyang20 · 2025-04-03T05:04:28Z

I'm trying out this PR but got the error below:

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct &
[1] 3188749
(wei) kaiyuy@learnfair6036:~/WEI/wei-agent$ INFO 04-03 04:46:18 __init__.py:190] Automatically detected platform cuda.
[2025-04-03 04:46:21] server_args=ServerArgs(model_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=827599081, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, 
enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode=None, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
INFO 04-03 04:46:28 __init__.py:190] Automatically detected platform cuda.
INFO 04-03 04:46:28 __init__.py:190] Automatically detected platform cuda.
[2025-04-03 04:46:31 TP0] Init torch distributed begin.
[2025-04-03 04:46:31 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-03 04:46:31 TP0] Load weight begin. avail mem=78.73 GB
[2025-04-03 04:46:32 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.14it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.09it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.25it/s]

[2025-04-03 04:46:35 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.63 GB, mem usage=15.10 GB.
[2025-04-03 04:46:36 TP0] KV Cache is allocated. #tokens: 443865, K size: 27.09 GB, V size: 27.09 GB
[2025-04-03 04:46:36 TP0] Memory pool end. avail mem=8.34 GB
[2025-04-03 04:46:36 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=7.85 GB
Capturing batches (avail_mem=6.56 GB): 100%|███████████████████████████████████████████████████████████████████| 23/23 [00:06<00:00,  3.57it/s]
[2025-04-03 04:46:42 TP0] Capture cuda graph end. Time elapsed: 6.44 s. avail mem=6.54 GB. mem usage=1.30 GB.
[2025-04-03 04:46:43 TP0] max_total_num_tokens=443865, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-04-03 04:46:43] INFO:     Started server process [3188749]
[2025-04-03 04:46:43] INFO:     Waiting for application startup.
[2025-04-03 04:46:43] INFO:     Application startup complete.
[2025-04-03 04:46:43] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-04-03 04:46:44] INFO:     127.0.0.1:39848 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-04-03 04:46:44 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-03 04:46:47] INFO:     127.0.0.1:39864 - "POST /generate HTTP/1.1" 200 OK
[2025-04-03 04:46:47] The server is fired up and ready to roll!

(wei) kaiyuy@learnfair6036:~/WEI/wei-agent$ ipython
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.30.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pickle

In [2]: import requests

In [3]: import base64

In [4]: from sglang.srt.utils import MultiprocessingSerializer

In [5]: from transformers import AutoModel

In [6]: model = AutoModel.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.15s/it]

In [7]: named_tensors = {name: tensor for name, tensor in model.named_parameters()}

In [8]: serialized_named_tensors = [base64.b64encode(pickle.dumps(MultiprocessingSerializer.serialize(named_tensors))).decode("utf-8")]

In [9]: requests.get("http://127.0.0.1:30000/health_generate")
[2025-04-03 04:50:13 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-03 04:50:14] INFO:     127.0.0.1:57600 - "GET /health_generate HTTP/1.1" 200 OK
Out[9]: <Response [200]>

In [10]: requests.post("http://127.0.0.1:30000/update_weights_from_tensor", json={"serialized_named_tensors": serialized_named_tensors, "load_format": None, "flush_cache": False})
[2025-04-03 04:50:25 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 2011, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 614, in event_loop_overlap
    self.process_input_requests(recv_reqs)
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 778, in process_input_requests
    output = self._request_dispatcher(recv_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/utils.py", line 471, in __call__
    return fn(obj)
           ^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in update_weights_from_tensor
    success, message = self.tp_worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 232, in update_weights_from_tensor
    success, message = self.worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/tp_worker.py", line 219, in update_weights_from_tensor
    named_tensors=MultiprocessingSerializer.deserialize(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/utils.py", line 1531, in deserialize
    return ForkingPickler.loads(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 525, in Client
    answer_challenge(c, authkey)
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 964, in answer_challenge
    raise AuthenticationError('digest sent was rejected')
multiprocessing.context.AuthenticationError: digest sent was rejected

[2025-04-03 04:50:25] Received sigquit from a child process. It usually means the child failed.
---------------------------------------------------------------------------
AuthenticationError                       Traceback (most recent call last)
File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py:138, in _ResourceSharer._serve(self)
    136 while 1:
    137     try:
--> 138         with self._listener.accept() as conn:
    139             msg = conn.recv()
    140             if msg is None:

File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py:482, in Listener.accept(self)
    480 c = self._listener.accept()
    481 if self._authkey is not None:
--> 482     deliver_challenge(c, self._authkey)
    483     answer_challenge(c, self._authkey)
    484 return c

File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py:941, in deliver_challenge(connection, authkey, digest_name)
    939 response = connection.recv_bytes(256)        # reject large message
    940 try:
--> 941     _verify_challenge(authkey, message, response)
    942 except AuthenticationError:
    943     connection.send_bytes(_FAILURE)

File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py:925, in _verify_challenge(authkey, message, response)
    921     raise AuthenticationError(
    922             f'expected {response_digest!r} of length {len(expected)} '
    923             f'got {len(response_mac)}')
    924 if not hmac.compare_digest(expected, response_mac):
--> 925     raise AuthenticationError('digest received was wrong')

AuthenticationError: digest received was wrong
---------------------------------------------------------------------------
RemoteDisconnected                        Traceback (most recent call last)
File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    786 # Make the request on the HTTPConnection object
--> 787 response = self._make_request(
    788     conn,
    789     method,
    790     url,
    791     timeout=timeout_obj,
    792     body=body,
    793     headers=headers,
    794     chunked=chunked,
    795     retries=retries,
    796     response_conn=response_conn,
    797     preload_content=preload_content,
    798     decode_content=decode_content,
    799     **response_kw,
    800 )
    802 # Everything went great!

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:534, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    533 try:
--> 534     response = conn.getresponse()
    535 except (BaseSSLError, OSError) as e:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connection.py:516, in HTTPConnection.getresponse(self)
    515 # Get the response from http.client.HTTPConnection
--> 516 httplib_response = super().getresponse()
    518 try:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:1430, in HTTPConnection.getresponse(self)
   1429 try:
-> 1430     response.begin()
   1431 except ConnectionError:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:331, in HTTPResponse.begin(self)
    330 while True:
--> 331     version, status, reason = self._read_status()
    332     if status != CONTINUE:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:300, in HTTPResponse._read_status(self)
    297 if not line:
    298     # Presumably, the server closed the connection before
    299     # sending a valid response.
--> 300     raise RemoteDisconnected("Remote end closed connection without"
    301                              " response")
    302 try:

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/adapters.py:667, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    666 try:
--> 667     resp = conn.urlopen(
    668         method=request.method,
    669         url=url,
    670         body=request.body,
    671         headers=request.headers,
    672         redirect=False,
    673         assert_same_host=False,
    674         preload_content=False,
    675         decode_content=False,
    676         retries=self.max_retries,
    677         timeout=timeout,
    678         chunked=chunked,
    679     )
    681 except (ProtocolError, OSError) as err:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:841, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    839     new_e = ProtocolError("Connection aborted.", new_e)
--> 841 retries = retries.increment(
    842     method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2]
    843 )
    844 retries.sleep()

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/util/retry.py:474, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    473 if read is False or method is None or not self._is_method_retryable(method):
--> 474     raise reraise(type(error), error, _stacktrace)
    475 elif read is not None:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/util/util.py:38, in reraise(tp, value, tb)
     37 if value.__traceback__ is not tb:
---> 38     raise value.with_traceback(tb)
     39 raise value

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    786 # Make the request on the HTTPConnection object
--> 787 response = self._make_request(
    788     conn,
    789     method,
    790     url,
    791     timeout=timeout_obj,
    792     body=body,
    793     headers=headers,
    794     chunked=chunked,
    795     retries=retries,
    796     response_conn=response_conn,
    797     preload_content=preload_content,
    798     decode_content=decode_content,
    799     **response_kw,
    800 )
    802 # Everything went great!

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:534, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    533 try:
--> 534     response = conn.getresponse()
    535 except (BaseSSLError, OSError) as e:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connection.py:516, in HTTPConnection.getresponse(self)
    515 # Get the response from http.client.HTTPConnection
--> 516 httplib_response = super().getresponse()
    518 try:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:1430, in HTTPConnection.getresponse(self)
   1429 try:
-> 1430     response.begin()
   1431 except ConnectionError:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:331, in HTTPResponse.begin(self)
    330 while True:
--> 331     version, status, reason = self._read_status()
    332     if status != CONTINUE:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:300, in HTTPResponse._read_status(self)
    297 if not line:
    298     # Presumably, the server closed the connection before
    299     # sending a valid response.
--> 300     raise RemoteDisconnected("Remote end closed connection without"
    301                              " response")
    302 try:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
Cell In[10], line 1
----> 1 requests.post("http://127.0.0.1:30000/update_weights_from_tensor", json={"serialized_named_tensors": serialized_named_tensors, "load_format": None, "flush_cache": False})

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/api.py:115, in post(url, data, json, **kwargs)
    103 def post(url, data=None, json=None, **kwargs):
    104     r"""Sends a POST request.
    105 
    106     :param url: URL for the new :class:`Request` object.
   (...)
    112     :rtype: requests.Response
    113     """
--> 115     return request("post", url, data=data, json=json, **kwargs)

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/api.py:59, in request(method, url, **kwargs)
     55 # By using the 'with' statement we are sure the session is closed, thus we
     56 # avoid leaving sockets open which can trigger a ResourceWarning in some
     57 # cases, and look like a memory leak in others.
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    584 send_kwargs = {
    585     "timeout": timeout,
    586     "allow_redirects": allow_redirects,
    587 }
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
    700 start = preferred_clock()
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)
    706 elapsed = preferred_clock() - start

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/adapters.py:682, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    667     resp = conn.urlopen(
    668         method=request.method,
    669         url=url,
   (...)
    678         chunked=chunked,
    679     )
    681 except (ProtocolError, OSError) as err:
--> 682     raise ConnectionError(err, request=request)
    684 except MaxRetryError as e:
    685     if isinstance(e.reason, ConnectTimeoutError):
    686         # TODO: Remove this in 3.0.0: see #2811

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Chengxing, could you help with this?

yitianlian · 2025-04-03T05:07:18Z

I'm trying out this PR but got the error below:

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct &
[1] 3188749
(wei) kaiyuy@learnfair6036:~/WEI/wei-agent$ INFO 04-03 04:46:18 __init__.py:190] Automatically detected platform cuda.
[2025-04-03 04:46:21] server_args=ServerArgs(model_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=827599081, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, 
enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode=None, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
INFO 04-03 04:46:28 __init__.py:190] Automatically detected platform cuda.
INFO 04-03 04:46:28 __init__.py:190] Automatically detected platform cuda.
[2025-04-03 04:46:31 TP0] Init torch distributed begin.
[2025-04-03 04:46:31 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-03 04:46:31 TP0] Load weight begin. avail mem=78.73 GB
[2025-04-03 04:46:32 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.14it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.09it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.25it/s]

[2025-04-03 04:46:35 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.63 GB, mem usage=15.10 GB.
[2025-04-03 04:46:36 TP0] KV Cache is allocated. #tokens: 443865, K size: 27.09 GB, V size: 27.09 GB
[2025-04-03 04:46:36 TP0] Memory pool end. avail mem=8.34 GB
[2025-04-03 04:46:36 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=7.85 GB
Capturing batches (avail_mem=6.56 GB): 100%|███████████████████████████████████████████████████████████████████| 23/23 [00:06<00:00,  3.57it/s]
[2025-04-03 04:46:42 TP0] Capture cuda graph end. Time elapsed: 6.44 s. avail mem=6.54 GB. mem usage=1.30 GB.
[2025-04-03 04:46:43 TP0] max_total_num_tokens=443865, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-04-03 04:46:43] INFO:     Started server process [3188749]
[2025-04-03 04:46:43] INFO:     Waiting for application startup.
[2025-04-03 04:46:43] INFO:     Application startup complete.
[2025-04-03 04:46:43] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-04-03 04:46:44] INFO:     127.0.0.1:39848 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-04-03 04:46:44 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-03 04:46:47] INFO:     127.0.0.1:39864 - "POST /generate HTTP/1.1" 200 OK
[2025-04-03 04:46:47] The server is fired up and ready to roll!

(wei) kaiyuy@learnfair6036:~/WEI/wei-agent$ ipython
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.30.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pickle

In [2]: import requests

In [3]: import base64

In [4]: from sglang.srt.utils import MultiprocessingSerializer

In [5]: from transformers import AutoModel

In [6]: model = AutoModel.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.15s/it]

In [7]: named_tensors = {name: tensor for name, tensor in model.named_parameters()}

In [8]: serialized_named_tensors = [base64.b64encode(pickle.dumps(MultiprocessingSerializer.serialize(named_tensors))).decode("utf-8")]

In [9]: requests.get("http://127.0.0.1:30000/health_generate")
[2025-04-03 04:50:13 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-03 04:50:14] INFO:     127.0.0.1:57600 - "GET /health_generate HTTP/1.1" 200 OK
Out[9]: <Response [200]>

In [10]: requests.post("http://127.0.0.1:30000/update_weights_from_tensor", json={"serialized_named_tensors": serialized_named_tensors, "load_f
    ...: ormat": None, "flush_cache": False})
[2025-04-03 04:50:25 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 2011, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 614, in event_loop_overlap
    self.process_input_requests(recv_reqs)
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 778, in process_input_requests
    output = self._request_dispatcher(recv_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/utils.py", line 471, in __call__
    return fn(obj)
           ^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in update_weights_from_tensor
    success, message = self.tp_worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 232, in update_weights_from_tensor
    success, message = self.worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/tp_worker.py", line 219, in update_weights_from_tensor
    named_tensors=MultiprocessingSerializer.deserialize(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/utils.py", line 1531, in deserialize
    return ForkingPickler.loads(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 525, in Client
    answer_challenge(c, authkey)
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 964, in answer_challenge
    raise AuthenticationError('digest sent was rejected')
multiprocessing.context.AuthenticationError: digest sent was rejected

[2025-04-03 04:50:25] Received sigquit from a child process. It usually means the child failed.
---------------------------------------------------------------------------
AuthenticationError                       Traceback (most recent call last)
File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py:138, in _ResourceSharer._serve(self)
    136 while 1:
    137     try:
--> 138         with self._listener.accept() as conn:
    139             msg = conn.recv()
    140             if msg is None:

File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py:482, in Listener.accept(self)
    480 c = self._listener.accept()
    481 if self._authkey is not None:
--> 482     deliver_challenge(c, self._authkey)
    483     answer_challenge(c, self._authkey)
    484 return c

File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py:941, in deliver_challenge(connection, authkey, digest_name)
    939 response = connection.recv_bytes(256)        # reject large message
    940 try:
--> 941     _verify_challenge(authkey, message, response)
    942 except AuthenticationError:
    943     connection.send_bytes(_FAILURE)

File ~/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py:925, in _verify_challenge(authkey, message, response)
    921     raise AuthenticationError(
    922             f'expected {response_digest!r} of length {len(expected)} '
    923             f'got {len(response_mac)}')
    924 if not hmac.compare_digest(expected, response_mac):
--> 925     raise AuthenticationError('digest received was wrong')

AuthenticationError: digest received was wrong
---------------------------------------------------------------------------
RemoteDisconnected                        Traceback (most recent call last)
File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    786 # Make the request on the HTTPConnection object
--> 787 response = self._make_request(
    788     conn,
    789     method,
    790     url,
    791     timeout=timeout_obj,
    792     body=body,
    793     headers=headers,
    794     chunked=chunked,
    795     retries=retries,
    796     response_conn=response_conn,
    797     preload_content=preload_content,
    798     decode_content=decode_content,
    799     **response_kw,
    800 )
    802 # Everything went great!

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:534, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    533 try:
--> 534     response = conn.getresponse()
    535 except (BaseSSLError, OSError) as e:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connection.py:516, in HTTPConnection.getresponse(self)
    515 # Get the response from http.client.HTTPConnection
--> 516 httplib_response = super().getresponse()
    518 try:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:1430, in HTTPConnection.getresponse(self)
   1429 try:
-> 1430     response.begin()
   1431 except ConnectionError:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:331, in HTTPResponse.begin(self)
    330 while True:
--> 331     version, status, reason = self._read_status()
    332     if status != CONTINUE:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:300, in HTTPResponse._read_status(self)
    297 if not line:
    298     # Presumably, the server closed the connection before
    299     # sending a valid response.
--> 300     raise RemoteDisconnected("Remote end closed connection without"
    301                              " response")
    302 try:

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/adapters.py:667, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    666 try:
--> 667     resp = conn.urlopen(
    668         method=request.method,
    669         url=url,
    670         body=request.body,
    671         headers=request.headers,
    672         redirect=False,
    673         assert_same_host=False,
    674         preload_content=False,
    675         decode_content=False,
    676         retries=self.max_retries,
    677         timeout=timeout,
    678         chunked=chunked,
    679     )
    681 except (ProtocolError, OSError) as err:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:841, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    839     new_e = ProtocolError("Connection aborted.", new_e)
--> 841 retries = retries.increment(
    842     method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2]
    843 )
    844 retries.sleep()

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/util/retry.py:474, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    473 if read is False or method is None or not self._is_method_retryable(method):
--> 474     raise reraise(type(error), error, _stacktrace)
    475 elif read is not None:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/util/util.py:38, in reraise(tp, value, tb)
     37 if value.__traceback__ is not tb:
---> 38     raise value.with_traceback(tb)
     39 raise value

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    786 # Make the request on the HTTPConnection object
--> 787 response = self._make_request(
    788     conn,
    789     method,
    790     url,
    791     timeout=timeout_obj,
    792     body=body,
    793     headers=headers,
    794     chunked=chunked,
    795     retries=retries,
    796     response_conn=response_conn,
    797     preload_content=preload_content,
    798     decode_content=decode_content,
    799     **response_kw,
    800 )
    802 # Everything went great!

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py:534, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    533 try:
--> 534     response = conn.getresponse()
    535 except (BaseSSLError, OSError) as e:

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connection.py:516, in HTTPConnection.getresponse(self)
    515 # Get the response from http.client.HTTPConnection
--> 516 httplib_response = super().getresponse()
    518 try:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:1430, in HTTPConnection.getresponse(self)
   1429 try:
-> 1430     response.begin()
   1431 except ConnectionError:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:331, in HTTPResponse.begin(self)
    330 while True:
--> 331     version, status, reason = self._read_status()
    332     if status != CONTINUE:

File ~/miniconda3/envs/wei/lib/python3.12/http/client.py:300, in HTTPResponse._read_status(self)
    297 if not line:
    298     # Presumably, the server closed the connection before
    299     # sending a valid response.
--> 300     raise RemoteDisconnected("Remote end closed connection without"
    301                              " response")
    302 try:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
Cell In[10], line 1
----> 1 requests.post("http://127.0.0.1:30000/update_weights_from_tensor", json={"serialized_named_tensors": serialized_named_tensors, "load_format": None, "flush_cache": False})

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/api.py:115, in post(url, data, json, **kwargs)
    103 def post(url, data=None, json=None, **kwargs):
    104     r"""Sends a POST request.
    105 
    106     :param url: URL for the new :class:`Request` object.
   (...)
    112     :rtype: requests.Response
    113     """
--> 115     return request("post", url, data=data, json=json, **kwargs)

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/api.py:59, in request(method, url, **kwargs)
     55 # By using the 'with' statement we are sure the session is closed, thus we
     56 # avoid leaving sockets open which can trigger a ResourceWarning in some
     57 # cases, and look like a memory leak in others.
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    584 send_kwargs = {
    585     "timeout": timeout,
    586     "allow_redirects": allow_redirects,
    587 }
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
    700 start = preferred_clock()
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)
    706 elapsed = preferred_clock() - start

File ~/miniconda3/envs/wei/lib/python3.12/site-packages/requests/adapters.py:682, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    667     resp = conn.urlopen(
    668         method=request.method,
    669         url=url,
   (...)
    678         chunked=chunked,
    679     )
    681 except (ProtocolError, OSError) as err:
--> 682     raise ConnectionError(err, request=request)
    684 except MaxRetryError as e:
    685     if isinstance(e.reason, ConnectTimeoutError):
    686         # TODO: Remove this in 3.0.0: see #2811

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Chengxing, could you help with this?

Sure!

yitianlian · 2025-04-03T05:20:16Z

@yangky11
You can take a closer look at the update_weights_from_tensor method in both the VerlEngine and the HttpServerEngineAdapter. When using named_tensors, the correct flow is to first call the method in VerlEngine, which internally handles preprocessing by converting the tensors into LocalSerializedTensor. Only after this step does it delegate the update to the HttpServerEngineAdapter.

However, in your current implementation, you're calling update_weights_from_tensor directly on the HttpServerEngineAdapter, which bypasses the necessary preprocessing steps in VerlEngine, and leads to an error.

The correct order of operations should be as follows:

def update_weights_from_tensor(
        self,
        named_tensors: List[Tuple[str, torch.Tensor]],
        load_format: Optional[str] = None,
    ):
        # Most naive implementation, can optimize a lot if it is bottleneck
        for tensor_index, (name, tensor) in enumerate(named_tensors):
            serialized_tensor = MultiprocessingSerializer.serialize(
                _preprocess_tensor_for_update_weights(tensor)
            )

            if self._tp_rank == 0:
                gathered_serialized_tensors = [None for _ in range(self._tp_size)]
            else:
                gathered_serialized_tensors = None
            dist.gather_object(
                obj=serialized_tensor,
                object_gather_list=gathered_serialized_tensors,
                dst=self._device_mesh_cpu.mesh.tolist()[0],
                group=self._device_mesh_cpu.get_group(),
            )

            if self._tp_rank == 0:
                self._engine.update_weights_from_tensor(
                    named_tensors=[
                        (
                            name,
                            LocalSerializedTensor(values=gathered_serialized_tensors),
                        )
                    ],
                    load_format=load_format,
                    flush_cache=tensor_index == len(named_tensors) - 1,
                )

and then in self._engine.update_weights_from_tensor:

def update_weights_from_tensor(
        self,
        named_tensors: List[Tuple[str, torch.Tensor]],
        load_format: Optional[str] = None,
        flush_cache: bool = False,
    ):

        print(f"update_weights_from_tensor of HttpServerEngineAdapter")
        return requests.post(
            f"http://localhost:{self.server_args.port}/update_weights_from_tensor",
            json={
                "serialized_named_tensors": [
                    serialize_for_http(
                        MultiprocessingSerializer.serialize(named_tensors)
                    )
                    for _ in range(self.server_args.tp_size)
                ],
                "load_format": load_format,
                "flush_cache": flush_cache,
            },
        )

yangky11 · 2025-04-03T14:49:42Z

@yitianlian That makes sense! I'm trying to have a minimal example that performs the preprocessing in VerlEngine and HttpServerEngineAdapter before sending requests to the SGLang server.

Like before, I launched the SGLang server using python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct &. Then, I ran the Python code below. The code was adapted from VerlEngine and HttpServerEngineAdapter with some simplification (I'm only updating a single parameter, and I omitted dist.gather since tp == 1).

import requests
import pickle
import base64
from transformers import AutoModel
from torch.distributed.tensor import DTensor
from sglang.srt.utils import MultiprocessingSerializer
from sglang.srt.model_executor.model_runner import LocalSerializedTensor


def serialize_for_http(data):
    # First pickle the data, then convert to base64 for safe HTTP transmission
    pickled = pickle.dumps(data)
    return base64.b64encode(pickled).decode("utf-8")


model = AutoModel.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
name, tensor = list(model.state_dict().items())[0]
print(name, tensor)
serialized_tensor = MultiprocessingSerializer.serialize(
    tensor.full_tensor() if isinstance(tensor, DTensor) else tensor
)
print(serialized_tensor)
gathered_serialized_tensors = [serialized_tensor]

requests.post(
    "http://127.0.0.1:30000/update_weights_from_tensor",
    json={
        "serialized_named_tensors": [
            serialize_for_http(
                MultiprocessingSerializer.serialize(
                    [(name, LocalSerializedTensor(values=gathered_serialized_tensors))]
                )
            )
        ],
        "load_format": None,
        "flush_cache": False,
    },
)

Here is the error I got:

(wei) kaiyuy@learnfair6021:~/WEI/wei-agent$ python tmp.py 
INFO 04-03 14:42:36 __init__.py:190] Automatically detected platform cuda.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.19s/it]
embed_tokens.weight tensor([[ 1.0605e-03,  5.6152e-03, -3.4180e-03,  ...,  4.1199e-03,
         -2.8076e-03, -6.7139e-04],
        [-3.7384e-03,  9.7275e-04, -1.8158e-03,  ...,  1.5259e-03,
         -2.2583e-03, -1.3504e-03],
        [ 1.4420e-03, -1.6968e-02,  3.1586e-03,  ...,  2.9907e-03,
          9.5215e-03,  4.8828e-03],
        ...,
        [ 2.2127e-23,  3.9033e-24,  2.1610e-23,  ...,  6.3693e-23,
         -2.6496e-24, -2.3575e-23],
        [ 2.2851e-23, -2.2101e-24, -2.2230e-23,  ...,  2.7917e-23,
          8.6854e-24, -3.7016e-23],
        [-8.8508e-23, -7.5687e-23,  6.4882e-24,  ...,  5.8937e-24,
         -6.4520e-23, -2.7142e-24]])
b'\x80\x04\x95Q\x01\x00\x00\x00\x00\x00\x00\x8c torch.multiprocessing.reductions\x94\x8c\x0erebuild_tensor\x94\x93\x94\x8c\x05torch\x94\x8c\x06Tensor\x94\x93\x94h\x00\x8c\x15rebuild_typed_storage\x94\x93\x94h\x00\x8c\x12rebuild_storage_fd\x94\x93\x94\x8c\rtorch.storage\x94\x8c\x0eUntypedStorage\x94\x93\x94\x8c\x1fmultiprocessing.resource_sharer\x94\x8c\x05DupFd\x94\x93\x94)\x81\x94}\x94\x8c\x03_id\x94\x8c$/tmp/pymp-p7p735mm/listener-5w86fik5\x94K\x01\x86\x94sbJ\x00\x00@}\x87\x94R\x94\x8c\x05torch\x94\x8c\x07float32\x94\x93\x94\x86\x94R\x94(K\x00h\x03\x8c\x04Size\x94\x93\x94J\x00\xf5\x01\x00M\x00\x10\x86\x94\x85\x94R\x94M\x00\x10K\x01\x86\x94\x89t\x94\x87\x94R\x94.'
Traceback (most recent call last):
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 482, in accept
    deliver_challenge(c, self._authkey)
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 941, in deliver_challenge
    _verify_challenge(authkey, message, response)
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 925, in _verify_challenge
    raise AuthenticationError('digest received was wrong')
multiprocessing.context.AuthenticationError: digest received was wrong
[2025-04-03 14:42:45 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 2011, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 614, in event_loop_overlap
    self.process_input_requests(recv_reqs)
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 778, in process_input_requests
    output = self._request_dispatcher(recv_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/utils.py", line 471, in __call__
    return fn(obj)
           ^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/scheduler.py", line 1773, in update_weights_from_tensor
    success, message = self.tp_worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 232, in update_weights_from_tensor
    success, message = self.worker.update_weights_from_tensor(recv_req)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/managers/tp_worker.py", line 218, in update_weights_from_tensor
    success, message = self.model_runner.update_weights_from_tensor(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/model_executor/model_runner.py", line 592, in update_weights_from_tensor
    (name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/model_executor/model_runner.py", line 1077, in _unwrap_tensor
    tensor = tensor.get(tp_rank)
             ^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/model_executor/model_runner.py", line 1089, in get
    return MultiprocessingSerializer.deserialize(self.values[rank])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/sglang/python/sglang/srt/utils.py", line 1531, in deserialize
    return ForkingPickler.loads(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 525, in Client
    answer_challenge(c, authkey)
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/multiprocessing/connection.py", line 964, in answer_challenge
    raise AuthenticationError('digest sent was rejected')
multiprocessing.context.AuthenticationError: digest sent was rejected

[2025-04-03 14:42:45] Received sigquit from a child process. It usually means the child failed.
Traceback (most recent call last):
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connection.py", line 516, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/http/client.py", line 1430, in getresponse
    response.begin()
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/util/retry.py", line 474, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/urllib3/connection.py", line 516, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/http/client.py", line 1430, in getresponse
    response.begin()
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/private/home/kaiyuy/WEI/wei-agent/tmp.py", line 25, in <module>
    requests.post(
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/kaiyuy/miniconda3/envs/wei/lib/python3.12/site-packages/requests/adapters.py", line 682, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I'm wondering if I'm still misunderstanding how /update_weights_from_tensor should be called and would appreciate your help!

jhinpan · 2025-04-04T04:35:44Z

cc @fzyzcjy. Hi Tom, would you mind taking a look at this inter-process communication issue? Huge thanks!

fzyzcjy · 2025-04-04T05:40:03Z

@yangky11 Hi, the implementation approach in this PR (HttpServerEngineAdapter, update_weights_from_tensor in http server, etc) was proposed by me, thus I will handle this.

The digest error seems to occur because a CPU tensor is being sent across processes (when calling update weights) and the two processes do not have a parent-child relationship. Moving the tensor to the GPU should therefore solve the problem. If the tensor needs to stay on the CPU for some reason (I would like to know why, since that seems to introduce extra CPU-GPU copies in the RL scenario), I can make a PR for it.
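To illustrate why unrelated processes fail this handshake: the traceback above passes through `multiprocessing.resource_sharer`, which authenticates with `current_process().authkey`. Child processes inherit the parent's key, while a separately launched server has its own random key. The sketch below is a standard-library illustration only (it is not SGLang code, and `try_handshake` is a hypothetical helper name); it reproduces the same pair of digest errors by connecting with mismatched keys.

```python
import threading
from multiprocessing import AuthenticationError
from multiprocessing.connection import Client, Listener


def try_handshake(server_key: bytes, client_key: bytes) -> dict:
    """Connect a Client to a Listener and report what each side observed."""
    outcome = {}
    # Port 0 lets the OS pick a free port; listener.address holds the real one.
    with Listener(("127.0.0.1", 0), authkey=server_key) as listener:

        def serve():
            try:
                with listener.accept() as conn:
                    conn.send("tensor handle")
                outcome["server"] = "ok"
            except AuthenticationError as exc:
                outcome["server"] = str(exc)

        thread = threading.Thread(target=serve)
        thread.start()
        try:
            with Client(listener.address, authkey=client_key) as conn:
                outcome["client"] = conn.recv()
        except AuthenticationError as exc:
            outcome["client"] = str(exc)
        thread.join()
    return outcome


if __name__ == "__main__":
    # Same key on both sides, as between a parent and its child process.
    print(try_handshake(b"shared-key", b"shared-key"))
    # Different keys, as between two independently started processes.
    print(try_handshake(b"server-key", b"client-key"))
```

With matching keys the client receives the message; with different keys both sides raise `AuthenticationError`, which matches the two errors in the log above.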

Btw I am also doing AI for mathematics (as an individual researcher), and am interested in engineering such as SGLang. If there are things that SGLang needs to do for AI4math, I am happy to discuss.

fzyzcjy · 2025-04-04T05:40:34Z

python/sglang/srt/entrypoints/http_server.py

+    import base64
+    import pickle


nit: maybe we can move this import to top of file

fzyzcjy · 2025-04-04T05:41:28Z

python/sglang/srt/entrypoints/http_server.py

+@app.post("/update_weights_from_tensor")
+async def update_weights_from_tensor(
+    obj: UpdateWeightsFromTensorReqInput, request: Request
+):
+    obj.serialized_named_tensors = [
+        deserialize_from_http(item) for item in obj.serialized_named_tensors
+    ]


Wondering whether the API should accept Union[UpdateWeightsFromTensorReqInput, str], where the "str" form is the whole request serialized into a base64 string; this may be a bit more general.

fzyzcjy · 2025-04-04T05:43:22Z

python/sglang/srt/entrypoints/http_server_engine.py

+from sglang.test.test_utils import (
+    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
+    popen_launch_server,
+)


nit: may not be very ideal if we import test code from main code

fzyzcjy · 2025-04-04T05:44:09Z

python/sglang/srt/entrypoints/http_server_engine.py

+def serialize_for_http(data):
+    # First pickle the data, then convert to base64 for safe HTTP transmission
+    pickled = pickle.dumps(data)
+    return base64.b64encode(pickled).decode("utf-8")


nit: maybe we can put serialize_for_http and deserialize_from_http, into one common place, because they are pairs that will be used together (for example, maybe write like MultiprocessingSerializer)
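A minimal sketch of what such a paired helper could look like, assuming `deserialize_from_http` is simply the inverse of the `serialize_for_http` shown above (its body is not quoted in this review). The class name `HttpSerializer` is the one that appears later in this thread, but this body is an illustration, not the merged implementation.

```python
import base64
import pickle
from typing import Any


class HttpSerializer:
    """Keeps the encode/decode pair together so the two cannot drift apart."""

    @staticmethod
    def serialize(data: Any) -> str:
        # Pickle first, then base64-encode so the result is JSON-safe text.
        return base64.b64encode(pickle.dumps(data)).decode("utf-8")

    @staticmethod
    def deserialize(text: str) -> Any:
        # Inverse of serialize(). pickle.loads must only see trusted input.
        return pickle.loads(base64.b64decode(text))


if __name__ == "__main__":
    encoded = HttpSerializer.serialize({"name": "embed_tokens.weight"})
    print(HttpSerializer.deserialize(encoded))
```

Keeping both directions in one class means a format change touches a single place, and a round-trip test covers both functions at once.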

fzyzcjy · 2025-04-04T05:44:48Z

python/sglang/srt/entrypoints/http_server_engine.py

+    return base64.b64encode(pickled).decode("utf-8")
+
+
+import dataclasses


nit: would be great to make import at top of file

fzyzcjy · 2025-04-04T05:58:04Z

python/sglang/srt/entrypoints/verl_engine.py

+                    kwargs["log_level"] = "error"
+                server_args = ServerArgs(**kwargs)
+            if self._tp_rank == 0:
+                self._engine = HttpServerEngineAdapter(server_args)


wondering whether we are missing tp_size, node_rank, nnodes here

fzyzcjy · 2025-04-04T05:58:46Z

python/sglang/srt/entrypoints/verl_engine.py

+                    kwargs["log_level"] = "error"
+                server_args = ServerArgs(**kwargs)
+            if self._tp_rank == 0:
+                self._engine = HttpServerEngineAdapter(server_args)


to make Engine and HttpServerEngineAdapter be as identical as possible, it would be great to let HttpServerEngineAdapter accept same args as Engine, i.e. accept kwargs instead of ServerArgs.

fzyzcjy · 2025-04-04T06:00:07Z

python/sglang/srt/entrypoints/verl_engine.py

+            if "server_args" in kwargs:
+                # Directly load server_args
+                server_args = kwargs["server_args"]
+            else:
+                # Construct server_args from kwargs
+                if "log_level" not in kwargs:
+                    # Do not print logs by default
+                    kwargs["log_level"] = "error"
+                server_args = ServerArgs(**kwargs)


Hmm looks like the code is copied from Engine, if so then

it would be great to avoid copying code, e.g. extract a function compute_server_args_from_kwargs

the layering is a little bit weird if we keep it: the same logic, "computing serverargs from kwargs", is put (a) in Engine/Adapter layer, and (b) in the VerlEngine layer which is one layer above Engine/Adapter. thus would be great to move it to Adapter.
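The extraction suggested above could be sketched as follows. The `ServerArgs` dataclass here is a small stand-in with made-up fields so that the example runs on its own; the real `ServerArgs` in SGLang has many more options, and the function name `compute_server_args_from_kwargs` is only the one proposed in this comment.

```python
from dataclasses import dataclass


@dataclass
class ServerArgs:
    """Stand-in for sglang's ServerArgs (assumption: illustrative fields only)."""

    model_path: str
    log_level: str = "info"
    port: int = 30000


def compute_server_args_from_kwargs(**kwargs) -> ServerArgs:
    """Shared helper so Engine and the adapter resolve arguments the same way."""
    if "server_args" in kwargs:
        # Caller already built the object: use it directly.
        return kwargs["server_args"]
    # Otherwise construct it from kwargs; do not print logs by default.
    kwargs.setdefault("log_level", "error")
    return ServerArgs(**kwargs)


if __name__ == "__main__":
    print(compute_server_args_from_kwargs(model_path="some/model"))
```

With a helper like this, the layer above (VerlEngine) could pass its kwargs straight through and leave argument resolution to the Engine/Adapter layer.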

fzyzcjy · 2025-04-04T06:01:39Z

python/sglang/srt/entrypoints/verl_engine.py

            os.environ["SGLANG_BLOCK_NONZERO_RANK_CHILDREN"] = "0"
            self._engine = Engine(
                **kwargs, tp_size=self._tp_size, node_rank=node_rank, nnodes=nnodes
            )
+        elif "launch_server" in kwargs and kwargs["launch_server"]:


the top-level if-elif-else currently does:

if: for Engine, part 1

elif: for Adapter

else: for Engine, part 2

Thus would be great to have

if backend == 'engine':
    if first rank in node:
        self._engine = Engine()
    else:
        self._engine = None
elif backend == 'server':
    if rank zero:
        self._engine = Adapter()
    else:
        self._engine = None
else:
    raise error

fzyzcjy · 2025-04-04T06:02:43Z

test/srt/test_verl_engine_server.py

Given that "VerlEngine using Engine" and "VerlEngine using Adapter" should behave almost identically for users, maybe we should not create a new test, but instead change a few lines in the original test. This both reduces code duplication and ensures we are (implicitly) checking that they behave similarly.

(I have not checked this file - will check after this change)

yitianlian · 2025-04-04T14:32:40Z

Given the comments from @fzyzcjy , I've rewritten my code and addressed most of the issues that were raised. However, there are still a few points that need further discussion:

In the update_weights_from_tensor function of http_server, support for str input has already been added, but I still need to write a corresponding test.
For the EngineBase class, I think we should have a more detailed discussion on how to design and implement it.
I haven't yet merged the current test file with the verl_engine test file.

@jhinpan

yangky11 · 2025-04-04T14:46:23Z

python/sglang/srt/entrypoints/http_server_engine.py

+                    HttpSerializer.serialize(
+                        MultiprocessingSerializer.serialize(named_tensors)
+                    )
+                    for _ in range(self.server_args.tp_size)


I'm wondering if here we can call the serialization only once.

I remember MultiprocessingSerializer.serialize(named_tensors) can't be posted by HTTP?

I meant something like

x = HttpSerializer.serialize(MultiprocessingSerializer.serialize(named_tensors))
response = requests.post(
    self._url("update_weights_from_tensor"),
    json={"serialized_named_tensors": [x for _ in range(self.server_args.tp_size)] ...

Oh, sure! I will fix it now.

Or we can just send a single copy of HttpSerializer.serialize(MultiprocessingSerializer.serialize(named_tensors)) to reduce the HTTP payload size? The server can make multiple copies after receiving it.

The serialized tensors are really small, so I think sending them will not take long. Also, it seems that the update_weights_from_tensor entry point doesn't have access to the TP size.

fzyzcjy · 2025-04-09T12:37:16Z

One way may be

# HttpServerEngineAdapter
def update_weights_from_tensor(self, named_tensors, ...):
    send_the_http_request(dict(
        serialized_named_tensors=[MultiprocessingSerializer.serialize(named_tensors, output_str=True) for _ in range(tp_size)],
        ...
    ))

# http_server.py
def update_weights_from_tensor(obj: UpdateWeightsFromTensorReqInput):
    tokenizer_manager.update_weights_from_tensor(obj) # no conversion needed

with

class MultiprocessingSerializer:
    def serialize(..., output_str: bool = False):
        """output_str: ... useful when the serialized data needs to be transferred as JSON or in other bytes-incompatible scenarios"""
        ... old code ...
        if output_str: output = base64encode(output)

    def deserialize(..., data):
        if isinstance(data, str): data = base64decode(data)
        ... old code ...

Remarks

Users who directly call raw HTTP API will write code similar to the HttpServerEngineAdapter code above
This HTTP API seems flexible and allows users to provide different tensors for different TP ranks, which is especially useful in multi-node scenarios or when trying to reduce cross-GPU copies
We may be able to remove LocalSerializedTensor (e.g. in a future PR) to simplify if we have this
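The proposal above can be sketched as a runnable class. This is an illustration under assumptions: it uses the standard-library `ForkingPickler` (the traceback earlier in the thread shows SGLang's `deserialize` calling `ForkingPickler.loads`), the "old code" bodies are guesses, and the `output_str` flag is the one proposed in this comment rather than a confirmed SGLang parameter.

```python
import base64
import io
from multiprocessing.reduction import ForkingPickler
from typing import Any, Union


class MultiprocessingSerializer:
    @staticmethod
    def serialize(obj: Any, output_str: bool = False) -> Union[bytes, str]:
        """Pickle obj; with output_str=True return base64 text that fits in JSON."""
        buf = io.BytesIO()
        ForkingPickler(buf).dump(obj)
        output = buf.getvalue()
        if output_str:
            output = base64.b64encode(output).decode("utf-8")
        return output

    @staticmethod
    def deserialize(data: Union[bytes, str]) -> Any:
        """Accept either raw bytes or the base64 text produced above."""
        if isinstance(data, str):
            data = base64.b64decode(data)
        return ForkingPickler.loads(data)


if __name__ == "__main__":
    text = MultiprocessingSerializer.serialize([("w", [1, 2, 3])], output_str=True)
    print(MultiprocessingSerializer.deserialize(text))
```

Because `deserialize` accepts both forms, the HTTP handler would not need a separate conversion step, which is the point of the "no conversion needed" comment above.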

fzyzcjy

It seems this PR needs to be merged ASAP, so I tried not to review too carefully and only pointed out doc or typing issues that can be fixed within a minute.

fzyzcjy · 2025-04-10T23:24:39Z

python/sglang/srt/entrypoints/http_server_engine.py

+class HttpServerEngineForRL(EngineBase):
+    def __init__(self, **kwargs):
+        self.server_args = ServerArgs(**kwargs)
+        print(f"launch_server_from_verl_engine {self.server_args.port}")


yes (or at least rename the word "launch_server_from_verl_engine", which seems to be the old name of a function that we called)

fzyzcjy · 2025-04-10T23:24:51Z

python/sglang/srt/entrypoints/http_server_engine.py

+        print(f"launch_server_from_verl_engine {self.server_args.port}")
+        self.process = launch_server_process(self.server_args)
+
+    def _make_request(self, endpoint: str, payload: dict = None):


Suggested change

def _make_request(self, endpoint: str, payload: dict = None):

def _make_request(self, endpoint: str, payload: Optional[dict] = None):

fzyzcjy · 2025-04-10T23:25:36Z

python/sglang/srt/entrypoints/http_server_engine.py

+        flush_cache: bool = False,
+    ):
+        """
+        Update model weights from tensor data. The HTTPS server will only post meta data, and the real weights will be copied directly from GPUs.


nit: it seems most people will use HTTP instead of HTTPS (indeed wondering whether SGLang supports https today), thus would be great to change doc

(same for other "HTTPS" words)

fzyzcjy · 2025-04-10T23:27:09Z

python/sglang/srt/managers/io_struct.py

+    - No pickle serialization is used for security reasons
+    """
+
+    serialized_named_tensors: List[str]


Suggested change

serialized_named_tensors: List[str]

serialized_named_tensors: List[Union[str, bytes]]

fzyzcjy · 2025-04-10T23:27:38Z

python/sglang/srt/managers/io_struct.py

+
+    - Binary data like tensors are base64 encoded
+    - Data is structured in JSON for easy transmission over HTTP
+    - No pickle serialization is used for security reasons


well there is pickle indeed... (to serialize torch.Tensors)

fzyzcjy · 2025-04-10T23:28:12Z

python/sglang/srt/managers/io_struct.py

-    flush_cache: bool
+    """Update model weights from tensor input.
+
+    - Binary data like tensors are base64 encoded


it is not base64 encoded when this object is created from Engine...

fzyzcjy · 2025-04-10T23:28:47Z

test/srt/test_verl_engine_server.py

(a bit confused)

jhinpan · 2025-04-11T00:30:40Z

cc @fzyzcjy. Shout out to you for your great help and guidance! I quickly made the changes you suggested in your review. Please let me know if there is anything left that we need to take care of or discuss!

zhaochenyang20

Great!

fzyzcjy · 2025-04-12T00:21:07Z

If the PR is urgent, then I have no big issues with it.

Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com>

<streetyao@live.com> * fix: use DeepEPDispatcher on CUDA (sgl-project#5180) * [DeepEP] fix: import buffer error (sgl-project#5179) * Let `bench_one_batch` support `enable_dp_attention` (sgl-project#4058) * [Misc] clean up vllm in sgl-kernel test (sgl-project#5189) * Fix ci test "test_eval_fp8_accuracy" failed (sgl-project#5185) Co-authored-by: wunhuang <wunhuang@amd.com> * Optimize topk operation in llama4 (sgl-project#5128) * Support Llama4 fp8 inference (sgl-project#5194) Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: sleepcoo <sleepcoo@gmail.com> Co-authored-by: zhyncs <me@zhyncs.com> * [ci] fix ci test fused_moe op (sgl-project#5102) * model: support mllama4 (sgl-project#5144) * update grok test (sgl-project#5171) * sgl-kernel use cutlass latest version for fp8 blockwise gemm (sgl-project#5207) * Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5196) * fix: log warning when disable cuda graph (sgl-project#5209) * [metrics] Add in queue metrics (sgl-project#4444) * Fix DeepSeek error when using DeepEP mode (sgl-project#5190) * reduce moe_align_block_size_kernel small batch mode overhead (sgl-project#5086) * [PD] Support KV transfer with mooncake (sgl-project#4880) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com> Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com> Co-authored-by: shangmingc <csmthu@gmail.com> * [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool (sgl-project#5204) * Update deps for mllama4 (sgl-project#5215) * Fix deepseek-v3 with torch.compile in PyTorch 2.6. 
(sgl-project#5213) * ROCm sgl-kernel: compatible to later torch (sgl-project#5167) * [Misc] Clean sgl-kernel test (sgl-project#5216) * Update Makefile / build script to avoid installing incompatible torch dependency (sgl-project#5245) * Fix torch.compile cacheing (sgl-project#5259) Co-authored-by: zhyncs <me@zhyncs.com> * ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations (sgl-project#5228) * Optimize attention in llama4 (sgl-project#5127) * Optimize GPU memory usage in FlashAttentionBackend's strided indexing (sgl-project#5262) Co-authored-by: ch-wan <cwan39@gatech.edu> * Support `--enable-llama4-multimodal` (sgl-project#5254) * [fix] fix mrope positions not picked up (sgl-project#5265) * doc: nested loop code for offline engine (sgl-project#5244) * fix: examples for token_in_token_out_vlm (sgl-project#5193) * Fix a 404 link in send_request.ipynb (sgl-project#5280) Signed-off-by: windsonsea <haifeng.yao@daocloud.io> * fix: enable fp4 compilation on cu128 (sgl-project#5286) * feat: add cu128 identifier for sgl-kernel (sgl-project#5287) * chore: relax the torch version restriction for sgl-kernel compilation (sgl-project#5288) * chore: bump sgl-kernel v0.0.8.post1 (sgl-project#5289) * [PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout (sgl-project#5292) * [Docs] Supported Model Docs - Major restructuring (sgl-project#5290) Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com> * fix: update update_wheel_index for cu128 (sgl-project#5300) * [Docs] Remove the older supported docs section (sgl-project#5301) * remove moe_align_block_size torch.zeros in small batch/expert mode (sgl-project#5298) * feat: add blackwell Dockerfile (sgl-project#5302) * feat: add blackwell workflow (sgl-project#5303) * fix: use fa3 unit test on hopper only (sgl-project#5304) * misc: update blackwell Dockerfile (sgl-project#5306) * fix: remove cublas_grouped_gemm (sgl-project#5307) * fix: update flash attn (sgl-project#5308) * fix: use deepgemm 
only on hopper (sgl-project#5310) * [VLM] Adopt fast image processor by default (sgl-project#5065) * Adjust ci test threshold (sgl-project#5271) * Blackwell Cutlass MLA kernel (sgl-project#5142) * misc: cleanup 3rdparty (sgl-project#5311) * update variable naming and comments for rocm (sgl-project#5299) * Fix w8a8_int8 model shared experts fusion load weights error (sgl-project#5120) * Add flash_attn_varlen_func to sgl-kernel (sgl-project#5315) * Fix fa3 window size setup (sgl-project#5316) * chore: bump sgl-kernel v0.0.8.post2 (sgl-project#5317) * feat: use fa3 mla by default on hopper (sgl-project#5210) Co-authored-by: yundai424 <yundai424@gmail.com> Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> * Fix: docs/backend/structured_outputs.ipynb (sgl-project#4884) * Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… (sgl-project#5321) * refine fused_moe tuning docs (sgl-project#5294) * Support server based rollout in Verlengine (sgl-project#4848) Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com> * [Feat] Add sparse attn to sgl-kernel (sgl-project#5327) * fix: solve cu118 issue for cutlass mla (sgl-project#5331) * chore: bump sgl-kernel v0.0.8.post3 (sgl-project#5332) * ci: update release node (sgl-project#5333) * fix: determine if flashinfer is installed (sgl-project#5336) * feat: adapt merge_state (sgl-project#5337) * misc: update sagemaker Dockerfile (sgl-project#5341) * Fix: Ensure tensors for dist.broadcast match NCCL backend device (sgl-project#5322) * docs: update adoption and sponsorship list with Oracle (sgl-project#5343) * chore: upgrade sgl-kernel 0.0.8.post3 (sgl-project#5342) * Fix typo: infight -> inflight (sgl-project#5357) * [PD] Add transfer backend abstraction (sgl-project#5328) * fix MLATokenToKVPoolHost get_size_per_token bug (sgl-project#5161) Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> * fix sgl-project#5322 
(sgl-project#5359) * feat: update experiment_runner (sgl-project#5360) * [DeepEP] Reduce routed scaling overhead (sgl-project#5277) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> * Free metadata_buffer_index after transfer finished (sgl-project#5364) * Free metadata_buffer_index after transfer finished (sgl-project#5364) * Fix DeepSeek DP Attention + torch compile (sgl-project#5367) Co-authored-by: ispobock <ispobaoke@163.com> * Support for Qwen2.5-VL Model in bitsandbytes Format (sgl-project#5003) * Fix PD disaggregation bugs (sgl-project#5326) * [PD Bug] fix MLA get_contiguous_buf_infos error (sgl-project#5384) * [perf] experimental enhance fp8 per-tensor quant (sgl-project#5370) * Apply deepseek cuda rope (sgl-project#5385) Co-authored-by: Yineng Zhang <me@zhyncs.com> * apply fused moe gate in ds v3/r1 (sgl-project#5371) Co-authored-by: Yineng Zhang <me@zhyncs.com> * fix: update test config (sgl-project#5392) * [Fix] Turn off DeepGEMM by default (sgl-project#5263) * minor clean up of sgl-kernel/CMakeLists.txt (sgl-project#5393) * Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5368) * Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5291) Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com> * [fix/misc] remove duplicate row in deepseek v2 model (sgl-project#5279) * chore: upgrade DeepGEMM (sgl-project#5395) * fix: update pr-test-sgl-kernel (sgl-project#5399) * kernel: support slightly faster merge_state_v2 cuda kernel (sgl-project#5381) * chore: bump sgl-kernel 0.0.9 (sgl-project#5400) * chore: upgrade sgl-kernel 0.0.9 (sgl-project#5401) * Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (sgl-project#5406) * Fix bench_serving with random-ids (sgl-project#5214) * [misc] fix ci flaky case (sgl-project#5352) * [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (sgl-project#5412) * Support 
dynamic connection and TP 16 (sgl-project#5351) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> * Fix broadcast use cuda device lead to memory capacity unbalanced (sgl-project#5416) * [PD] Fix dynamic port support and MLA buffer for Mooncake (sgl-project#5415) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Co-authored-by: ybyang <ybyang7@iflytek.com> * Distinguish bootstrap key only in decode server (sgl-project#5422) * [PD] Remove unused bootstrap param and fix port table type (sgl-project#5423) * [minor] cleanup cmakelists.txt (sgl-project#5420) * bugfix: fix merge_state_v2 cuda graph (sgl-project#5419) * chore: bump sgl-kernel v0.0.9.post1 (sgl-project#5430) * fix: solve release issue (sgl-project#5434) * BLackwell cutlass mla: Add check for bad page size/block num combinations (sgl-project#5431) * feat: update model_specific_adjustment (sgl-project#5344) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> * chore: upgrade sgl-kernel 0.0.9.post1 (sgl-project#5436) * Fix ignore_eos parameter when loading a chat template (sgl-project#5264) * add attention backend supporting matrix in the doc (sgl-project#5211) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> * Support BNB quantization for llama/mllama (sgl-project#5038) Co-authored-by: Yuhao Yang <yyh073@foxmail.com> * [Docs] Update start/install.md (sgl-project#5398) * [Minor] Move torch.compile patch to a better place (sgl-project#5397) * [Bug fix] need record start time in pd mode (sgl-project#5425) * Support MHA with chunked prefix cache for DeepSeek chunked prefill (sgl-project#5113) * chore: bump v0.4.5.post1 (sgl-project#5445) * Revert "[SW-226289] rebase sglang to tag v0.4.5 (sgl-project#12)" This reverts commit 0eac714. 
--------- Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: Kay Yan <kay.yan@daocloud.io> Signed-off-by: windsonsea <haifeng.yao@daocloud.io> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Juwan Yoo <ryan@tmfi.us> Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: chaobo jia <91889375+jcbjcbjc@users.noreply.github.com> Co-authored-by: rudy152 <czh1137892874@gmail.com> Co-authored-by: Fr4nk1in <sh.fu@outlook.com> Co-authored-by: yinfan98 <1106310035@qq.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Sleepcoo <Sleepcoo@gmail.com> Co-authored-by: SEPLOS <seplos@aliyun.com> Co-authored-by: ch-wan <cwan39@gatech.edu> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com> Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: GeLee <leege233@gmail.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> Co-authored-by: zcnrex <zcnrex@gmail.com> Co-authored-by: Kaiyu Yang <yangky@umich.edu> Co-authored-by: renxin <90580890+renxinx@users.noreply.github.com> Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com> Co-authored-by: yuethe <yuethe@tencent.com> Co-authored-by: simveit <69345428+simveit@users.noreply.github.com> 
Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn> Co-authored-by: Ravi Theja <ravi03071991@gmail.com> Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: Tommy Yang <tommyyang0524@gmail.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: tianhaoyu <thy@mail.ecust.edu.cn> Co-authored-by: DarkSharpness <2040703891@qq.com> Co-authored-by: Yun Dai <yundai424@gmail.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com> Co-authored-by: wunhuang <wunhuang@amd.com> Co-authored-by: root <root@dell300x-pla-t10-17.pla.dcgpu> Co-authored-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: saienduri <saimanas.enduri@amd.com> Co-authored-by: DangKai <dangkai4u@outlook.com> Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com> Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: Ma Mingfei <mingfei.ma@intel.com> Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com> Co-authored-by: YanbingJiang <yanbing.jiang@intel.com> Co-authored-by: blzheng <beilei.zheng@intel.com> Co-authored-by: Byron Hsu <byronhsu1230@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: grimoire <streetyao@live.com> Co-authored-by: HandH1998 <1335248067@qq.com> 
Co-authored-by: Zhaoyang Hao <77828610+Muuuchen@users.noreply.github.com> Co-authored-by: Teng Ma <805522925@qq.com> Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com> Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com> Co-authored-by: Richard Zou <zou3519@users.noreply.github.com> Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com> Co-authored-by: Michael Yao <haifeng.yao@daocloud.io> Co-authored-by: Yusong Gao <yusong.gao@icloud.com> Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com> Co-authored-by: lambert0312 <lambert80.ios@gmail.com> Co-authored-by: tianlian yi <91449279+yitianlian@users.noreply.github.com> Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com> Co-authored-by: yulei <yuulei12@gmail.com> Co-authored-by: Yongtong Wu <914554688@qq.com> Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com> Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com> Co-authored-by: Yangcheng Li <bluebluelitchi@hotmail.com> Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: ybyang <ybyang7@iflytek.com> Co-authored-by: mRSun15 <3150105645@zju.edu.cn> Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com> Co-authored-by: Yuhao Yang <yyh073@foxmail.com>

yitianlian and others added 4 commits March 27, 2025 12:24

update http_server_engine and part of the test

10b7227

Add other 3 APIs

e79d178

update http_server_engine and test

cd89cc3

Merge branch 'main' into feature/http_server_engine

7e71b16

yitianlian mentioned this pull request Mar 28, 2025

veRL-SGLang Roadmap zhaochenyang20/Awesome-ML-SYS-Tutorial#74

Open

13 tasks

zhaochenyang20 and others added 2 commits March 27, 2025 22:39

Merge branch 'main' into feature/http_server_engine

99be7b2

Merge branch 'main' into feature/http_server_engine

7c3202d

Merge branch 'main' into feature/http_server_engine

87302d3

zhaochenyang20 marked this pull request as ready for review April 2, 2025 03:54

zhaochenyang20 requested review from merrymercy, Ying1123, zhyncs, hnyls2002, ispobock and ByronHsu as code owners April 2, 2025 03:54

fzyzcjy self-assigned this Apr 4, 2025

fzyzcjy reviewed Apr 4, 2025

yitianlian added 2 commits April 4, 2025 14:06

revise most of problems in comments

20be928

revise most of problems in comments

7b5dfae

yangky11 reviewed Apr 4, 2025

reduce the serialize number

181030c

Refactoring Code Structure

db3937e

jhinpan requested a review from xiezhq-hermann as a code owner April 9, 2025 01:23

jhinpan and others added 4 commits April 8, 2025 20:54

Merge branch 'main' into feature/http_server_engine

0963bcd

For Sync

c266d4a

Revert MP in Engine

dca2e96

Merge branch 'main' into feature/http_server_engine

dd4ac15

yitianlian and others added 5 commits April 9, 2025 14:40

update method of updating weights

d38ea8d

Merge branch 'main' into feature/http_server_engine

e148a50

Merge branch 'main' into feature/http_server_engine

5f77d4b

update name

99dcc14

Merge branch 'main' into feature/http_server_engine

ae05db5

fzyzcjy reviewed Apr 10, 2025

zhaochenyang20 and others added 2 commits April 10, 2025 17:01

Merge branch 'main' into feature/http_server_engine

e78bdfe

Quick fix for review

ae2130b

jhinpan and others added 3 commits April 11, 2025 00:32

One other HTTP clarification

128def0

update doc

59992df

Merge branch 'main' into feature/http_server_engine

78542c9

zhaochenyang20 approved these changes Apr 11, 2025

Merge branch 'main' into feature/http_server_engine

8f95856

Merge branch 'main' into feature/http_server_engine

eea5eec

zhaochenyang20 approved these changes Apr 12, 2025

zhyncs merged commit bc92107 into sgl-project:main Apr 12, 2025
23 checks passed

		return base64.b64encode(pickled).decode("utf-8")


		import dataclasses

	def _make_request(self, endpoint: str, payload: dict = None):
	def _make_request(self, endpoint: str, payload: Optional[dict] = None):

	serialized_named_tensors: List[str]
	serialized_named_tensors: List[Union[str, bytes]]
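The review fragments above touch three details: a payload is pickled and then base64-encoded into text, the `payload` argument is typed `Optional[dict]` instead of `dict = None`, and `serialized_named_tensors` may hold `str` or `bytes`. A minimal sketch of how these pieces could fit together, using only the standard library. The names `serialize`, `deserialize`, `UpdateWeightsRequest` and `build_payload` are hypothetical and are not taken from the PR, and plain Python lists stand in for tensors.

```python
import base64
import dataclasses
import pickle
from typing import Any, List, Optional, Union


def serialize(obj: Any) -> str:
    """Pickle an object and encode it as base64 text so it fits in a JSON body."""
    pickled = pickle.dumps(obj)
    return base64.b64encode(pickled).decode("utf-8")


def deserialize(data: Union[str, bytes]) -> Any:
    """Inverse of serialize. Only unpickle data from a trusted sender."""
    return pickle.loads(base64.b64decode(data))


@dataclasses.dataclass
class UpdateWeightsRequest:
    # Hypothetical request shape: one serialized entry per shard or rank.
    serialized_named_tensors: List[Union[str, bytes]]


def build_payload(payload: Optional[dict] = None) -> dict:
    """Normalize an optional payload; Optional[dict] makes the None default explicit."""
    return {} if payload is None else dict(payload)


# Round trip: stand-in "tensors" -> request -> JSON-like dict -> restored objects.
named_tensors = [("layer.weight", [1.0, 2.0])]
request = UpdateWeightsRequest(serialized_named_tensors=[serialize(named_tensors)])
body = build_payload(dataclasses.asdict(request))
restored = deserialize(body["serialized_named_tensors"][0])
```

Encoding to text keeps the body JSON-safe at the cost of roughly a one-third size increase from base64; the sketch does not claim this is how the merged code handles large tensors.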

Conversation

yitianlian commented Mar 28, 2025 • edited by zhaochenyang20

zhaochenyang20 commented Mar 28, 2025

yangky11 commented Apr 3, 2025

zhaochenyang20 commented Apr 3, 2025

yitianlian commented Apr 3, 2025

yitianlian commented Apr 3, 2025 • edited

yangky11 commented Apr 3, 2025

jhinpan commented Apr 4, 2025

fzyzcjy commented Apr 4, 2025

yitianlian commented Apr 4, 2025

fzyzcjy commented Apr 9, 2025 • edited

fzyzcjy left a comment