p12tic
Contributor

@p12tic p12tic commented May 5, 2025

Motivation

In setups with long periods of inactivity, it is desirable to reduce system power consumption while sglang is idle. This saves power and also leaves more CPU thermal headroom for when a request eventually arrives. It matters most when multiple GPUs are connected, because each GPU would otherwise pin one thread at 100% CPU usage.

The primary use case is residential or small commercial deployments serving LLMs to a very small number of users.

FIXES #1730.

This PR includes a simpler alternative to #1731. There is less risk of adverse performance impact because the affected code runs when the server is idle.

The reason given for closing #1731 was an in-progress refactoring that would resolve the performance problem by itself. Because half a year has passed since then, I'm bringing up a fix again in the hope that it will be deemed acceptable this time.

Modifications

The proposed solution is to use zmq.Poller on all sockets that may receive data that needs handling immediately. Doing this when idle should carry little risk and have little latency impact.
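
As a rough illustration of the idea, and not the code from this PR, the sketch below blocks on a zmq.Poller until a socket becomes readable instead of spinning in a busy loop. The helper names `wait_for_work` and `demo`, the `inproc://` endpoint, and the timeout values are all hypothetical and chosen only for this example; it assumes pyzmq is installed.

```python
import zmq


def wait_for_work(sockets, timeout_ms=1000):
    """Block until any socket is readable or the timeout expires.

    Returns the sockets that have a message waiting (possibly empty).
    Hypothetical helper for illustration; not part of sglang.
    """
    poller = zmq.Poller()
    for sock in sockets:
        poller.register(sock, zmq.POLLIN)
    # poll() blocks in the OS rather than spinning, so an idle
    # process consumes almost no CPU while it waits.
    events = dict(poller.poll(timeout_ms))
    return [s for s in sockets if events.get(s, 0) & zmq.POLLIN]


def demo():
    """Show one idle wait followed by one wake-up on an incoming message."""
    ctx = zmq.Context()
    receiver = ctx.socket(zmq.PULL)
    sender = ctx.socket(zmq.PUSH)
    # Assumed in-process endpoint, used only for this demo.
    receiver.bind("inproc://sleep-on-idle-demo")
    sender.connect("inproc://sleep-on-idle-demo")
    try:
        # Nothing has been sent yet, so this times out with no ready sockets.
        idle = wait_for_work([receiver], timeout_ms=50)
        sender.send(b"request")
        # A message is pending now, so the poller reports the receiver.
        ready = wait_for_work([receiver], timeout_ms=1000)
        payload = ready[0].recv() if ready else None
        return idle, payload
    finally:
        sender.close(linger=0)
        receiver.close(linger=0)
        ctx.term()


if __name__ == "__main__":
    print(demo())
```

A long-running loop would more likely build the poller once and reuse it across iterations; it is rebuilt on each call here only to keep the sketch short.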

Checklist

@xiezhq-hermann
Collaborator

I think this is neat. Can you make it a sys arg option instead of the default behavior?
@zhyncs @merrymercy @ByronHsu what do you think?

@zacksiri

zacksiri commented May 7, 2025

Looking forward to having this merged.

@p12tic p12tic force-pushed the sleep-on-idle branch 3 times, most recently from 84f405e to f853f9d Compare May 7, 2025 14:50
@p12tic p12tic requested a review from zhaochenyang20 as a code owner May 7, 2025 14:50
@p12tic
Contributor Author

p12tic commented May 7, 2025

@xiezhq-hermann thanks for the review. I added --sleep-on-idle to the "Other runtime options" server options.

@p12tic
Contributor Author

p12tic commented May 13, 2025

@xiezhq-hermann Just a friendly ping :-)

@xiezhq-hermann
Collaborator

The change looks good to me, and I just fixed a minor lint problem.
Thanks a lot for the contribution and sorry for the delay : )

@merrymercy merrymercy added the ready-to-merge label (The PR is ready to merge after the CI is green.) May 16, 2025
@p12tic
Contributor Author

p12tic commented May 28, 2025

@merrymercy Just a friendly ping. This PR has been in ready-to-merge status for 2 weeks.

p12tic added 2 commits June 12, 2025 15:39
In setups which have long inactivity periods it is desirable to reduce
system power consumption when sglang does nothing. This would lead not
only to power savings, but also to more CPU thermal headroom when a
request eventually comes. This is important in cases when multiple GPUs
are connected as each GPU would otherwise pin one thread at 100% CPU
usage.

The simplest solution is to use zmq.Poller on all sockets that may
receive data that needs handling immediately.
@p12tic
Contributor Author

p12tic commented Jun 12, 2025

Rebased.

@p12tic
Contributor Author

p12tic commented Jun 12, 2025

@merrymercy @xiezhq-hermann Just a friendly ping again :) Is there anything I could do to help this PR land? It has been in "approved, ready to merge" status for almost a month now.

@zhyncs zhyncs merged commit bd7cfbd into sgl-project:main Jun 12, 2025
64 of 72 checks passed
@xiezhq-hermann
Collaborator

@p12tic sorry for the delay due to some coordination problem on our side and thanks again for your work and patience : )

Development

Successfully merging this pull request may close these issues.

[Bug] 100% CPU Usage When Idle in sglang
6 participants