-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Add validation for nsys and torch profiling #5004
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add validation for nsys and torch profiling #5004
Conversation
@@ -652,6 +652,14 @@ class ProfileReqInput: | |||
num_steps: Optional[int] = None | |||
activities: Optional[List[Literal["CPU", "GPU", "MEM", "CUDA_PROFILER"]]] = None | |||
|
|||
def __post_init__(self): | |||
if "CUDA_PROFILER" in self.activities and len(self.activities) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: What if self.activities is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
@@ -652,6 +652,14 @@ class ProfileReqInput: | |||
num_steps: Optional[int] = None | |||
activities: Optional[List[Literal["CPU", "GPU", "MEM", "CUDA_PROFILER"]]] = None | |||
|
|||
def __post_init__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would be best to add a unit test to ensure this behavior never regresses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test added.
@@ -652,6 +652,14 @@ class ProfileReqInput: | |||
num_steps: Optional[int] = None | |||
activities: Optional[List[Literal["CPU", "GPU", "MEM", "CUDA_PROFILER"]]] = None | |||
|
|||
def __post_init__(self): | |||
if "CUDA_PROFILER" in self.activities and len(self.activities) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: len(self.activities)>1
may not consider future logic like adding more activities. Thus what about
enable_torch_profiler = ...is there CPU GPU MEM...
enable_nsys = ...is there CUDA_PROFILER...
if enable_nsys and enable_torch_profiler: err
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the change. Please review.
652e79a
to
a70a070
Compare
@@ -30,6 +30,7 @@ class TestFile: | |||
TestFile("test_ebnf_constrained.py"), | |||
TestFile("test_fp8_kernel.py", 8), | |||
TestFile("test_embedding_openai_server.py", 36), | |||
TestFile("test_expert_distribution.py", 31), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was accidentally removed in this commit: 04e3ff6#diff-e5d098bede71ca21f8307897ae33646a035fe3bd1e0cff0aba99e7f8ac033a5fL33
Adding it back here.
nsys_profile = False | ||
if "CUDA_PROFILER" in self.activities: | ||
nsys_profile = True | ||
torch_profile = False | ||
if ( | ||
"CPU" in self.activities | ||
or "GPU" in self.activities | ||
or "MEM" in self.activities | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nsys_profile = "CUDA_PROFILER" in self.activities
torch_profile = (
"CPU" in self.activities
or "GPU" in self.activities
or "MEM" in self.activities
)
@@ -57,6 +58,7 @@ class TestFile: | |||
TestFile("test_skip_tokenizer_init.py", 72), | |||
TestFile("test_srt_engine.py", 237), | |||
TestFile("test_srt_endpoint.py", 94), | |||
TestFile("test_start_profile.py", 22), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this file seems to forget to be commited
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh! I see, I will add it. Thanks!
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Motivation
Torch profiling and nsys profiling are mutual exclusive. When setting up profiling through the
/start_profile
endpoint, the currentactivities
option does not block users from setting both up. Doing so will cause CUPTI error.Modifications
This PR adds a post init validation to make sure only one of the profiling is set.
Checklist