Register Fp4 allgather with NCCL symmetric memory #9358
base: main
Conversation
@@ -1306,7 +1320,15 @@ def apply(
        tune_max_num_tokens=next_power_of_2(x.shape[0]),
    )[0]
    if moe_runner_config.routed_scaling_factor is not None:
        output *= moe_runner_config.routed_scaling_factor
    with use_symmetric_memory(
Yes thanks for catching that, let me do that.
Force-pushed from f8c5f13 to 66b3a8c
    with use_symmetric_memory(
        get_tp_group(), disabled=not is_max_padding()
    ) as sm:
        symm_output = torch.empty_like(x)
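For context, here is a minimal, self-contained sketch of the allocation pattern in the diff above. Only the names inside the `with` block (`use_symmetric_memory`, `get_tp_group`, `is_max_padding`) come from the diff; the import paths, the `max_padding` argument, and the wrapper function are assumptions for illustration, not the actual code in this PR. The diff also binds the context as `sm` for later use, which is omitted here.

```python
# Minimal sketch, assuming the import paths below are correct.
import torch

from sglang.srt.distributed import get_tp_group  # assumed import path
from sglang.srt.distributed.device_communicators.pynccl_allocator import (
    use_symmetric_memory,  # assumed import path
)


def allocate_symm_output(x: torch.Tensor, max_padding: bool) -> torch.Tensor:
    """Allocate an all-gather output buffer backed by NCCL symmetric memory."""
    # Tensors created while this context is active come from the NCCL
    # symmetric-memory allocator, so a later tensor-parallel all-gather on
    # them can take NCCL's symmetric-memory fast path instead of a regular
    # copy-based collective. If ranks are not padded to a uniform token count
    # (max_padding is False), the context is disabled and this falls back to
    # an ordinary allocation.
    with use_symmetric_memory(get_tp_group(), disabled=not max_padding):
        symm_output = torch.empty_like(x)
    return symm_output
```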
When x is quantized this will have the wrong dtype and shape (since the hidden size will be halved). We will want to use output_dtype and maybe store the original x_col.
Yes, I fixed that in #8934. Thanks.
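To make the thread above concrete, here is a hedged sketch of the kind of fix the reviewer is describing: allocate the buffer from the unquantized output dtype and the original (pre-packing) column count rather than mirroring the quantized x. The names `output_dtype` and `x_col` come from the comment; the rest is an assumption for illustration, not the actual change in #8934.

```python
import torch


def allocate_symm_output_for_quantized_x(
    x: torch.Tensor, x_col: int, output_dtype: torch.dtype
) -> torch.Tensor:
    # When x has already been quantized (e.g. packed FP4, two values per byte),
    # x.shape[-1] is half the original hidden size and x.dtype is the packed
    # storage dtype, so torch.empty_like(x) would allocate the wrong buffer.
    # Allocate with the original column count and the unquantized output dtype.
    return torch.empty(
        (*x.shape[:-1], x_col), dtype=output_dtype, device=x.device
    )
```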
Motivation
5% end-to-end performance gain.
Depends on #8934
After this PR:
Before:
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist