[Feat] support mixed ep #2969
Conversation
Thanks for your contribution!
@@ -959,6 +959,11 @@ class at the server level, which is too granular for ModelRunner.
        We plan to replace it with 'ModelForwardBatch'.
        intermediate_tensors:
        """
+       is_decode_batch = paddle.to_tensor(
+           not ((self.share_inputs["seq_lens_this_time"] > 1).sum() > 0)
+       )
+       paddle.distributed.broadcast(is_decode_batch, src=0)
This isn't broadcast logic, and `src` isn't necessarily 0. The semantics here should be: if any rank is False, every rank must be False; the result can be True only when all ranks are True.
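The fix the reviewer describes (True only when every rank agrees) is a logical AND across ranks, which maps to a MIN all-reduce over a 0/1 flag rather than a broadcast from rank 0. A minimal sketch of that semantics; the helper name `combine_decode_flags` is hypothetical, and the commented paddle calls are an assumption about how this would look, not the PR's actual code:

```python
def combine_decode_flags(flags):
    """Simulate a MIN all-reduce over per-rank 0/1 decode flags.

    Returns True only if every rank reports a pure-decode batch,
    which is the AND semantics the reviewer asks for.
    """
    return min(int(f) for f in flags) == 1


# In paddle this would look roughly like (sketch, assumed API usage):
#   flag = paddle.to_tensor(int(is_decode_batch), dtype="int32")
#   paddle.distributed.all_reduce(flag, op=paddle.distributed.ReduceOp.MIN)
#   is_decode_batch = bool(flag.item())

# All ranks in decode -> True everywhere
assert combine_decode_flags([1, 1, 1, 1]) is True
# Any rank still prefilling -> False everywhere
assert combine_decode_flags([1, 0, 1, 1]) is False
```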
            num_max_dispatch_tokens_per_rank,
            None,
            num_experts,
        )
This code was removed earlier, but I found that paddle 3.0.1 still errors here, and the FastDeploy usage docs still recommend paddle 3.0.1 to users, so from a user's perspective let's keep it for now.
This also needs to be adapted for the develop branch; let's make both compatible in the next PR.
OK, will do.
Is the speedup more pronounced at larger batch sizes?
In theory, yes. Single-node EP currently defaults to DeepEP's prefill mode; with mixed EP we can also use DeepEP's low-latency mode for decode, so the speedup is likely more noticeable as batch size grows.
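The routing this reply describes, and that the diff below implements, is per-phase engine selection: prefill batches use the normal DeepEP buffer, decode batches use the low-latency one. A minimal sketch of that dispatch logic; `MoEPhase` mirrors the enum seen in the diff, while `select_deepep_engine` is a hypothetical helper, not FastDeploy's actual API:

```python
from enum import Enum


class MoEPhase(Enum):
    """MoE execution phase, as referenced in the PR diff."""
    PREFILL = "prefill"
    DECODE = "decode"


def select_deepep_engine(phase, prefill_engine, decode_engine):
    """Route a batch to the engine matching its phase: the low-latency
    engine only serves pure-decode batches; everything else (prefill or
    mixed) goes through the prefill engine."""
    if phase is MoEPhase.DECODE:
        return decode_engine
    return prefill_engine


# Decode batches get the low-latency buffer; prefill batches do not.
assert select_deepep_engine(MoEPhase.DECODE, "prefill_buf", "ll_buf") == "ll_buf"
assert select_deepep_engine(MoEPhase.PREFILL, "prefill_buf", "ll_buf") == "prefill_buf"
```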
            num_max_dispatch_tokens_per_rank,
            None,
            num_experts,
        )
This also needs to be adapted for the develop branch; let's make both compatible in the next PR.
-       elif moe_phase == MoEPhase.PREFILL:
-           self.deepep_engine = deep_ep.Buffer(
+       # prefill engine
+       self.prefill_deepep_engine = deep_ep.Buffer(
            self.group,
            int(5e8),
            0,
            low_latency_mode=False,
In mixed mode, shouldn't this `low_latency_mode` be set to `True`?
Support mixed EP.
TPOT, OTPS, and decode speed all improve to some extent across different batch sizes.