
[Feat] support mixed ep #2969


Merged: 10 commits into PaddlePaddle:develop from mixed_ep on Jul 25, 2025

Conversation

@Wanglongzhi2001 (Contributor) commented Jul 22, 2025

Support mixed EP.
TPOT, OTPS, and decode speed all improve to some extent across different batch sizes (bsz).

  • Average input length: 1018
  • Average output length: 384
  • Model: ERNIE-4_5-300B-A47B-FP8-Paddle
  • Paddle version: 3.0.1

paddle-bot (bot) commented Jul 22, 2025

Thanks for your contribution!

@@ -959,6 +959,11 @@ class at the server level, which is too granular for ModelRunner.
We plan to replace it with 'ModelForwardBatch'.
intermediate_tensors:
"""
        is_decode_batch = paddle.to_tensor(
            not ((self.share_inputs["seq_lens_this_time"] > 1).sum() > 0)
        )
        paddle.distributed.broadcast(is_decode_batch, src=0)
Collaborator commented:

This isn't broadcast logic, and src isn't necessarily 0. The semantics needed here: if any rank is False, every other rank must end up False; the result can be True only when all ranks are True.

@Wanglongzhi2001 force-pushed the mixed_ep branch 2 times, most recently from 9a5badc to 34ebf2a on July 24, 2025 10:02
num_max_dispatch_tokens_per_rank,
None,
num_experts,
)
Contributor (author) commented:

This code was removed earlier, but I found that paddle 3.0.1 still throws an error here, and the FastDeploy user docs still recommend paddle 3.0.1, so from the user's point of view I'm keeping it for now.

Collaborator commented:

This also needs adapting to develop; let's make both of these compatible in the next PR.

Contributor (author) commented:

OK.

@gongshaotian (Collaborator) commented:

Is the speedup more pronounced at larger batch sizes?

@Wanglongzhi2001 (author) replied:

> Is the speedup more pronounced at larger batch sizes?

In theory, yes. Single-machine EP currently defaults to DeepEP's prefill (normal) mode; with mixed EP the decode path can also use DeepEP's low-latency mode, so the speedup may be somewhat more pronounced at larger bsz.

@RichardWooSJTU merged commit 0700c90 into PaddlePaddle:develop on Jul 25, 2025 (8 of 9 checks passed).
 elif moe_phase == MoEPhase.PREFILL:
-    self.deepep_engine = deep_ep.Buffer(
+    # prefill engine
+    self.prefill_deepep_engine = deep_ep.Buffer(
         self.group,
         int(5e8),
         0,
         low_latency_mode=False,
Collaborator commented:

In mixed mode, shouldn't this low_latency_mode be set to True?
