
[feat]support inflight quant #3277


Closed
wants to merge 7 commits into from


bukejiyu
Collaborator

@bukejiyu bukejiyu commented Aug 8, 2025

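Adds a loader that supports dynamic (in-flight) quantization. To use it, set load_choices="inflight_quant" or pass --inflight_quant "inflight_quant".
For MoE-family models, load performance improves by roughly 30%.
Supported models: qwen3, qwen3moe.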
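
To illustrate the general idea only, here is a minimal sketch that assumes "in-flight quantization" means quantizing each weight as soon as it is read, so the full-precision checkpoint never has to be resident all at once. It is not the code in this PR: `quantize_int8`, `iter_checkpoint`, and `load_inflight` are hypothetical names, and the symmetric per-row int8 scheme is an assumption chosen for simplicity.

```python
from typing import Dict, Iterator, List, Tuple

Matrix = List[List[float]]
Quantized = Tuple[List[List[int]], List[float]]


def quantize_int8(weight: Matrix) -> Quantized:
    """Symmetric per-row int8 weight-only quantization (illustrative scheme)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max((abs(v) for v in row), default=0.0)
        # An all-zero row gets scale 1.0 so we never divide by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def iter_checkpoint() -> Iterator[Tuple[str, Matrix]]:
    """Hypothetical stand-in for reading tensors one at a time from disk."""
    yield "layer0.weight", [[0.5, -1.0], [0.25, 0.125]]
    yield "layer1.weight", [[2.0, -0.5], [0.0, 0.0]]


def load_inflight() -> Dict[str, Quantized]:
    """Quantize every tensor as it arrives instead of after a full load."""
    state: Dict[str, Quantized] = {}
    for name, weight in iter_checkpoint():
        # Only the quantized copy is kept; the full-precision tensor can be
        # released before the next one is read.
        state[name] = quantize_int8(weight)
    return state


state = load_inflight()
```

Under this assumption, peak memory is bounded by the quantized model plus one full-precision tensor, which is the kind of effect the peak-memory column below would reflect.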

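| Model | Model type | Precision | tpsize | Loader | Peak memory | Load time | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B | qwen3 | wint8 | 4 | default | 4×22.7G | 19.78s | token-by-token aligned |
| | | | | inflight_quant | 4×7.4G | 18.96s | |
| Qwen3-235B-A22B | qwen3moe | wint4 | 4 | default | 116×4G | 368.767s | token-by-token aligned |
| | | | | inflight_quant | 4×9.2G | 261.65s | |
| Qwen3-30B-A3B | qwen3moe | wint4 | 4 | default | | 75.764s | token-by-token aligned |
| | | | | inflight_quant | | 53.599s | |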
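
As a quick check of the "roughly 30%" figure, the snippet below recomputes the relative load-time reduction from the numbers reported above. The helper name `reduction_pct` is introduced here for illustration only.

```python
# Load times in seconds, copied from the results above: (default, inflight_quant).
load_times = {
    "qwen3 32B": (19.78, 18.96),
    "Qwen3-235B-A22B": (368.767, 261.65),
    "Qwen3-30B-A3B": (75.764, 53.599),
}


def reduction_pct(default: float, inflight: float) -> float:
    """Percentage reduction in load time relative to the default loader."""
    return (default - inflight) / default * 100.0


reductions = {
    model: reduction_pct(default, inflight)
    for model, (default, inflight) in load_times.items()
}

for model, pct in reductions.items():
    print(f"{model}: {pct:.1f}% shorter load time")
```

The two qwen3moe models come out at about 29% each, while the dense qwen3 32B model improves by about 4%, which matches the claim being limited to MoE models.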


paddle-bot bot commented Aug 8, 2025

Thanks for your contribution!

@bukejiyu bukejiyu force-pushed the inflight_quant_loader branch from 2508956 to 270e373 August 11, 2025 06:19
@bukejiyu bukejiyu closed this Aug 20, 2025