Add custom op declaration for all_reduce
#3473
base: develop
Conversation
Thanks for your contribution!
.Inputs({"inp", | ||
"out"}) | ||
.Outputs({"new_out"}) | ||
.Attrs({"_fa: int64_t", "_reg_buffer: int64_t", "reg_buffer_sz_bytes: int64_t"}) |
These names feel a bit odd.
True, but that is what it was originally called 🤔
except:
    tensor_model_parallel_all_reduce = None
@paddle.jit.marker.unified
The earlier try was added for RL compatibility; why remove it?
Could you explain why merely def-ing a function could raise an error here? To my limited understanding, this should not cause any problem; if it does, please share the error message. As I understand it, there must be a more appropriate way to handle this.
Was the try/except added because older Paddle versions don't have paddle.jit.marker.unified?
Alternatively, it could be written like this:

if hasattr(paddle.jit, "marker") and hasattr(paddle.jit.marker, "unified"):
    mark_as_unified = paddle.jit.marker.unified
else:
    # do-nothing fallback for PaddlePaddle 3.1 and earlier
    mark_as_unified = lambda fn: fn

tensor_model_parallel_all_reduce = mark_as_unified(tensor_model_parallel_all_reduce)
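With this fallback, mark_as_unified degrades to an identity decorator on Paddle builds that predate paddle.jit.marker.unified, so the function definition itself never needs a try/except and tensor_model_parallel_all_reduce is never silently replaced with None.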
After enabling SOT dynamic-to-static inference for DeepSeek V3, there is a graph break.
Investigation shows that the current all_reduce is only a cpp_extension (the PyCapsule indicates a C++ extension) rather than a Paddle custom op, so a custom op declaration needs to be added for all_reduce.
In addition, since custom ops require Tensor arguments to come first and the other attr arguments after, the argument order of all_reduce was adjusted accordingly.
Also, SOT currently has a problem with dataclasses: attribute assignments made directly like this do not take effect.
So the attributes are set in advance as a workaround; SOT will fix this issue later.
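A minimal sketch of the "set it in advance" pattern, with hypothetical names (ForwardMeta and use_custom_all_reduce are not from this PR): the dataclass field is assigned eagerly, before entering the SOT-traced function, rather than inside it.

from dataclasses import dataclass

import paddle

@dataclass
class ForwardMeta:  # hypothetical config object, not from this PR
    use_custom_all_reduce: bool = False

meta = ForwardMeta()
# Workaround: assign the field eagerly, outside the traced region, because an
# assignment inside the SOT-traced function may not take effect yet.
meta.use_custom_all_reduce = True

@paddle.jit.to_static
def forward(x):
    # Only read the dataclass field inside the traced function; no writes here.
    if meta.use_custom_all_reduce:
        return x * 2
    return x + 1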
cc @SigureMo @zyfncg