
[Excutor] Increase buffer size to prevent address corruption; add forward metadata debug tool #3404


Merged
merged 8 commits into from
Aug 18, 2025


littledgg
Contributor

@littledgg littledgg commented Aug 14, 2025

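0. Background: when debugging, we often need to inspect ForwardMeta to help diagnose problems, so this PR adds a tool that prints ForwardMeta and makes it easy to see addresses, shapes, device types, and so on. With a small change it can also print the contents of the Tensors in ForwardMetadata directly. It handles nested cases, such as lists that contain Tensors and member variables whose own members contain Tensors.
The following is a sample of the printed output:
[image: sample printed output]
To cover the new functionality, a unit test, test/model_executor/test_forward_meta_str.py, was added. The dependency partial_json_parser was also added to scripts/unittest_requirement.txt for the unit test environment; some existing tests were already failing because they depend on this package and it was missing from that environment.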
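
The nested traversal described above can be sketched roughly as follows. This is a minimal illustration, not the code added in this PR: `FakeTensor`, `FakeAttnMeta`, `FakeForwardMeta`, and `summarize` are hypothetical names, and the tensor is a plain dataclass stand-in so that the example runs without Paddle.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, List, Optional


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a framework tensor; not a real Paddle class."""
    shape: tuple
    dtype: str
    place: str
    ptr: int  # pretend device address


@dataclass
class FakeAttnMeta:
    """Hypothetical nested metadata object holding a tensor."""
    kv_num_blocks: FakeTensor


@dataclass
class FakeForwardMeta:
    """Hypothetical metadata with a tensor, a list of tensors, and a nested member."""
    cu_seqlens_q: FakeTensor
    caches: List[FakeTensor]
    attn: Optional[FakeAttnMeta] = None


def summarize(obj: Any) -> Any:
    """Recursively turn an object into plain dicts, lists, and strings."""
    # Tensors are checked first because FakeTensor is itself a dataclass.
    if isinstance(obj, FakeTensor):
        return (f"Tensor(shape={obj.shape}, dtype={obj.dtype}, "
                f"place={obj.place}, ptr={hex(obj.ptr)})")
    if isinstance(obj, (list, tuple)):
        return [summarize(item) for item in obj]
    if isinstance(obj, dict):
        return {key: summarize(value) for key, value in obj.items()}
    if is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: summarize(getattr(obj, f.name)) for f in fields(obj)}
    return obj  # ints, strings, None, and anything else are returned unchanged


meta = FakeForwardMeta(
    cu_seqlens_q=FakeTensor((5,), "int32", "gpu:0", 0x1000),
    caches=[FakeTensor((2, 4), "float16", "gpu:0", 0x2000)],
    attn=FakeAttnMeta(FakeTensor((1,), "int32", "cpu", 0x3000)),
)
summary = summarize(meta)
print(json.dumps(summary, indent=2))
```

Returning plain containers rather than a preformatted string keeps the traversal separate from the layout, so the same summary can be pretty-printed, diffed between steps, or asserted on in a test.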

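1. The ForwardMetadata variables batch_id_per_token, cu_seqlens_q, and cu_seqlens_k use copy_ to keep their addresses fixed. However, the buffers were allocated too small, so copying in a variable with a larger shape triggers a reallocation, which defeats the purpose of using copy_ to keep the address fixed. This is fixed by allocating larger buffers up front.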
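
The effect in item 1 can be illustrated with a toy model. This sketch is an assumption-laden simulation, not Paddle code: `ToyBuffer` is a hypothetical class, its `copy_` method only imitates the behaviour described above (write in place when the data fits, reallocate otherwise), and `id()` of the backing list stands in for a device address.

```python
class ToyBuffer:
    """Hypothetical buffer whose backing list plays the role of device memory."""

    def __init__(self, capacity: int) -> None:
        self.storage = [0] * capacity
        self.length = 0

    @property
    def address(self) -> int:
        # id() of the backing list is used here as a pretend address.
        return id(self.storage)

    def copy_(self, values: list) -> None:
        if len(values) > len(self.storage):
            # Buffer too small: new storage is allocated and the address changes.
            self.storage = list(values)
        else:
            # Data fits: write in place so the address stays the same.
            self.storage[:len(values)] = values
        self.length = len(values)


# Undersized buffer: the first larger copy moves the data.
small = ToyBuffer(capacity=4)
small_before = small.address
small.copy_(list(range(8)))
small_moved = small.address != small_before

# Buffer sized for the largest expected input: the address is stable.
large = ToyBuffer(capacity=16)
large_before = large.address
large.copy_(list(range(8)))
large.copy_(list(range(12)))
large_stable = large.address == large_before

print(small_moved, large_stable)
```

A captured graph replays kernels against recorded addresses, which is why a stable address matters more here than saving the extra memory.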

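2. The ForwardMetadata variable kv_num_blocks is a CPU tensor during prefill but a GPU tensor during decode. The CUDA layer explicitly treats it as a CPU tensor; the cause turned out to be an incorrect place setting in the decode branch, so the decode branch is changed to produce a CPU tensor as well. This has not affected correctness so far, but any device-sensitive operation on kv_num_blocks in the Python layer could crash the program in a way that is hard to diagnose.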
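
One way to catch this kind of mismatch early is an explicit place check before any device-sensitive operation. The sketch below is illustrative only: `FakeTensor`, `build_kv_num_blocks`, and `require_place` are hypothetical names, and the place is modelled as a plain string rather than a real Paddle place object.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; only the name and place are modelled."""
    name: str
    place: str  # "cpu" or "gpu"


def build_kv_num_blocks(phase: str) -> FakeTensor:
    """Create the tensor with the same place in both branches."""
    if phase not in ("prefill", "decode"):
        raise ValueError(f"unknown phase: {phase}")
    # Both branches use "cpu"; the bug described above corresponds to the
    # decode branch using a different place than the prefill branch.
    return FakeTensor(name="kv_num_blocks", place="cpu")


def require_place(tensor: FakeTensor, expected: str) -> FakeTensor:
    """Fail with a clear message instead of crashing later in a kernel."""
    if tensor.place != expected:
        raise RuntimeError(
            f"{tensor.name} is on {tensor.place}, expected {expected}"
        )
    return tensor


places = {phase: build_kv_num_blocks(phase).place for phase in ("prefill", "decode")}
checked = require_place(build_kv_num_blocks("decode"), "cpu")
print(places)
```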

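3. The following variables in AppendAttentionMetadata

metadata.encoder_batch_ids,
metadata.encoder_tile_ids_per_batch,
metadata.encoder_num_blocks,
metadata.kv_batch_ids,
metadata.kv_tile_ids_per_batch,
metadata.kv_num_blocks,
metadata.max_len_kv,

also have changing addresses, and some of them are tied to the prefill stage. This needs to be fixed to support capturing prefill in Cudagraph in the future. The code is already implemented and is expected to be merged together with the modified Cudagraph.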


paddle-bot bot commented Aug 14, 2025

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Aug 14, 2025
gongshaotian
gongshaotian previously approved these changes Aug 15, 2025
Collaborator

@gongshaotian gongshaotian left a comment


LGTM

yuanlehome
yuanlehome previously approved these changes Aug 15, 2025
@littledgg littledgg dismissed stale reviews from yuanlehome and gongshaotian via 9d363a9 August 15, 2025 08:01
Collaborator

@gongshaotian gongshaotian left a comment


LGTM

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit ea4a3b4 into PaddlePaddle:develop Aug 18, 2025
12 of 15 checks passed
Labels
contributor External developers

5 participants