[Executor] Experiment: Support Prefill in CUDA Graph #3459
Conversation
Thanks for your contribution!
self.share_inputs["encoder_batch_ids"] = paddle.full( | ||
shape=[self.max_seq_len], fill_value=0, dtype="int32" | ||
) # gpu | ||
self.share_inputs["encoder_tile_ids_per_batch"] = paddle.full( | ||
shape=[self.max_seq_len], fill_value=0, dtype="int32" | ||
) # gpu | ||
self.share_inputs["encoder_num_blocks"] = paddle.full(shape=[1], fill_value=0, dtype="int32").cpu() # cpu |
Manage these buffers in gpu_model_runner instead, and also rework the get_block_shape_and_split_kv kernel so that the encoder-related tensors are produced in place; otherwise the copy_ overhead in pre-processing is too high.
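A minimal sketch of the direction suggested here, assuming a hypothetical in-place producer (the real get_block_shape_and_split_kv kernel and its signature are not shown in this excerpt):

import paddle

# Illustration only: preallocate the buffer once in gpu_model_runner and let the
# producer write into it in place, so no per-step copy_ is needed afterwards.
encoder_batch_ids = paddle.full(shape=[64], fill_value=0, dtype="int32")

def fill_encoder_batch_ids_(out: paddle.Tensor, seq_lens_encoder: paddle.Tensor) -> None:
    # Stand-in for an in-place variant of get_block_shape_and_split_kv: results are
    # written directly into `out` instead of being returned as a fresh tensor.
    num = seq_lens_encoder.shape[0]
    out[:num] = paddle.arange(num, dtype="int32")

seq_lens_encoder = paddle.to_tensor([80, 80, 80], dtype="int32")
fill_encoder_batch_ids_(encoder_batch_ids, seq_lens_encoder)  # no temp tensor, no copy_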
For now I've pushed a draft that hasn't been verified yet, in case the server goes down and the code is lost.
self.share_inputs["encoder_batch_ids"].copy_(temp_encoder_batch_ids, False) | ||
metadata.encoder_batch_ids = self.share_inputs["encoder_batch_ids"] | ||
|
||
self.share_inputs["encoder_tile_ids_per_batch"].copy_(temp_encoder_tile_ids_per_batch, False) | ||
metadata.encoder_tile_ids_per_batch = self.share_inputs["encoder_tile_ids_per_batch"] | ||
|
||
self.share_inputs["encoder_num_blocks"].copy_(temp_encoder_num_blocks, False) | ||
metadata.encoder_num_blocks = self.share_inputs["encoder_num_blocks"] | ||
|
||
self.share_inputs["kv_batch_ids"].copy_(temp_kv_batch_ids, False) | ||
metadata.kv_batch_ids = self.share_inputs["kv_batch_ids"] | ||
|
||
self.share_inputs["kv_tile_ids_per_batch"].copy_(temp_kv_tile_ids_per_batch, False) | ||
metadata.kv_tile_ids_per_batch = self.share_inputs["kv_tile_ids_per_batch"] | ||
|
||
self.share_inputs["kv_num_blocks"].copy_(temp_kv_num_blocks, False) | ||
metadata.kv_num_blocks = self.share_inputs["kv_num_blocks"] | ||
|
||
self.share_inputs["max_len_kv"].copy_(temp_max_len_kv, False) | ||
metadata.max_len_kv = self.share_inputs["max_len_kv"] | ||
|
The copy_ overhead is too high.
if int(paddle.max(self.share_inputs["seq_lens_decoder"])) > 0:
    return 1
else:
    return 0
Write this as one line; it looks a bit ugly.
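For example, the check could collapse to something like this (the function name here is just for illustration):

def exist_decode_flag(self) -> int:
    # Single-expression form of the if/else above, same behavior.
    return 1 if int(paddle.max(self.share_inputs["seq_lens_decoder"])) > 0 else 0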
@@ -561,7 +571,7 @@ def _dummy_prefill_inputs(self, num_tokens: int, batch_size: int, expected_decod
         if self.fd_config.parallel_config.enable_expert_parallel:
             full_length = min(full_length, 32)

-        input_length = int(full_length * self.cache_config.kv_cache_ratio)
+        input_length = int(full_length)
Delete this line.
only_decode_use_cudagraph = (
    self.use_cudagraph
    and only_decode_batch
    and not (prefill_exists if prefill_exists is not None else self.exist_prefill())
)

# Update Batch type for cuda graph for only_prefill_batch
only_prefill_batch = True
decode_exists = None
if self.fd_config.parallel_config.use_ep and self.fd_config.parallel_config.splitwise_role == "mixed":
    # Collect the status of all workers
    only_prefill_batch_list = []
    decode_exists = self.exist_decode()
    paddle.distributed.all_gather_object(only_prefill_batch_list, not decode_exists)
    only_prefill_batch = all(only_prefill_batch_list)

only_prefill_use_cudagraph = (
    self.use_cudagraph
    and self.cudagraph_only_prefill
    and only_prefill_batch
    and not (decode_exists if decode_exists is not None else self.exist_decode())
)

# When support capture both prefill-only and decode-only, this will use [only_prefill_use_cudagraph or only_decode_use_cudagraph]
self.forward_meta.step_use_cudagraph = (
    only_prefill_use_cudagraph if self.cudagraph_only_prefill else only_decode_use_cudagraph
)
There are too many branches here; can the only-decode and only-prefill paths be merged or encapsulated?
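One possible shape for such an encapsulation, as a sketch only (the helper name and exact condition grouping are assumptions, not the PR's code):

def _step_use_cudagraph(self, only_decode_batch: bool, only_prefill_batch: bool,
                        prefill_exists=None, decode_exists=None) -> bool:
    # Single decision point for whether this step can replay a captured graph.
    if not self.use_cudagraph:
        return False
    if self.cudagraph_only_prefill:
        # Prefill-only capture: the whole batch must be in prefill.
        return only_prefill_batch and not (
            decode_exists if decode_exists is not None else self.exist_decode()
        )
    # Decode-only capture: the whole batch must be in decode.
    return only_decode_batch and not (
        prefill_exists if prefill_exists is not None else self.exist_prefill()
    )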
@@ -1230,7 +1262,7 @@ def _update_chunked_prefill(self, tasks):
                 self.proposer.update_task_chunk_prefill(task)
             task.chunk_idx += 1

-    def capture_model(self) -> None:
+    def capture_model(self, capture_prefill: bool = False) -> None:
The name capture_prefill is ambiguous -> use only_prefill or PurePrefill.
logger.info(
    f"Warm up the model with the batch size:{batch_size}, expected_decode_len:{expected_decode_len}"
)
The logger message wasn't updated.
    for num_tokens in sorted(capture_sizes, reverse=True):
        self._dummy_run(
            num_tokens=num_tokens,
            batch_size=1,
            in_capturing=True,
            expected_decode_len=expected_decode_len,
        )
        logger.info(
            f"Warm up the model with the num_tokens:{num_tokens}, expected_decode_len:{expected_decode_len}"
        )
else:
    for batch_size in sorted(capture_sizes, reverse=True):
        self._dummy_run(
            num_tokens=self.parallel_config.max_num_batched_tokens,
            batch_size=batch_size,
            in_capturing=True,
            expected_decode_len=expected_decode_len,
        )
Merge these two loops; only batch_size differs.
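A possible merged form, sketched under the assumption that a flag such as capture_prefill_only selects which role the capture size plays:

for capture_size in sorted(capture_sizes, reverse=True):
    # In prefill capture the size is the token count; in decode capture it is the batch size.
    num_tokens = capture_size if capture_prefill_only else self.parallel_config.max_num_batched_tokens
    batch_size = 1 if capture_prefill_only else capture_size
    self._dummy_run(
        num_tokens=num_tokens,
        batch_size=batch_size,
        in_capturing=True,
        expected_decode_len=expected_decode_len,
    )
    logger.info(
        f"Warm up the model with num_tokens:{num_tokens}, batch_size:{batch_size}, "
        f"expected_decode_len:{expected_decode_len}"
    )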
print("传递给model的seq_lens_this_time", self.forward_meta.seq_lens_this_time) | ||
print("input_ids", self.forward_meta.input_ids.shape) | ||
print("self.share_inputs[ids_remove_padding].shape:", self.share_inputs["ids_remove_padding"].shape) |
The print statements weren't removed.
This PR currently supports prefill-only batches entering CUDA Graph. Until it is confirmed that the graphs can be shared, you must choose to capture either decode-only or prefill-only, not both.
1. To enable it, launch with the parameters below; the key point is that both use_cudagraph and cudagraph_only_prefill are set to True.
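The original parameter list is omitted above; purely as an illustration of the stated requirement (the grouping name below is an assumption), the two flags would look like:

# Illustrative only: the key point from the description is that both flags are True.
graph_optimization_config = {
    "use_cudagraph": True,            # enable CUDA Graph capture/replay
    "cudagraph_only_prefill": True,   # capture prefill-only batches instead of decode-only
}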
2. Under the current dynamic-insertion behavior, suppose 4 prompts of 80 tokens each are sent: seq_lens_this_time is [80] in the first step and [1, 80, 80, 80] in the second. Clearly only the first step is pure prefill and can enter CUDA Graph; the second step is mixed and cannot. This can be worked around in fastdeploy/engine/engine.py by modifying the function _insert_task_to_worker, changing
to
With that change the dynamic-insertion logic is effectively disabled: the engine waits until 8 prompts have arrived (the number can be changed), and those 8 prompts then enter prefill together (pure-prefill speedup across multiple prompts) and decode together.
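The actual before/after code is not included above; conceptually the edit gates insertion so that a fixed number of prompts enters prefill in the same step, roughly like this hypothetical sketch (names and threshold handling are assumptions):

REQUIRED_PROMPTS = 8  # the description notes this number can be changed

def _ready_to_dispatch(pending_tasks) -> bool:
    # Hold back insertion until enough prompts have accumulated, so they all
    # run pure prefill together and then decode together.
    return len(pending_tasks) >= REQUIRED_PROMPTS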
3. In init_with_cudagrpah_size in fastdeploy/config.py, 512 is the maximum capture size when capturing prefill; it can be changed manually.
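For instance, the effect of that limit on the prefill capture list can be pictured as follows (the candidate values below are examples, not the real list):

MAX_PREFILL_CAPTURE_SIZE = 512  # per the description, adjustable in init_with_cudagrpah_size
candidate_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]  # example values only
prefill_capture_sizes = [s for s in candidate_sizes if s <= MAX_PREFILL_CAPTURE_SIZE]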
TODO: the appropriate value of buffer_size still needs to be confirmed.