[Feature] : add mm input item cache support #8742
Conversation
Summary of Changes
Hello @kousakawang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant optimization for multimodal input processing, particularly for image data used with the Qwen2-image-processor. It implements a caching mechanism that stores already processed image features, allowing the system to avoid re-processing and re-transferring redundant data. This enhancement is designed to improve efficiency in AI agent scenarios where multiple images might be provided, but only a subset are truly new, by intelligently managing and reconstructing the full input from a mix of cached and fresh data.
Highlights
- Image Data Caching: Introduces a FIFO (First-In, First-Out) cache for processed image features, specifically optimized for the Qwen2-image-processor, to prevent redundant processing of identical images.
- Efficient Data Handling: Modifies multimodal data processing to send only hash keys for cached images and actual tensor data for new images, significantly reducing data transfer overhead between components.
- Dynamic Input Reconstruction: Enables the system to intelligently reconstruct full multimodal inputs by combining cached image features with newly processed data, based on flags and hash keys, ensuring the model receives complete input.
- Token Stream Adaptation: Adjusts the input token sequence to correctly represent both newly processed and cached images within the model's input stream, maintaining proper context for the language model.
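The reconstruction flow the highlights describe can be sketched as follows. This is a minimal illustration only: the function name `reconstruct_features` and the exact shapes of `hash_keys`, `use_cache_mark`, and `new_features` mirror the PR's flags but are not the actual implementation.

```python
def reconstruct_features(hash_keys, use_cache_mark, new_features, cache):
    """Rebuild the full per-image feature list from a mix of cached and
    fresh features.

    hash_keys[i]      -- content hash identifying image i
    use_cache_mark[i] -- True if image i should be served from the cache
    new_features      -- freshly processed features for non-cached images,
                         in original image order
    cache             -- mapping from hash key to processed feature
    """
    out, fresh = [], iter(new_features)
    for key, cached in zip(hash_keys, use_cache_mark):
        if cached:
            out.append(cache[key])      # reuse the already-processed feature
        else:
            feat = next(fresh)
            cache[key] = feat           # populate the cache for next time
            out.append(feat)
    return out
```

The point of the marks and keys is that the model always receives a complete, correctly ordered input even though only a subset of images was actually processed this request.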
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces a caching mechanism for multimodal inputs to avoid reprocessing images, which is a great addition for performance in certain use cases. The implementation is a good start, but I've found a few critical and high-severity issues related to correctness and potential runtime errors that need to be addressed. These include an uninitialized cache attribute and bugs in the utility functions for cache management and string operations. I've also left some comments on maintainability and code clarity. Please review the detailed comments.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…/sglang into add_mm_item_cache
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
cached_tensor = self.hash_table.get(img_hash_keys[img_idx])
assert isinstance(
    cached_tensor, torch.Tensor
), "invalid cached_tensor"
# tensor_lists.append(cached_tensor)
Is there any reason why self.hash_table's value is not used here, while the value from mminput_cache is eventually used instead? It seems self.hash_table alone would suffice.
As this comment https://github.com/sgl-project/sglang/pull/8742/files#r2284939106 said:
mminput_cache is managed by the scheduler subprocess, and self.hash_table is managed by the tokenizer subprocess. Inter-process communication is done through IPC (each process uses a different port) via the ZMQ library.
So we cache the image data in the tokenizer process to:
- avoid processing the same image again,
- and also save the data-send cost between the tokenizer process and the scheduler process.
To avoid sending the pixel values (a torch.Tensor) again, we also need to establish an mm_cache in the scheduler process (and this can be a GPU tensor cache).
By doing so, for a normal image the processing route is GPU -> CPU -> send via ZMQ -> CPU -> GPU;
for a cached image, we can process it via raw image or base64 -> send hash -> get the GPU tensor, avoiding sending any real pixel values or doing any D2H/H2D copies.
Actually, I think an IPC handle is a better way to avoid sending data and instead share GPU memory between different processes (in that case, a cache is not necessary).
But since getting an IPC handle does not increase the ref count of a torch tensor, without keeping all data alive somewhere (in a global container or something), accessing a tensor via IPC handle in the scheduler process carries some risk. (Also, an IPC handle only avoids the send cost, not the image-processing cost.)
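A minimal sketch of the hash-then-send idea described here. The helper names and the tuple message format are illustrative only, not the PR's actual ZMQ payloads; `sent_keys` stands in for the tokenizer's view of what the scheduler already holds.

```python
import hashlib

def image_hash(image_bytes: bytes) -> int:
    # stable content hash for an image payload
    return int.from_bytes(hashlib.sha256(image_bytes).digest(), byteorder="big")

def build_message(images, sent_keys):
    """For each image, send the raw payload only the first time it is
    seen; afterwards, send just the hash key so the receiver can look
    the feature up in its own cache."""
    msg = []
    for img in images:
        key = image_hash(img)
        if key in sent_keys:
            msg.append(("hash", key))        # cheap: no pixel data crosses ZMQ
        else:
            sent_keys.add(key)
            msg.append(("data", key, img))   # full payload, sent once
    return msg
```

In the PR the payload would be the processed pixel-value tensor rather than raw bytes, and the scheduler-side cache would hold GPU tensors, but the send/skip decision follows the same shape.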
@@ -1290,6 +1292,40 @@ def prepare_for_extend(self):
pixel_values = getattr(mm_item, "feature", None)
if isinstance(pixel_values, torch.Tensor):
    mm_item.feature = pixel_values.to(self.device, non_blocking=True)
elif isinstance(pixel_values, dict):
Is there any specific reason why this logic cannot be moved over into base_processor.py?
I think in base_processor.py we just load the image (raw or base64) and convert it to a tensor (which the model will accept as input); this happens in the tokenizer process. After that, the tokenizer process sends the data to the scheduler process using ZMQ. As a result, if we only cache data in the tokenizer process, the scheduler cannot get the correct pixel values.
    new_processed_img_idxes.append(img_idx)
    new_processed_imgs.append(images[img_idx])
else:
    remove_image_idx.append(img_idx)
Is there any specific reason why cache is used only once? (not sure if i understood correctly)
Not sure I understand your question correctly. If you mean why we do "remove_image_idx.append(img_idx)" here: this records which images were already cached, and the corresponding image tokens are then removed from the input text to avoid an image-data processing error (the input text should hold the same number of images as len(new_processed_imgs)).
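The placeholder-removal step described here can be sketched roughly like this. The `<image>` placeholder string and the helper name are illustrative assumptions; Qwen models use their own placeholder tokens, and the PR operates on token ids rather than raw text.

```python
def drop_cached_placeholders(text, cached_idxes, placeholder="<image>"):
    """Remove the placeholder occurrences that correspond to cached
    images, so the HF processor sees exactly as many placeholders as
    there are freshly processed images."""
    parts = text.split(placeholder)
    # occurrence i of the placeholder sits between parts[i] and parts[i + 1]
    out = parts[0]
    for i in range(1, len(parts)):
        if (i - 1) not in cached_idxes:
            out += placeholder          # keep: this image is new
        out += parts[i]
    return out
```

After processing, the cached images' token spans are re-inserted into the token stream (see the insert_cached_ids logic below in the diff), so the model still sees every image.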
proxy_pixel_values = {}
proxy_pixel_values["send_data"] = send_pixel_values
proxy_pixel_values["use_cache_mark"] = use_cache_mark
proxy_pixel_values["hash_keys"] = img_hash_keys
proxy_pixel_values["img_heights"] = processed_img_heights
proxy_pixel_values["delete_keys"] = delete_keys
Similar to this comment, I'm wondering why we need proxy_pixel_values and can't set pixel_values correctly within base_processor.py.
This is because we cannot modify result["pixel_values"], which is produced by the transformers processor.
# tensor_lists.append(cached_tensor)
insert_cached_ids = (
    [v_start_token]
    + img_token_nums[img_idx] * [img_pad_token]
I'm wondering if there is any result['xx'] field that can be used directly to obtain img_token_num instead of calculating it manually.
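For reference, a hedged sketch of deriving the count from the processor output rather than computing it from pixel heights: if I understand Qwen2-VL's processor correctly, result['image_grid_thw'] gives a (t, h, w) triple per image, and the pad-token count is their product divided by the spatial merge factor squared. The merge factor of 2 and the helper name are my assumptions; verify against the actual processor config before relying on this.

```python
def img_token_num_from_grid(grid_thw, merge_size=2):
    """Derive the per-image pad-token count from the processor's
    image_grid_thw entry (t, h, w) instead of recomputing it from
    pixel_values row counts. merge_size=2 assumes Qwen2-VL's default
    spatial merge factor."""
    t, h, w = grid_thw
    return (t * h * w) // (merge_size ** 2)

# illustrative token ids, not the real Qwen special-token ids
V_START, V_PAD, V_END = 1, 2, 3
ids = [V_START] + img_token_num_from_grid((1, 4, 6)) * [V_PAD] + [V_END]
```

This mirrors the insert_cached_ids construction in the diff above, with the token count read off the processor output.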
start_height = end_height
end_height = start_height + img_height
to_cache_tensor = result["pixel_values"][start_height:end_height]
Similar comment as https://github.com/sgl-project/sglang/pull/8742/files#r2284154152:
Is there any result['xx'] field to use instead of calculating img_height/start_height/end_height manually?
Yes, I think we can store the height of each image in a struct and use it directly the next time. But for a new image we still need to calculate it, because each image has a different size and we cannot use a fixed value. (The more important reason is that we cannot modify the logic in processor.__call__, which is implemented in transformers, not sglang...)
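Storing per-image row counts and slicing by them, as suggested, might look like the sketch below. The helper name is illustrative; in the PR the recorded heights travel in proxy_pixel_values["img_heights"].

```python
def slice_by_heights(pixel_values, img_heights):
    """Split the row-concatenated pixel_values back into per-image
    chunks using recorded row counts, instead of re-deriving
    start_height/end_height offsets each time."""
    chunks, start = [], 0
    for h in img_heights:
        chunks.append(pixel_values[start:start + h])
        start += h
    return chunks
```

The same slicing works on a torch.Tensor, since basic `[start:end]` row slicing has identical semantics there.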
):
    result[feature_name] = result[feature_name].to("cpu")

if cache_mm_image_items and is_qwen2_processor:
Is this logic universal to most VLMs? (It seems not, from my limited understanding.)
If not, could you please move this logic into the Qwen VL processor's adapter, similar to what we did for phi4-mm? That way this model-specific code can be moved over to model.py instead of living in common code paths.
Yes, you are right, this logic is just for Qwen2VLImageProcessorFast; I will try to move the code to qwen2.
return int.from_bytes(hash_bytes, byteorder="big")


class FIFOHashTable:
Similar comment as https://github.com/sgl-project/sglang/pull/8742/files#r2284122880:
If a cache entry can be used more than once, we should consider @lru_cache, I think.
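For contrast with @lru_cache: a minimal FIFO table along the lines of the PR's FIFOHashTable might look like the sketch below (the real class's API may differ). The key design difference is that eviction follows insertion order rather than recency of use, and entries can be deleted externally, e.g. driven by the "delete_keys" the tokenizer sends, which functools.lru_cache does not support.

```python
from collections import OrderedDict

class FIFOHashTable:
    """FIFO cache: evicts the oldest-inserted entry once capacity is
    exceeded. get() does not refresh an entry's position (unlike LRU)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._table = OrderedDict()

    def get(self, key, default=None):
        return self._table.get(key, default)

    def put(self, key, value):
        if key in self._table:
            self._table[key] = value        # update in place, keep position
            return
        if len(self._table) >= self.capacity:
            self._table.popitem(last=False)  # evict oldest insertion
        self._table[key] = value

    def delete(self, key):
        self._table.pop(key, None)           # external eviction by key

    def __contains__(self, key):
        return key in self._table
```

FIFO also keeps the tokenizer-side and scheduler-side caches easy to mirror: both evict deterministically in insertion order, so the hash keys one side holds stay predictable to the other.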
Motivation
Modifications
related to issue: #8433
Add an image input data cache for the qwen2-image-processor
If an image has already been processed, it will not be processed again; instead of sending raw torch tensor data, a hash key is sent.
This feature is primarily intended for AI-agent input scenarios, where multiple images are typically provided as input but only one of them is new data.
Set the environment variable SGL_CACHE_MM_IMAGE=1 to enable this feature.
Note: this feature should not be enabled in scenarios where only a single image is input each time and the duplication rate across input images is very low.
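A sketch of how the environment-variable gate might be read. The variable name SGL_CACHE_MM_IMAGE comes from the PR description; the helper name and the "0" default are illustrative assumptions.

```python
import os

def mm_image_cache_enabled() -> bool:
    """Feature gate: the mm image cache is opt-in and off by default,
    since it only pays off when inputs repeat images across requests."""
    return os.environ.get("SGL_CACHE_MM_IMAGE", "0") == "1"
```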
Benchmark & Profiling
Unit test for this feature:
python3 -m unittest test_vision_openai_server_a.TestQwen2_5_AI_AGENT.test_ai_agent_opt
test_result:
Accuracy: (no change with the feature on/off)
Action_Summary: yes, I <--> Action_Summary: yes, I
Performance:
opt: 8226.41 ms
normal: 9450.75 ms
Checklist