
Conversation

@kousakawang kousakawang commented Aug 4, 2025

Motivation

Modifications

Related to issue #8433.

Add an image input data cache for the qwen2-image-processor.
If an image has already been processed, it is not processed again; instead of sending the raw torch tensor data, only a hash key is sent.
This feature is primarily intended for AI agent input scenarios, where multiple images are typically provided as input but only one of them is new data.

Set the environment variable SGL_CACHE_MM_IMAGE=1 to enable this feature.
Note: this feature should not be enabled in scenarios where only a single image is input each time and the duplication rate across input images is very low.
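The core idea can be sketched as follows. This is an illustrative sketch only, with hypothetical names (`ImageFeatureCache`, `key_for`), not the PR's actual code: hash the raw image bytes to a stable key, and reuse the processed feature when the key is already present.

```python
import hashlib

# Hypothetical sketch of the caching idea described above;
# class and method names are illustrative, not the PR's identifiers.
class ImageFeatureCache:
    def __init__(self):
        self._store = {}  # hash key -> processed feature (e.g. a tensor)

    @staticmethod
    def key_for(image_bytes: bytes) -> str:
        # Hash the raw image bytes to get a stable cache key.
        return hashlib.sha256(image_bytes).hexdigest()

    def lookup(self, image_bytes: bytes):
        # Returns the cached feature, or None on a miss.
        return self._store.get(self.key_for(image_bytes))

    def insert(self, image_bytes: bytes, feature) -> str:
        key = self.key_for(image_bytes)
        self._store[key] = feature
        return key
```

On a hit, only the short hash key needs to cross the process boundary instead of the full pixel tensor.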

Benchmark & Profiling

Unit test for this feature:

python3 -m unittest test_vision_openai_server_a.TestQwen2_5_AI_AGENT.test_ai_agent_opt

Test results:

Accuracy (no change with the feature on/off):
Action_Summary: yes, I <--> Action_Summary: yes, I

Performance:
opt: 8226.41 ms
normal: 9450.75 ms
(≈13% latency reduction with the optimization enabled)

Checklist

@gemini-code-assist bot left a comment

Summary of Changes

Hello @kousakawang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization for multimodal input processing, particularly for image data used with the Qwen2-image-processor. It implements a caching mechanism that stores already processed image features, allowing the system to avoid re-processing and re-transferring redundant data. This enhancement is designed to improve efficiency in AI agent scenarios where multiple images might be provided, but only a subset are truly new, by intelligently managing and reconstructing the full input from a mix of cached and fresh data.

Highlights

  • Image Data Caching: Introduces a FIFO (First-In, First-Out) cache for processed image features, specifically optimized for the Qwen2-image-processor, to prevent redundant processing of identical images.
  • Efficient Data Handling: Modifies multimodal data processing to send only hash keys for cached images and actual tensor data for new images, significantly reducing data transfer overhead between components.
  • Dynamic Input Reconstruction: Enables the system to intelligently reconstruct full multimodal inputs by combining cached image features with newly processed data, based on flags and hash keys, ensuring the model receives complete input.
  • Token Stream Adaptation: Adjusts the input token sequence to correctly represent both newly processed and cached images within the model's input stream, maintaining proper context for the language model.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a caching mechanism for multimodal inputs to avoid reprocessing images, which is a great addition for performance in certain use cases. The implementation is a good start, but I've found a few critical and high-severity issues related to correctness and potential runtime errors that need to be addressed. These include an uninitialized cache attribute, and bugs in the utility functions for cache management and string operations. I've also left some comments on maintainability and code clarity. Please review the detailed comments.

@JustinTong0323 JustinTong0323 self-assigned this Aug 5, 2025

Comment on lines +457 to +461
cached_tensor = self.hash_table.get(img_hash_keys[img_idx])
assert isinstance(
cached_tensor, torch.Tensor
), "invalid cached_tensor"
# tensor_lists.append(cached_tensor)
Reviewer comment:

Is there any reason why self.hash_table's value is not used but eventually use the value from mminput_cache?

It seems self.hash_table alone would suffice

@kousakawang (Author) replied:

As noted in this comment (https://github.com/sgl-project/sglang/pull/8742/files#r2284939106):
mminput_cache is managed by the scheduler subprocess,
while self.hash_table is managed by the tokenizer subprocess.
Inter-process communication is done through IPC (each process uses a different port) via the ZMQ library.
So we cache the image data in the tokenizer process in order to:

  1. avoid processing the same image again, and
  2. save the data-transfer cost between the tokenizer process and the scheduler process.

To avoid sending the pixel values (a torch.Tensor) again, we also need to establish mm_cache in the scheduler process (and this can be a GPU tensor cache).
By doing so, for a normal image the processing route is GPU -> CPU -> send via ZMQ -> CPU -> GPU;
for a cached image, the route is raw image or base64 -> send hash -> get GPU tensor, avoiding sending any real pixel values and any D2H/H2D copies.

Actually, I think an IPC handle would be a better way to avoid sending data, by sharing GPU memory between different processes (in that case the cache would not be necessary).
But since obtaining an IPC handle does not increase the reference count of a torch tensor, accessing a tensor via an IPC handle in the scheduler process carries some risk unless all data is kept alive (by a global container or something). Also, an IPC handle only avoids the send cost, not the image-processing cost.
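Under the two-process design described above, the tokenizer-side send path might look roughly like this (an illustrative sketch; `build_payload` and the payload fields are assumptions, not the PR's identifiers):

```python
import hashlib

# Sketch of the tokenizer-side decision described above; names are
# assumptions, not the PR's actual code. `process_image` stands in
# for the expensive processor call.
def build_payload(images, cache, process_image):
    """For each raw image, ship only a hash key if its processed
    feature is cached; otherwise process it and ship the feature
    plus its key so the scheduler can fill its own cache."""
    payload = []
    for img in images:
        key = hashlib.sha256(img).hexdigest()
        if key in cache:
            # Cache hit: avoid both reprocessing and resending pixels.
            payload.append({"hash": key, "cached": True})
        else:
            feature = process_image(img)  # expensive processing step
            cache[key] = feature
            payload.append({"hash": key, "cached": False, "data": feature})
    return payload
```

The scheduler side would then resolve `cached: True` entries against its own mm_cache keyed by the same hash.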

@@ -1290,6 +1292,40 @@ def prepare_for_extend(self):
pixel_values = getattr(mm_item, "feature", None)
if isinstance(pixel_values, torch.Tensor):
mm_item.feature = pixel_values.to(self.device, non_blocking=True)
elif isinstance(pixel_values, dict):
Reviewer comment:

Is there any specific reason why this logic cannot be moved over into base_processor.py?

@kousakawang (Author) replied:

I think in base_processor.py we just load the image (raw or base64) and convert it to a tensor (which the model will accept as input); this happens in the tokenizer process. After that, the tokenizer process sends the data to the scheduler process using ZMQ. As a result, if we only cache data in the tokenizer process, the scheduler cannot get the correct pixel values.

new_processed_img_idxes.append(img_idx)
new_processed_imgs.append(images[img_idx])
else:
remove_image_idx.append(img_idx)
Reviewer comment:

Is there any specific reason why cache is used only once? (not sure if i understood correctly)

@kousakawang (Author) replied:

I'm not sure I understand your question correctly. If you mean why we do "remove_image_idx.append(img_idx)" here: this records which images were already cached, and the corresponding image tokens are then removed from the input text to avoid an image-data processing error (the input text should hold the same number of image tokens as len(new_processed_imgs)).
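The bookkeeping described here might be sketched as follows (illustrative names and a plain set standing in for the real cache; not the PR's code): keep only the images that still need processing, and record the indices of cached ones so their placeholders can be restored afterwards.

```python
# Illustrative sketch of the placeholder bookkeeping described above;
# names and the set-based cache are assumptions, not the PR's code.
def split_images(images, cache):
    """Partition input images into new ones (to be processed) and
    cached ones (whose image tokens are dropped before the processor
    call, then reinserted from the cache afterwards)."""
    new_imgs, new_idxes, cached_idxes = [], [], []
    for idx, img in enumerate(images):
        if img in cache:
            cached_idxes.append(idx)   # placeholder removed from text
        else:
            new_idxes.append(idx)
            new_imgs.append(img)
    return new_imgs, new_idxes, cached_idxes
```

This keeps the processor's view consistent: it sees exactly len(new_imgs) image placeholders in the text.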

Comment on lines +485 to +490
proxy_pixel_values = {}
proxy_pixel_values["send_data"] = send_pixel_values
proxy_pixel_values["use_cache_mark"] = use_cache_mark
proxy_pixel_values["hash_keys"] = img_hash_keys
proxy_pixel_values["img_heights"] = processed_img_heights
proxy_pixel_values["delete_keys"] = delete_keys
Reviewer comment:

Similar to this comment, I'm wondering why we need proxy_pixel_values but can't set pixel_values correctly within base_processor.py

@kousakawang (Author) replied:

This is because we cannot modify result["pixel_values"], which is produced by the transformers processor.

# tensor_lists.append(cached_tensor)
insert_cached_ids = (
[v_start_token]
+ img_token_nums[img_idx] * [img_pad_token]
@byjiang1996 commented Aug 19, 2025:

I'm wondering if there is any result['xx'] field that can be used directly to obtain img_token_num instead of calculating img_token_num manually

Comment on lines +438 to +440
start_height = end_height
end_height = start_height + img_height
to_cache_tensor = result["pixel_values"][start_height:end_height]
Reviewer comment:

Similar comments as https://github.com/sgl-project/sglang/pull/8742/files#r2284154152

Is there any result['xx'] field to use instead of calculating img_height/start_height/end_height manually?

@kousakawang (Author) replied:

Yes, I think we can store the height of each image in a struct and reuse it directly next time. But for a new image we still need to calculate it, because each image has a different size and we cannot use a fixed value. (The more important reason is that we cannot modify the logic in the processor's __call__, which is implemented in transformers, not sglang...)
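The slicing being discussed, splitting the stacked pixel_values back into per-image chunks using recorded per-image heights, can be illustrated like this (a sketch with plain lists standing in for the torch tensor; the function name is hypothetical):

```python
# Sketch of splitting a stacked pixel_values array into per-image
# chunks using recorded per-image heights (row counts along dim 0).
# Plain lists stand in for torch tensors; slicing works the same way.
def split_by_heights(pixel_values, img_heights):
    chunks, start = [], 0
    for h in img_heights:
        end = start + h
        chunks.append(pixel_values[start:end])  # rows for one image
        start = end
    return chunks
```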

):
result[feature_name] = result[feature_name].to("cpu")

if cache_mm_image_items and is_qwen2_processor:
Reviewer comment:

Is this logic universal to most VLMs? (It seems not, from my limited understanding.)

If not, could you please move this logic into the qwen vl processor's processor adapter, similar to what we did for phi4-mm? That way this model-specific code can live in model.py instead of in common code paths.

@kousakawang (Author) replied:

Yes, you are right; this logic is just for Qwen2VLImageProcessorFast. I will try to move the code to qwen2.

return int.from_bytes(hash_bytes, byteorder="big")


class FIFOHashTable:
@byjiang1996 commented Aug 19, 2025:

Similar comments as https://github.com/sgl-project/sglang/pull/8742/files#r2284122880

If we can use the cache more than once, we should consider @lru_cache, I think.
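For reference, a FIFO cache of the kind the FIFOHashTable in the diff appears to be could be sketched with collections.OrderedDict (an illustrative sketch under that assumption, not the PR's implementation; an LRU variant would additionally call move_to_end on hits):

```python
from collections import OrderedDict

# Illustrative FIFO cache sketch; not the PR's FIFOHashTable.
class FIFOCache:
    def __init__(self, max_size: int):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        # FIFO: a hit does NOT refresh the entry's position.
        return self._data.get(key)

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.max_size:
            # Evict the oldest inserted entry.
            self._data.popitem(last=False)
        self._data[key] = value
```

Note that functools.lru_cache would not fit directly here, since the cached values are tensors keyed by image hashes rather than results of a pure function over hashable arguments.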
