
Conversation

@kousakawang kousakawang commented Aug 4, 2025

Motivation

Modifications

Related to issue #8433.

Add an image input data cache for the qwen2-image-processor.
If an image has already been processed, it is not processed again; instead of sending the raw torch tensor data, only a hash key is sent.
This feature is primarily intended for AI agent input scenarios, where multiple images are typically provided as input but only one of them is new data.

Set the environment variable SGL_CACHE_MM_IMAGE=1 to enable this feature.
Note: this feature should not be enabled in scenarios where only a single image is input each time and the duplication rate across input images is very low.
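The core idea can be sketched as follows. This is an illustrative sketch only, with hypothetical names (`ImageFeatureCache`, `key_for`), not the PR's actual code: hash the raw image bytes to a stable key, and reuse the processed feature when the key is already present.

```python
import hashlib

# Hypothetical sketch of the caching idea described above;
# class and method names are illustrative, not the PR's identifiers.
class ImageFeatureCache:
    def __init__(self):
        self._store = {}  # hash key -> processed feature (e.g. a tensor)

    @staticmethod
    def key_for(image_bytes: bytes) -> str:
        # Hash the raw image bytes to get a stable cache key.
        return hashlib.sha256(image_bytes).hexdigest()

    def lookup(self, image_bytes: bytes):
        # Returns the cached feature, or None on a miss.
        return self._store.get(self.key_for(image_bytes))

    def insert(self, image_bytes: bytes, feature) -> str:
        key = self.key_for(image_bytes)
        self._store[key] = feature
        return key
```

On a hit, only the short hash key needs to cross the process boundary instead of the full pixel tensor.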

Benchmark & Profiling

Unit test for this feature:

python3 -m unittest test_vision_openai_server_a.TestQwen2_5_AI_AGENT.test_ai_agent_opt

Test results:

Accuracy (no change with the feature on/off):
Action_Summary: yes, I <--> Action_Summary: yes, I

Performance:
opt: 8226.41 ms
normal: 9450.75 ms
(≈13% latency reduction with the optimization enabled)

Checklist

@gemini-code-assist bot left a comment

Summary of Changes

Hello @kousakawang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization for multimodal input processing, particularly for image data used with the Qwen2-image-processor. It implements a caching mechanism that stores already processed image features, allowing the system to avoid re-processing and re-transferring redundant data. This enhancement is designed to improve efficiency in AI agent scenarios where multiple images might be provided, but only a subset are truly new, by intelligently managing and reconstructing the full input from a mix of cached and fresh data.

Highlights

  • Image Data Caching: Introduces a FIFO (First-In, First-Out) cache for processed image features, specifically optimized for the Qwen2-image-processor, to prevent redundant processing of identical images.
  • Efficient Data Handling: Modifies multimodal data processing to send only hash keys for cached images and actual tensor data for new images, significantly reducing data transfer overhead between components.
  • Dynamic Input Reconstruction: Enables the system to intelligently reconstruct full multimodal inputs by combining cached image features with newly processed data, based on flags and hash keys, ensuring the model receives complete input.
  • Token Stream Adaptation: Adjusts the input token sequence to correctly represent both newly processed and cached images within the model's input stream, maintaining proper context for the language model.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a caching mechanism for multimodal inputs to avoid reprocessing images, which is a great addition for performance in certain use cases. The implementation is a good start, but I've found a few critical and high-severity issues related to correctness and potential runtime errors that need to be addressed. These include an uninitialized cache attribute, and bugs in the utility functions for cache management and string operations. I've also left some comments on maintainability and code clarity. Please review the detailed comments.

@JustinTong0323 JustinTong0323 self-assigned this Aug 5, 2025

Comment on lines +457 to +461
cached_tensor = self.hash_table.get(img_hash_keys[img_idx])
assert isinstance(
cached_tensor, torch.Tensor
), "invalid cached_tensor"
# tensor_lists.append(cached_tensor)
Reviewer comment:

Is there any reason why self.hash_table's value is not used but eventually use the value from mminput_cache?

It seems self.hash_table alone would suffice

@kousakawang (Author) replied:

As noted in this comment (https://github.com/sgl-project/sglang/pull/8742/files#r2284939106):
mminput_cache is managed by the scheduler subprocess,
while self.hash_table is managed by the tokenizer subprocess.
Inter-process communication is done through IPC (each process uses a different port) via the ZMQ library.
So we cache the image data in the tokenizer process in order to:

  1. avoid processing the same image again, and
  2. save the data-transfer cost between the tokenizer process and the scheduler process.

To avoid sending the pixel values (a torch.Tensor) again, we also need to establish mm_cache in the scheduler process (and this can be a GPU tensor cache).
By doing so, for a normal image the processing route is GPU -> CPU -> send via ZMQ -> CPU -> GPU;
for a cached image, the route is raw image or base64 -> send hash -> get GPU tensor, avoiding sending any real pixel values and any D2H/H2D copies.

Actually, I think an IPC handle would be a better way to avoid sending data, by sharing GPU memory between different processes (in that case the cache would not be necessary).
But since obtaining an IPC handle does not increase the reference count of a torch tensor, accessing a tensor via an IPC handle in the scheduler process carries some risk unless all data is kept alive (by a global container or something). Also, an IPC handle only avoids the send cost, not the image-processing cost.
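Under the two-process design described above, the tokenizer-side send path might look roughly like this (an illustrative sketch; `build_payload` and the payload fields are assumptions, not the PR's identifiers):

```python
import hashlib

# Sketch of the tokenizer-side decision described above; names are
# assumptions, not the PR's actual code. `process_image` stands in
# for the expensive processor call.
def build_payload(images, cache, process_image):
    """For each raw image, ship only a hash key if its processed
    feature is cached; otherwise process it and ship the feature
    plus its key so the scheduler can fill its own cache."""
    payload = []
    for img in images:
        key = hashlib.sha256(img).hexdigest()
        if key in cache:
            # Cache hit: avoid both reprocessing and resending pixels.
            payload.append({"hash": key, "cached": True})
        else:
            feature = process_image(img)  # expensive processing step
            cache[key] = feature
            payload.append({"hash": key, "cached": False, "data": feature})
    return payload
```

The scheduler side would then resolve `cached: True` entries against its own mm_cache keyed by the same hash.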

@@ -1290,6 +1292,40 @@ def prepare_for_extend(self):
pixel_values = getattr(mm_item, "feature", None)
if isinstance(pixel_values, torch.Tensor):
mm_item.feature = pixel_values.to(self.device, non_blocking=True)
elif isinstance(pixel_values, dict):
Reviewer comment:

Is there any specific reason why this logic cannot be moved over into base_processor.py?

@kousakawang (Author) replied:

I think in base_processor.py we just load the image (raw or base64) and convert it to a tensor (which the model will accept as input); this happens in the tokenizer process. After that, the tokenizer process sends the data to the scheduler process using ZMQ. As a result, if we only cache data in the tokenizer process, the scheduler cannot get the correct pixel values.

new_processed_img_idxes.append(img_idx)
new_processed_imgs.append(images[img_idx])
else:
remove_image_idx.append(img_idx)
Reviewer comment:

Is there any specific reason why cache is used only once? (not sure if i understood correctly)

@kousakawang (Author) replied:

I'm not sure I understand your question correctly. If you mean why we do "remove_image_idx.append(img_idx)" here: this records which images were already cached, and the corresponding image tokens are then removed from the input text to avoid an image-data processing error (the input text should hold the same number of image tokens as len(new_processed_imgs)).
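The bookkeeping described here might be sketched as follows (illustrative names and a plain set standing in for the real cache; not the PR's code): keep only the images that still need processing, and record the indices of cached ones so their placeholders can be restored afterwards.

```python
# Illustrative sketch of the placeholder bookkeeping described above;
# names and the set-based cache are assumptions, not the PR's code.
def split_images(images, cache):
    """Partition input images into new ones (to be processed) and
    cached ones (whose image tokens are dropped before the processor
    call, then reinserted from the cache afterwards)."""
    new_imgs, new_idxes, cached_idxes = [], [], []
    for idx, img in enumerate(images):
        if img in cache:
            cached_idxes.append(idx)   # placeholder removed from text
        else:
            new_idxes.append(idx)
            new_imgs.append(img)
    return new_imgs, new_idxes, cached_idxes
```

This keeps the processor's view consistent: it sees exactly len(new_imgs) image placeholders in the text.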

Comment on lines +485 to +490
proxy_pixel_values = {}
proxy_pixel_values["send_data"] = send_pixel_values
proxy_pixel_values["use_cache_mark"] = use_cache_mark
proxy_pixel_values["hash_keys"] = img_hash_keys
proxy_pixel_values["img_heights"] = processed_img_heights
proxy_pixel_values["delete_keys"] = delete_keys
Reviewer comment:

Similar to this comment, I'm wondering why we need proxy_pixel_values but can't set pixel_values correctly within base_processor.py

@kousakawang (Author) replied:

This is because we cannot modify result["pixel_values"], which is produced by the transformers processor.

# tensor_lists.append(cached_tensor)
insert_cached_ids = (
[v_start_token]
+ img_token_nums[img_idx] * [img_pad_token]
@byjiang1996 commented Aug 19, 2025:

I'm wondering if there is any result['xx'] field that can be used directly to obtain img_token_num instead of calculating img_token_num manually

Comment on lines +438 to +440
start_height = end_height
end_height = start_height + img_height
to_cache_tensor = result["pixel_values"][start_height:end_height]
Reviewer comment:

Similar comments as https://github.com/sgl-project/sglang/pull/8742/files#r2284154152

Is there any result['xx'] field to use instead of calculating img_height/start_height/end_height manually?

@kousakawang (Author) replied:

Yes, I think we can store the height of each image in a struct and reuse it directly next time. But for a new image we still need to calculate it, because each image has a different size and we cannot use a fixed value. (The more important reason is that we cannot modify the logic in the processor's __call__, which is implemented in transformers, not sglang...)
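The slicing being discussed, splitting the stacked pixel_values back into per-image chunks using recorded per-image heights, can be illustrated like this (a sketch with plain lists standing in for the torch tensor; the function name is hypothetical):

```python
# Sketch of splitting a stacked pixel_values array into per-image
# chunks using recorded per-image heights (row counts along dim 0).
# Plain lists stand in for torch tensors; slicing works the same way.
def split_by_heights(pixel_values, img_heights):
    chunks, start = [], 0
    for h in img_heights:
        end = start + h
        chunks.append(pixel_values[start:end])  # rows for one image
        start = end
    return chunks
```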

):
result[feature_name] = result[feature_name].to("cpu")

if cache_mm_image_items and is_qwen2_processor:
Reviewer comment:

Is this logic universal to most VLMs? (It seems not, from my limited understanding.)

If not, could you please move this logic into the qwen vl processor's processor adapter, similar to what we did for phi4-mm? That way this model-specific code can live in model.py instead of in common code paths.

@kousakawang (Author) replied:

Yes, you are right; this logic is just for Qwen2VLImageProcessorFast. I will try to move the code to qwen2.

return int.from_bytes(hash_bytes, byteorder="big")


class FIFOHashTable:
@byjiang1996 commented Aug 19, 2025:

Similar comments as https://github.com/sgl-project/sglang/pull/8742/files#r2284122880

If we can use the cache more than once, we should consider @lru_cache, I think.
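For reference, a FIFO cache of the kind the FIFOHashTable in the diff appears to be could be sketched with collections.OrderedDict (an illustrative sketch under that assumption, not the PR's implementation; an LRU variant would additionally call move_to_end on hits):

```python
from collections import OrderedDict

# Illustrative FIFO cache sketch; not the PR's FIFOHashTable.
class FIFOCache:
    def __init__(self, max_size: int):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        # FIFO: a hit does NOT refresh the entry's position.
        return self._data.get(key)

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.max_size:
            # Evict the oldest inserted entry.
            self._data.popitem(last=False)
        self._data[key] = value
```

Note that functools.lru_cache would not fit directly here, since the cached values are tensors keyed by image hashes rather than results of a pure function over hashable arguments.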
