
Conversation

@mickqian mickqian commented May 2, 2025

Motivation

Previously, for GPU tensors, the hashing required by multimodal models would:

  1. move the tensor from device to host (D2H)
  2. hash the tensor on CPU with SHA-256

With some simple profiling, hashing a typical image feature (e.g., from the sgl-logo image, shape=[3312, 1176], dtype=float32 in the Qwen-VL case) costs ~80 ms.
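For context, the old CPU path can be sketched roughly as below (the helper name `cpu_tensor_hash` is hypothetical, not the PR's actual function); the D2H copy forces a device synchronization, which is where most of the time goes:

```python
import hashlib

import torch


def cpu_tensor_hash(tensor: torch.Tensor) -> int:
    # D2H copy: synchronizes the stream and dominates the ~80 ms cost
    data = tensor.contiguous().cpu().numpy().tobytes()
    # SHA-256 over the raw bytes, truncated to a 64-bit integer
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
```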

Modifications

  1. Add a Triton hash kernel, which:
    1. hashes the blocks in parallel on the GPU
    2. reduces the block results sequentially on the CPU
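The two-step scheme can be sketched in plain PyTorch (a hedged stand-in: the PR implements step 1 as a Triton kernel, and SHA-256 plus the combine constant below are illustrative, not the kernel's real mixing function):

```python
import hashlib

import torch


def blockwise_hash(tensor: torch.Tensor, block_bytes: int = 4096) -> int:
    data = tensor.contiguous().cpu().numpy().tobytes()
    # Step 1: hash each block independently (done in parallel on the GPU
    # by the real Triton kernel; SHA-256 here is only a stand-in mixer).
    block_hashes = [
        int.from_bytes(hashlib.sha256(data[i : i + block_bytes]).digest()[:8], "big")
        for i in range(0, len(data), block_bytes)
    ]
    # Step 2: reduce the block results sequentially on the CPU with an
    # order-sensitive combine, so reordered blocks do not collide.
    acc = 0
    for h in block_hashes:
        acc = (acc * 1000003 ^ h) & ((1 << 64) - 1)
    return acc
```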

Profiling

Hash performance

| hash | time (1000 runs) |
| --- | --- |
| original hash | ~11 s |
| Triton hash | ~0.6 s |

MMMU

| model | accuracy (before) | time (before) | accuracy (after) | time (after) |
| --- | --- | --- | --- | --- |
| Gemma-3-4b-it | 0.384 | 226.2 | 0.384 | 218.3 |
| Qwen2.5-VL-7B-Instruct | 0.467 | 352.7 | 0.467 | 338.2 |
| Minicpmv | 0.436 | 232.5 | 0.436 | 230.7 |

Correctness

  1. Consistency: The kernel first hashes the tensor in blocks, then reduces the block results sequentially, so repeated hashes of the same tensor are deterministic.
  2. Collision: No collisions were observed across 10,000 randomly generated tensors with the same shape as real data, using the following script:
import torch


def test_hash_collision(hasher, name, num_tensors=10000, tensor_shape=(128,)):
    hashes = set()
    collision_count = 0

    for i in range(num_tensors):
        tensor = (
            torch.rand(size=tensor_shape, dtype=torch.float32, device="cuda") * 2 - 1
        )
        h = hasher(tensor)
        if h in hashes:
            collision_count += 1
        else:
            hashes.add(h)

    print(f"hasher: {name}")
    print(f"Tested {num_tensors} random tensors of shape {tensor_shape}")
    print(f"Collisions found: {collision_count}")
    print(f"Collision rate: {collision_count / num_tensors:.6f}")
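The script above hard-codes device="cuda"; a CPU-runnable variant with a SHA-256 stand-in hasher (names here are illustrative, not from the PR) looks like this:

```python
import hashlib

import torch


def sha256_hasher(t: torch.Tensor) -> str:
    # Stand-in hasher: SHA-256 over the tensor's raw bytes
    return hashlib.sha256(t.contiguous().cpu().numpy().tobytes()).hexdigest()


num_tensors = 1000
hashes = set()
for _ in range(num_tensors):
    tensor = torch.rand(size=(128,), dtype=torch.float32) * 2 - 1
    hashes.add(sha256_hasher(tensor))

collisions = num_tensors - len(hashes)
print(f"Collisions found: {collisions}")
```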

Future Work

  1. Should we implement MurmurHash/xxHash directly instead?
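For reference, the hashes named above are non-cryptographic: they trade collision resistance for speed. A minimal pure-Python FNV-1a (a simpler cousin of MurmurHash/xxHash, shown only as an illustration, not as the proposed implementation) is:

```python
def fnv1a_64(data: bytes) -> int:
    # FNV-1a, 64-bit: xor each byte into the state, then multiply by the prime
    h = 0xCBF29CE484222325  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2^64
    return h
```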


@mickqian mickqian requested a review from BBuf as a code owner May 17, 2025 03:15
@mickqian mickqian changed the title vlm: speed up gpu-tensor hash vlm: tensor hash kernel May 17, 2025
@JustinTong0323 (Collaborator) commented:

[image] a typo is found...
@zhyncs zhyncs merged commit 626ccb7 into sgl-project:main May 18, 2025
0 of 6 checks passed
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
@@ -222,7 +223,8 @@ def tensor_hash(tensor_list) -> int:
for x in tensor_list
]
tensor = torch.concat(tensor_list)

if tensor.is_cuda:
Contributor:
Why will a tensor be on GPU?

Collaborator Author:

if the fast version of the processor is enabled, the returned tensor will be on GPU by default, see here

)

# TODO: threads can't be synced on triton kernel
final_hash = intermediate_hashes.sum().item()
Contributor:
sum is not a good combinator for hash function
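The reviewer's point can be demonstrated directly: a plain sum of per-block hashes is permutation-invariant, so any reordering of blocks collides (SHA-256 below stands in for the kernel's block hashes):

```python
import hashlib


def block_hashes(data: bytes, block_size: int) -> list:
    # Per-block 64-bit hashes, mimicking the intermediate results
    return [
        int.from_bytes(hashlib.sha256(data[i : i + block_size]).digest()[:8], "big")
        for i in range(0, len(data), block_size)
    ]


a = b"block-one!!!" + b"block-two!!!"
b = b"block-two!!!" + b"block-one!!!"  # same blocks, swapped order
# Different inputs, identical sum of block hashes -> guaranteed collision
assert a != b
assert sum(block_hashes(a, 12)) == sum(block_hashes(b, 12))
```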

@merrymercy (Contributor):

This is a very bad hash function! @yizhang2077 @zhyncs @mickqian

@merrymercy (Contributor):

related links:
NVIDIA/TensorRT-LLM#4145
pytorch/pytorch#2569
