Tensor size mismatch for inference on BLIP. #794

@chuanwise

Description

My image:

[image attached]

And the text input:

This is a picture from tweet, and the corresponding text is:
CONGRATS ON HITTING YOIR GOAL GUYS, I'm sure the victims of Harvey will appreciate it greatly https://t.co/daPhXZvhuY
Please judge the humanitarian type in the image, you can only choose one answer exactly from the following types: 
'not_humanitarian', 'injured_or_dead_people', 'other_relevant_information', 'affected_individuals', 'infrastructure_and_utility_damage', 'rescue_volunteering_or_donation_effort', 'vehicle_damage', 'missing_or_found_people'

The tweet is from the CrisisMMD dataset.

My code:

from PIL import Image

def ask_blip(image_path: str, question: str):
    # model, vis_processors and txt_processors are globals created by
    # lavis.models.load_model_and_preprocess (see the sketch below).
    image = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](image).unsqueeze(0).to("cuda")
    question = txt_processors["eval"](question)
    return model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")[0]
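
For reference, the globals above come from LAVIS's load_model_and_preprocess. A minimal sketch of that setup (the model_type here is an assumption; my actual checkpoint may differ):

import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_vqa", model_type="vqav2", is_eval=True, device=device
)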

And the error that was raised:

Traceback (most recent call last):
  File "/root/shared-nvme/baselines/blip.py", line 121, in <module>
    main()
  File "/root/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/shared-nvme/baselines/blip.py", line 95, in main
    responses["humanitarian"] = ask_blip(image_path, textwrap.dedent(f"""\
  File "/root/shared-nvme/baselines/blip.py", line 24, in ask_blip
    return model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")[0]
  File "/root/.local/lib/python3.10/site-packages/lavis/models/blip_models/blip_vqa.py", line 225, in predict_answers
    return self._generate_answers(
  File "/root/.local/lib/python3.10/site-packages/lavis/models/blip_models/blip_vqa.py", line 259, in _generate_answers
    outputs = self.text_decoder.generate(
  File "/root/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2345, in generate
    result = self._beam_search(
  File "/root/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 3760, in _beam_search
    model_outputs = self(**model_inputs, return_dict=True)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 1210, in forward
    outputs = self.bert(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 974, in forward
    encoder_outputs = self.encoder(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 592, in forward
    layer_outputs = layer_module(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 475, in forward
    cross_attention_outputs = self.crossattention(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 346, in forward
    self_outputs = self.self(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 219, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (3) must match the size of tensor b (9) at non-singleton dimension 0

I also tried the BLIP implementation on Hugging Face, and it raised the same exception. How can I fix it? :(
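
For completeness, my Hugging Face attempt looked roughly like the sketch below (assuming the Salesforce/blip-vqa-base checkpoint; the exact checkpoint I used may differ):

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
hf_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to("cuda")

# image_path and question are the same inputs used in ask_blip above.
raw_image = Image.open(image_path).convert("RGB")
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")
out = hf_model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))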
