Tensor size mismatch for inference on BLIP. #794

@chuanwise

Description

My image:

[image attached]

And the text input:

This is a picture from tweet, and the corresponding text is:
CONGRATS ON HITTING YOIR GOAL GUYS, I'm sure the victims of Harvey will appreciate it greatly https://t.co/daPhXZvhuY
Please judge the humanitarian type in the image, you can only choose one answer exactly from the following types: 
'not_humanitarian', 'injured_or_dead_people', 'other_relevant_information', 'affected_individuals', 'infrastructure_and_utility_damage', 'rescue_volunteering_or_donation_effort', 'vehicle_damage', 'missing_or_found_people'

The tweet is from the CrisisMMD dataset.

My code:

from PIL import Image

def ask_blip(image_path: str, question: str):
    # model, vis_processors and txt_processors are globals created by
    # lavis.models.load_model_and_preprocess (see the sketch below).
    image = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](image).unsqueeze(0).to("cuda")
    question = txt_processors["eval"](question)
    return model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")[0]
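
For reference, the globals above come from LAVIS's load_model_and_preprocess. A minimal sketch of that setup (the model_type here is an assumption; my actual checkpoint may differ):

import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_vqa", model_type="vqav2", is_eval=True, device=device
)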

And the error that was raised:

Traceback (most recent call last):
  File "/root/shared-nvme/baselines/blip.py", line 121, in <module>
    main()
  File "/root/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/shared-nvme/baselines/blip.py", line 95, in main
    responses["humanitarian"] = ask_blip(image_path, textwrap.dedent(f"""\
  File "/root/shared-nvme/baselines/blip.py", line 24, in ask_blip
    return model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")[0]
  File "/root/.local/lib/python3.10/site-packages/lavis/models/blip_models/blip_vqa.py", line 225, in predict_answers
    return self._generate_answers(
  File "/root/.local/lib/python3.10/site-packages/lavis/models/blip_models/blip_vqa.py", line 259, in _generate_answers
    outputs = self.text_decoder.generate(
  File "/root/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2345, in generate
    result = self._beam_search(
  File "/root/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 3760, in _beam_search
    model_outputs = self(**model_inputs, return_dict=True)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 1210, in forward
    outputs = self.bert(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 974, in forward
    encoder_outputs = self.encoder(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 592, in forward
    layer_outputs = layer_module(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 475, in forward
    cross_attention_outputs = self.crossattention(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 346, in forward
    self_outputs = self.self(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lavis/models/med.py", line 219, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (3) must match the size of tensor b (9) at non-singleton dimension 0

I also tried the BLIP implementation on Hugging Face, and it raised the same exception. How can I fix it? :(
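
For completeness, my Hugging Face attempt looked roughly like the sketch below (assuming the Salesforce/blip-vqa-base checkpoint; the exact checkpoint I used may differ):

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
hf_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to("cuda")

# image_path and question are the same inputs used in ask_blip above.
raw_image = Image.open(image_path).convert("RGB")
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")
out = hf_model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))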
