# Llama 3.2 Vision

The latest additions to Meta's family of foundation LLMs include multimodal vision/language models (VLMs) in 11B and 90B sizes. They take high-resolution image inputs (1120x1120) through cross-attention layers, and come in base completion and instruction-tuned chat variants:

* [`Llama-3.2-11B-Vision`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)
* [`Llama-3.2-11B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
* [`Llama-3.2-90B-Vision`](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision)
* [`Llama-3.2-90B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct)

While quantization and optimization efforts are underway, we have started by running the unquantized 11B model in a container based on HuggingFace Transformers that has been updated with the latest support for Llama-3.2-Vision, for a jump start on trying out these exciting new multimodal models - thanks to Meta for continuing to release open Llama models!

!!! abstract "What you need"

    1. One of the following Jetson devices:

        <span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
        <span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>

    2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):

        <span class="blobPink2">JetPack 6 (L4T r36)</span>

    3. Sufficient storage space (preferably with NVMe SSD).

        - `12.8GB` for `llama-vision` container image
        - Space for models (`>25GB`)

    4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

        ```bash
        git clone https://github.com/dusty-nv/jetson-containers
        bash jetson-containers/install.sh
        ```

    5. Request access to the gated models [here](https://huggingface.co/meta-llama) with your HuggingFace API key.

## Code Example

Today Llama-3.2-11B-Vision can be run on Jetson AGX Orin in half precision (bfloat16) via HuggingFace Transformers. Here's a simple code example from the model card for using it:

```python
import requests
import torch

from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

# load the model in bfloat16, letting device_map="auto" place it on the GPU
model_id = "meta-llama/Llama-3.2-11B-Vision"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

# the base (non-instruct) model uses raw completion prompts with an <|image|> tag
prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```

<img src="https://llava-vl.github.io/static/images/view.jpg">

```
If I had to write a haiku for this one, it would be:

A dock on a lake.
A mountain in the distance.
A long exposure.
```
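
To get a rough sense of generation speed (like the tokens/sec figures printed further below), you can time the `generate()` call yourself. This is a minimal sketch rather than the measurement code from `llama_vision.py`, and it assumes the `model`, `processor`, and `inputs` variables from the example above:

```python
import time

# time a single generate() call (assumes model, processor, and inputs from the example above)
start = time.perf_counter()
output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
elapsed = time.perf_counter() - start

# count only the newly generated tokens, excluding the prompt
num_new = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"total {elapsed:.4f}s ({num_new} tokens, {num_new / elapsed:.2f} tokens/sec)")
```

Note that the first call includes warmup overhead, so averaging over a few runs gives more representative numbers.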

Initial testing suggests that Llama-3.2-Vision retains more conversational ability than VLMs typically do after VQA alignment. This [llama_vision.py](https://github.com/dusty-nv/jetson-containers/blob/master/packages/vlm/llama-vision/llama_vision.py) script has interactive completion and image loading to avoid re-loading the model. It can be launched from the container like this:

```bash
jetson-containers run \
  -e HUGGINGFACE_TOKEN=YOUR_API_KEY \
  $(autotag llama-vision) \
  python3 /opt/llama_vision.py \
    --model "meta-llama/Llama-3.2-11B-Vision" \
    --image "/data/images/hoover.jpg" \
    --prompt "I'm out in the" \
    --max-new-tokens 32 \
    --interactive
```

After processing the initial [image](https://github.com/dusty-nv/jetson-containers/blob/master/data/images/hoover.jpg), it will ask you to submit another prompt or image:

```
total 4.8346s (39 tokens, 8.07 tokens/sec)

Enter prompt or image path/URL:

>> 
```
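
The point of the interactive mode is that the weights stay resident in memory between turns, so each follow-up prompt or image only pays the generation cost rather than the model load time. Here is a minimal sketch of that pattern, purely for illustration (it is not the actual contents of `llama_vision.py`), reusing `model` and `processor` from the Transformers example above:

```python
# Minimal sketch of a keep-the-model-loaded interactive loop (illustrative only,
# not the actual llama_vision.py script). Assumes `model` and `processor` are
# already loaded as in the example above.
import requests
from PIL import Image

image = None

while True:
    entry = input(">> ").strip()
    if not entry:
        continue
    if entry.lower() in ("quit", "exit"):
        break
    # treat entries that look like a URL or image file as a new image, otherwise as a prompt
    if entry.startswith("http") or entry.lower().endswith((".jpg", ".jpeg", ".png")):
        image = Image.open(requests.get(entry, stream=True).raw if entry.startswith("http") else entry)
        print(f"loaded image {entry}")
        continue
    if image is None:
        print("please load an image first")
        continue
    prompt = f"<|image|><|begin_of_text|>{entry}"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    print(processor.decode(output[0], skip_special_tokens=True))
```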

We will update this page and container as support for the Llama-3.2-Vision architecture is added to quantization APIs like MLC and llama.cpp (GGUF), which will reduce memory usage and latency.
