
Commit 338e503

Merge pull request #212 from dusty-nv/20250925-content
added Llama-Vision
2 parents 038bc24 + 487b4f0 commit 338e503

File tree

2 files changed (+98, -0 lines)

docs/llama_vlm.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
# Llama 3.2 Vision

The latest additions to Meta's family of foundation LLMs include multimodal vision/language models (VLMs) in 11B and 90B sizes, with high-resolution image inputs (1120x1120) and cross-attention between vision and language, in both base completion and instruction-tuned chat variants:

* [`Llama-3.2-11B-Vision`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)
* [`Llama-3.2-11B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
* [`Llama-3.2-90B-Vision`](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision)
* [`Llama-3.2-90B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct)

While quantization and optimization efforts are underway, we have started by running the unquantized 11B model in a container based on HuggingFace Transformers that has been updated with the latest support for Llama-3.2-Vision, to give a jump start on trying out these exciting new multimodal models - thanks to Meta for continuing to release open Llama models!
!!! abstract "What you need"

    1. One of the following Jetson devices:

        <span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
        <span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>

    2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):

        <span class="blobPink2">JetPack 6 (L4T r36)</span>

    3. Sufficient storage space (preferably with NVMe SSD).

        - `12.8GB` for `llama-vision` container image
        - Space for models (`>25GB`)

    4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

        ```bash
        git clone https://github.com/dusty-nv/jetson-containers
        bash jetson-containers/install.sh
        ```

    5. Request access to the gated models [here](https://huggingface.co/meta-llama) with your HuggingFace API key (a sketch of authenticating with that key from Python follows this list).
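Once access has been granted, the API key needs to be visible to the code that downloads the gated weights. Later in this tutorial the key is passed to the container as the `HUGGINGFACE_TOKEN` environment variable; if you prefer to authenticate explicitly from Python instead, here is a minimal sketch (assuming the `huggingface_hub` package is installed and your key is stored in that same environment variable):

```python
import os
from huggingface_hub import login

# authenticate with the HuggingFace Hub so the gated Llama-3.2-Vision weights can be downloaded;
# HUGGINGFACE_TOKEN is assumed to hold your API key, matching the variable used later in this tutorial
login(token=os.environ["HUGGINGFACE_TOKEN"])
```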
## Code Example

Today, Llama-3.2-11B-Vision can be run on Jetson AGX Orin in FP16 via HuggingFace Transformers. Here's a simple code example, adapted from the model card, for using it:
```python
import time
import requests
import torch

from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

# load the unquantized 11B model in bfloat16 across the available device memory
model_id = "meta-llama/Llama-3.2-11B-Vision"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

# the base model performs raw completion of an image + text prompt
prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
print(processor.decode(output[0]))
```

<img src="https://llava-vl.github.io/static/images/view.jpg">

```
If I had to write a haiku for this one, it would be:

A dock on a lake.
A mountain in the distance.
A long exposure.
```
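The instruction-tuned `Llama-3.2-11B-Vision-Instruct` variant can be used in much the same way, with the prompt built through the processor's chat template instead of the raw `<|image|>` tag. Below is a minimal sketch following the usage documented on its model card, reusing the imports from the example above (the question text and image URL are just placeholders):

```python
# chat-tuned variant - loads in place of the base model above
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)

# chat-style messages with an image placeholder that the chat template expands into image tokens
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this scene in one sentence."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```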
Initial testing suggests that Llama-3.2-Vision retains more conversational ability than VLMs typically do after VQA alignment. This [llama_vision.py](https://github.com/dusty-nv/jetson-containers/blob/master/packages/vlm/llama-vision/llama_vision.py) script provides interactive completion and image loading, so the model doesn't need to be re-loaded for each prompt. It can be launched from the container like this:
```bash
jetson-containers run \
  -e HUGGINGFACE_TOKEN=YOUR_API_KEY \
  $(autotag llama-vision) \
    python3 /opt/llama_vision.py \
      --model "meta-llama/Llama-3.2-11B-Vision" \
      --image "/data/images/hoover.jpg" \
      --prompt "I'm out in the" \
      --max-new-tokens 32 \
      --interactive
```
After processing the initial [image](https://github.com/dusty-nv/jetson-containers/blob/master/data/images/hoover.jpg), it will ask you to submit another prompt or image:

```
total 4.8346s (39 tokens, 8.07 tokens/sec)

Enter prompt or image path/URL:

>> 
```
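At its core, such an interactive session is just a loop that keeps the already-loaded `model` and `processor` in memory and swaps in new prompts or images between generations. The following is a minimal sketch in that spirit (not the actual `llama_vision.py` - the image-detection heuristic and timing printout are illustrative only), reusing the objects from the code example above:

```python
def looks_like_image(entry):
    # naive heuristic for this sketch: treat URLs and common image extensions as image inputs
    return entry.startswith(("http://", "https://")) or entry.lower().endswith((".jpg", ".jpeg", ".png"))

image = raw_image  # start with the image already loaded above

while True:
    entry = input("Enter prompt or image path/URL:\n>> ").strip()
    if not entry:
        break
    if looks_like_image(entry):
        # swap in a new image without re-loading the model weights
        image = Image.open(requests.get(entry, stream=True).raw if entry.startswith("http") else entry)
        continue
    prompt = f"<|image|><|begin_of_text|>{entry}"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    print(processor.decode(output[0], skip_special_tokens=True))
    print(f"total {elapsed:.4f}s ({new_tokens} tokens, {new_tokens/elapsed:.2f} tokens/sec)")
```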
We will update this page and the container as support for the Llama-3.2-Vision architecture is added to quantization APIs like MLC and llama.cpp (GGUF), which will reduce memory usage and latency.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ nav:
     - LLaVA: tutorial_llava.md
     - Live LLaVA: tutorial_live-llava.md
     - NanoVLM: tutorial_nano-vlm.md
+    - Llama 3.2 Vision: llama_vlm.md
   - Vision Transformers (ViT):
     - vit/index.md
     - EfficientViT: vit/tutorial_efficientvit.md
