
Commit 082bab8

Merge pull request #284 from mschweig/feat/robopoint-tutorial

Add RoboPoint Tutorial

2 parents fe9c80d + 7cedad9

File tree: 5 files changed, +120 −0 lines

docs/images/robopoint_architecture.png (16.9 MB)
docs/images/robopoint_gradio.png (370 KB)
docs/images/robopoint_spot.GIF (9.21 MB)
docs/robopoint.md
mkdocs.yml
docs/robopoint.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Tutorial - RoboPoint VLM for robotic manipulation

[RoboPoint](https://robo-point.github.io/) is a general model that enables several downstream applications such as robot navigation, manipulation, and augmented reality (AR) assistance.

<img width="960px" src="images/robopoint_spot.GIF">

This tutorial provides a demo application for robotic manipulation using a Vision-Language Model (VLM) pipeline combined with a Large Language Model (LLM) to control manipulators through natural language. The RoboPoint inference pipeline generates 2D action points, which can be projected to 3D targets using depth maps or established algorithms such as OpenCV's [solvePnP](https://docs.opencv.org/4.x/d5/d1f/calib3d_solvePnP.html). The computed 3D targets can then be fed into motion planning and deployed to real hardware or to simulation environments like Isaac Sim. Future phases will include ROS2 integration with an Isaac Sim pipeline and the implementation of quantization methods.

In this tutorial we will guide you through:

:white_check_mark: Setting up the environment using jetson-containers

:white_check_mark: Connecting a [Boston Dynamics Spot with Arm](https://bostondynamics.com/products/spot/arm/){:target="_blank"} to RoboPoint VLM

:white_check_mark: Issuing commands using natural language prompts

:white_check_mark: Executing pick-and-place operations

## RoboPoint VLM for embodied AI

From rearranging objects on a table to putting groceries into shelves, robots must plan precise action points to perform tasks accurately and reliably. Despite the recent adoption of vision-language models (VLMs) to control robot behavior, VLMs still struggle to precisely articulate robot actions using language. RoboPoint introduces an automatic synthetic data generation pipeline that instruction-tunes VLMs to robotic domains and needs.

### RoboPoint Pipeline

<img width="960px" src="images/robopoint_architecture.png">

Source: [RoboPoint Paper](https://arxiv.org/pdf/2406.10721)

An RGB image is rendered from a procedurally generated 3D scene. Spatial relations are computed from the camera's perspective, and affordances are generated by sampling points within object masks and object-surface intersections. These instruction-point pairs fine-tune the language model. During deployment, RoboPoint predicts 2D action points from an image and instruction, which are projected into 3D using a depth map. The robot then navigates to these 3D targets with a motion planner. For more information, please refer to the official [paper and project](https://robo-point.github.io).
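
To make the deployment step concrete, here is a minimal back-projection sketch (not taken from the RoboPoint codebase; the intrinsics and image values are placeholders) that lifts a predicted 2D action point into a 3D point in the camera frame using an aligned depth map and a pinhole camera model.

```python
import numpy as np

def backproject_action_point(u, v, depth_image, fx, fy, cx, cy):
    """Back-project a 2D action point (u, v) in pixels into a 3D point in the
    camera frame, using an aligned depth image (meters) and pinhole intrinsics."""
    z = float(depth_image[int(v), int(u)])   # depth at the action point
    x = (u - cx) * z / fx                     # pinhole back-projection
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Example with placeholder intrinsics; use your own camera's calibration values.
depth = np.full((480, 640), 0.75, dtype=np.float32)   # fake 0.75 m depth map
target_cam = backproject_action_point(320, 260, depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
print(target_cam)
```

The resulting point is expressed in the camera frame; it still needs to be transformed into the robot's base frame with the camera extrinsics before it is handed to the motion planner.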

!!! abstract "Credits"
    Thank you to the University of Washington, NVIDIA, the Allen Institute for Artificial Intelligence, and Universidad Catolica San Pablo for publishing their great research.

### Advantages of the proposed architecture

One key advantage of this architecture is its efficiency. The projection of 2D action points into 3D poses is both fast and computationally lightweight. This ensures smooth robotic manipulation, enabling rapid execution of even complex, long-horizon, sequential commands.

## 1. Setting up the environment with `jetson-containers`

!!! abstract "What you need"

    1. One of the following Jetson devices:

        <span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
        <span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>

    2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):

        <span class="blobPink2">JetPack 6 (L4T r36.x)</span>

    3. <span class="markedYellow">NVMe SSD **highly recommended**</span> for storage speed and space

        - `25GB` for `robopoint-v1-vicuna-v1.5-13b` LLM
        - `5.3GB` for `robopoint` container image

    4. Clone and set up [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

        ```bash
        git clone https://github.com/dusty-nv/jetson-containers
        bash jetson-containers/install.sh
        ```

    5. Take a look at the container [README](https://github.com/dusty-nv/jetson-containers/blob/master/packages/robots/robopoint/README.md){:target="_blank"}

    6. Run the RoboPoint container:

        ```bash
        jetson-containers run $(autotag robopoint)
        ```

## 2. Gradio Demo Application

The project includes a Gradio demo application, packaged within the provided container. To access the interface and execute commands, simply open a web browser and navigate to `http://jetson-ip:7860/`. You should see a Gradio web app with demo examples, as shown below.

<img width="960px" src="images/robopoint_gradio.png">

## 3. Boston Dynamics Spot Deployment

Connect the RoboPoint VLM to a Boston Dynamics Spot with Arm for mobile manipulation scenarios. Inference is performed through the Gradio API; the results are then parsed, projected into 3D, and sent to the inverse kinematics (IK) solver of the Boston Dynamics Spot API. The required steps are outlined below.

!!! warning "Disclaimer: Use at your own risk"
    Please note that controlling the robot to grasp an object involves moving parts that may cause damage or harm to people or property. Ensure that the operating environment is clear of obstacles and that all personnel maintain a safe distance from the robot during operation. Always follow the safety guidelines and protocols provided by the robot manufacturer.

!!! abstract "What we will do"

    1. Set up your Python Spot SDK environment: [`Spot SDK`](https://dev.bostondynamics.com/docs/python/quickstart){:target="_blank"}

    2. Deploy the [`RoboPoint jetson-container`](#1-setting-up-the-environment-with-jetson-containers)

    3. Use the RoboPoint [Spot example](https://github.com/mschweig/RoboPoint/tree/master/examples){:target="_blank"} to execute the following steps:

        ```bash
        pip3 install -r requirements.txt
        python3 robopoint_spot_example.py -i frontleft -l "pick the object next to the ball" -g "http://jetson-ip:7860"
        ```

        a. Connect to the robot and acquire a lease to control the robot (see the connection sketch after this list)

        b. Use the Gradio inference API to predict 2D action points

        c. Project the 2D action points to a 3D pose using the robot's API

        d. Run the motion planning

        e. Execute the grasp command
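
For step (a), connecting and taking the lease follows the standard Spot SDK pattern. The sketch below is illustrative only: the hostname, credentials, and client name are placeholders, the exact keyword arguments can differ between SDK releases, and the actual implementation lives in `robopoint_spot_example.py`.

```python
import bosdyn.client
from bosdyn.client.lease import LeaseClient, LeaseKeepAlive

# Placeholders -- replace with your robot's address and credentials.
ROBOT_IP, USER, PASSWORD = "192.168.80.3", "user", "password"

sdk = bosdyn.client.create_standard_sdk("RoboPointSpotClient")
robot = sdk.create_robot(ROBOT_IP)
robot.authenticate(USER, PASSWORD)
robot.time_sync.wait_for_sync()

lease_client = robot.ensure_client(LeaseClient.default_service_name)
with LeaseKeepAlive(lease_client, must_acquire=True, return_at_exit=True):
    # Steps (b)-(e) run here: query the Gradio API, project the predicted
    # 2D point into 3D, and send the grasp request through the Spot API.
    pass
```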

!!! example "Work in Progress"

    - ROS2 Integration
    - Isaac Sim Integration
    - Ask questions in [`#vla`](https://discord.gg/BmqNSK4886){:target="_blank"} on Discord or [`jetson-containers/issues`](https://github.com/dusty-nv/jetson-containers/issues){:target="_blank"}

## Optional: No robot at hand? Demo script with camera input

The Gradio inference API enables seamless command execution for other robots or for testing purposes, simplifying integration and allowing quick deployment across different robotic platforms. We provide a convenient [demo script](https://github.com/dusty-nv/jetson-containers/blob/master/packages/robots/robopoint/client.py){:target="_blank"} to test the API and run inference on images or a live camera input. Start the [RoboPoint container](#1-setting-up-the-environment-with-jetson-containers) and execute the `client.py` demo script.
Run `python3 client.py --help` for input parameter details.

```bash
python3 client.py --request 'Find free space between the plates in the sink' --camera 0
```

You will receive an `output_image.jpg` with the predicted 2D action points, and the coordinates will be logged to the command line. Use this result to verify the inference on your images.
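
If you want to consume the logged coordinates programmatically rather than only inspect `output_image.jpg`, a small parsing step is enough. The sketch below assumes the points are printed as a Python-style list of `(x, y)` tuples normalized to the image size; verify the exact format against your own command-line output before relying on it.

```python
import ast

# Assumed output format: normalized (x, y) tuples logged by the demo script.
raw = "[(0.46, 0.52), (0.62, 0.49)]"
width, height = 640, 480                      # resolution of the input image

points = ast.literal_eval(raw)                # -> [(0.46, 0.52), (0.62, 0.49)]
pixels = [(int(x * width), int(y * height)) for x, y in points]
print(pixels)                                 # e.g. [(294, 249), (396, 235)]
```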

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -119,6 +119,7 @@ nav:
   - LeRobot: lerobot.md
   - ROS2 Nodes: ros.md
   - OpenVLA: openvla.md
+  - RoboPoint 🆕: robopoint.md
   - Image Generation:
   - Flux & ComfyUI: tutorial_comfyui_flux.md
   - Stable Diffusion: tutorial_stable-diffusion.md
