🚀 AGNOSTOS: Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
[🌐 Project Page] | [📄 Paper] | 🤗 Huggingface Data | 🤗 Huggingface Model | [📺 Video]
Jiaming Zhou1, Ke Ye1, Jiayi Liu1, Teli Ma1, Zifan Wang1, Ronghe Qiu1, Kun-Yu Lin2, Zhilin Zhao3, Junwei Liang1,4
1HKUST (Guangzhou), 2HKU, 3SYSU, 4HKUST
The project introduces AGNOSTOS, a simulation manipulation benchmark designed to rigorously evaluate the cross-task zero-shot generalization of Vision-Language-Action (VLA) models, and proposes Cross-Task In-Context Manipulation (X-ICM), a method that significantly improves the cross-task generalization of VLA models.
Please refer to INSTALL_docker.md to initialize your environment.
Please refer to INSTALL_manual.md (adapted from RoboPrompt) for manual installation instructions.
The benchmark consists of two parts (all data are available at huggingface):
- 📚 18 seen tasks for training (140GB in total, split into five files), links:
  [seen_tasks.part_aa] | [seen_tasks.part_ab] | [seen_tasks.part_ac] | [seen_tasks.part_ad] | [seen_tasks.part_ae]
- 🔍 23 unseen tasks for cross-task testing (20.2GB, one single file), link:
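For example, the files can be fetched programmatically with `huggingface_hub`; this is only a sketch, and the dataset repo ID below is a placeholder for the one behind the 🤗 Huggingface Data link above:

```python
from huggingface_hub import snapshot_download

# Placeholder repo ID: replace it with the dataset linked under "Huggingface Data" above.
snapshot_download(
    repo_id="<org>/<agnostos-dataset>",
    repo_type="dataset",
    local_dir="./downloads",  # seen_tasks.part_* and unseen_tasks.tar will be placed here
)
```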
After downloading, process the files:
```bash
### for seen task data, combine all five files
cat seen_tasks.part_* > seen_tasks.tar
### check the file, it should be "8217d78779acfd2873d0f55849c8efcc"
md5sum seen_tasks.tar
tar -xvf seen_tasks.tar
tar -xvf unseen_tasks.tar
```
Create symbolic links to the sub-folder `data`:
```bash
cd X-ICM
mkdir data
ln -s /path/to/seen_tasks data/
ln -s /path/to/unseen_tasks data/
```
Download our pre-trained dynamics diffusion model from [dynamics_diffusion.tar] for cross-task in-context sample selection.
After downloading, extract it and create a symbolic link to the sub-folder `data`.
```bash
tar -xvf dynamics_diffusion.tar
cd X-ICM
ln -s /path/to/dynamics_diffusion data/
```
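After these steps, the `data` sub-folder is expected to look roughly as follows (names follow the symlink commands above):

```
X-ICM/data/
├── seen_tasks/           # 18 seen tasks for training
├── unseen_tasks/         # 23 unseen tasks for cross-task testing
└── dynamics_diffusion/   # pre-trained dynamics diffusion model
```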
Run the script `eval_XICM.sh` with the parameters below, which are passed positionally in the order listed:
```
### set seed numbers for different runs
seeds: [example: "0,99"]
### set the number of rollouts for each run
episodes: [example: 25]
### set the LLM to use
modelname: [example: Qwen2.5.7B.instruct]
### set the number of cross-task in-context samples
num_icls: [example: 18]
### set the gpu list
gpu_ids: [example: 0,1]
### set the in-context sample selection method
ranking_method: [example: "lang_vis.out"]
```
For dynamics-guided in-context manipulation, run:
```bash
cd X-ICM
bash eval_scripts/eval_XICM.sh "0,99" 25 Qwen2.5.7B.instruct 18 0,1 "lang_vis.out"
```
Reminder: During evaluation, the Stable-Diffusion model and Qwen-LLM models are loaded from huggingface. You can manually download them from huggingface and load them from local paths via the `load_weight` function and `model_path`, respectively.
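A minimal sketch of pre-downloading both models with `huggingface_hub` is shown below; the repo IDs are assumptions, so substitute the exact checkpoints you use and point `load_weight` / `model_path` at the returned local directories as described above:

```python
from huggingface_hub import snapshot_download

# Repo IDs are assumptions; replace them with the checkpoints you actually use.
qwen_path = snapshot_download(repo_id="Qwen/Qwen2.5-7B-Instruct")
sd_path = snapshot_download(repo_id="stable-diffusion-v1-5/stable-diffusion-v1-5")

print(qwen_path)  # local directory for the Qwen LLM
print(sd_path)    # local directory for the Stable-Diffusion model
```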
For random selection of cross-task samples, run:
```bash
cd X-ICM
bash eval_scripts/eval_XICM.sh "0,99" 25 Qwen2.5.7B.instruct 18 0,1 "random"
```
After testing, you can use `gather_score.py` to collect and analyze the results.
We provide the testing results of our X-ICM (7B) and X-ICM (72B) models under the sub-folder `logs/`:
- X-ICM (7B) achieves a 23.5% average success rate and X-ICM (72B) achieves a 30.1% average success rate; both versions outperform all existing VLA models.
- X-ICM (7B) fails on only two tasks, while X-ICM (72B) succeeds on all tasks.
Due to the embodiment gap, existing VLA models need to be fine-tuned on RLBench data.
Please follow your VLA model's fine-tuning guidelines to fine-tune your models on our 18 seen tasks, and then test the models on our 23 unseen tasks.
Modify the `custom_agent.py` file (a sketch of both functions is given after the commands below):

- Load your VLA model in the `load_weights` function;
- Implement VLA model inference in the `_inference` function, including input construction and output format conversion;
- Run the evaluation:

```bash
bash eval_scripts/eval_CustomModel.sh seeds episodes gpu_ids
```

Example:

```bash
bash eval_scripts/eval_CustomModel.sh "0,99" 25 0,1
```
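A minimal sketch of what the two `custom_agent.py` edits might look like is shown below; only `load_weights` and `_inference` are named in these instructions, so the class name, observation keys, and action format are assumptions and the repo's actual base class/API may differ:

```python
import numpy as np


class CustomAgent:  # assumption: the repo's agent class may have a different name/base class
    def load_weights(self):
        # Load your fine-tuned VLA model here, e.g. from a local checkpoint directory.
        # self.model = MyVLAModel.from_pretrained("/path/to/finetuned_checkpoint")  # hypothetical
        self.model = None

    def _inference(self, observation, instruction):
        # 1) Build the model input from the observation and the language instruction.
        rgb = observation["front_rgb"]  # assumption: dict-style observation with a front-camera image
        # 2) Run the VLA model and convert its output to the action format expected
        #    by the evaluator (e.g. end-effector pose + gripper state).
        # action = self.model.predict(rgb, instruction)  # hypothetical call
        action = np.zeros(8)  # placeholder action; replace with your model's converted output
        return action
```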
💡 Note: Different VLA models may require different input image sizes (the default is 256x256).
💡 Please modify `IMAGE_SIZE` in `main_custom.py` accordingly.
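For example, if your VLA model expects 224x224 inputs (a hypothetical value), the edit in `main_custom.py` would be along these lines; the exact form of the constant in the repo may differ:

```python
# main_custom.py (sketch)
IMAGE_SIZE = 224  # hypothetical: match your VLA model's input resolution (default corresponds to 256x256)
```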
This repository is built upon RoboPrompt. Some resources from RVT and RLBench are used in this work.
If you find our work helpful to your research, please kindly give us a star and cite our paper.
```bibtex
@article{zhou2025exploring,
  title={Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization},
  author={Zhou, Jiaming and Ye, Ke and Liu, Jiayi and Ma, Teli and Wang, Zifan and Qiu, Ronghe and Lin, Kun-Yu and Zhao, Zhilin and Liang, Junwei},
  journal={arXiv preprint arXiv:2505.15660},
  year={2025}
}
```