🚀 AGNOSTOS: Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
[🌐 Project Page] | [📄 Paper] | 🤗 Huggingface Data | 🤗 Huggingface Model | [📺 Video]
Jiaming Zhou1, Ke Ye1, Jiayi Liu1, Teli Ma1, Zifan Wang1, Ronghe Qiu1, Kun-Yu Lin2, Zhilin Zhao3, Junwei Liang1,4
1HKUST (Guangzhou), 2HKU, 3SYSU, 4HKUST
The project introduces AGNOSTOS, a simulation manipulation benchmark designed to rigorously evaluate the cross-task zero-shot generalization of Vision-Language-Action (VLA) models, and proposes Cross-Task In-Context Manipulation (X-ICM), a method that significantly improves the cross-task generalization of VLA models.
Please refer to INSTALL_docker.md to initialize your environment.
Please refer to INSTALL_manual.md (adapted from RoboPrompt) for manual installation instructions.
The benchmark consists of two parts (all data are available at huggingface):
- 📚 18 seen tasks for training (140GB in total, split into five files), links:
  [seen_tasks.part_aa] | [seen_tasks.part_ab] | [seen_tasks.part_ac] | [seen_tasks.part_ad] | [seen_tasks.part_ae]
- 🔍 23 unseen tasks for cross-task testing (20.2GB, one single file), link:
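For example, the files can be fetched programmatically with `huggingface_hub`; this is only a sketch, and the dataset repo ID below is a placeholder for the one behind the 🤗 Huggingface Data link above:

```python
from huggingface_hub import snapshot_download

# Placeholder repo ID: replace it with the dataset linked under "Huggingface Data" above.
snapshot_download(
    repo_id="<org>/<agnostos-dataset>",
    repo_type="dataset",
    local_dir="./downloads",  # seen_tasks.part_* and unseen_tasks.tar will be placed here
)
```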
After downloading, process the files:
```bash
### for seen task data, combine all five files
cat seen_tasks.part_* > seen_tasks.tar
### check the file, it should be "8217d78779acfd2873d0f55849c8efcc"
md5sum seen_tasks.tar
tar -xvf seen_tasks.tar
tar -xvf unseen_tasks.tar
```
Create symbolic links to the sub-folder `data`:
```bash
cd X-ICM
mkdir data
ln -s /path/to/seen_tasks data/
ln -s /path/to/unseen_tasks data/
```
Download our pre-trained dynamics diffusion model from [dynamics_diffusion.tar] for cross-task in-context sample selection.
After downloading, extract it and create a symbolic link to the sub-folder `data`.
```bash
tar -xvf dynamics_diffusion.tar
cd X-ICM
ln -s /path/to/dynamics_diffusion data/
```
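After these steps, the `data` sub-folder is expected to look roughly as follows (names follow the symlink commands above):

```
X-ICM/data/
├── seen_tasks/           # 18 seen tasks for training
├── unseen_tasks/         # 23 unseen tasks for cross-task testing
└── dynamics_diffusion/   # pre-trained dynamics diffusion model
```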
Run the script `eval_XICM.sh` with the parameters below, which are passed positionally in the order listed:
```
### set seed numbers for different runs
seeds: [example: "0,99"]
### set the number of rollouts for each run
episodes: [example: 25]
### set the LLM to use
modelname: [example: Qwen2.5.7B.instruct]
### set the number of cross-task in-context samples
num_icls: [example: 18]
### set the gpu list
gpu_ids: [example: 0,1]
### set the in-context sample selection method
ranking_method: [example: "lang_vis.out"]
```
For dynamics-guided in-context manipulation, run:
```bash
cd X-ICM
bash eval_scripts/eval_XICM.sh "0,99" 25 Qwen2.5.7B.instruct 18 0,1 "lang_vis.out"
```
Reminder: During evaluation, the Stable-Diffusion model and Qwen-LLM models are loaded from huggingface. You can manually download them from huggingface and load them from local paths via the `load_weight` function and `model_path`, respectively.
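A minimal sketch of pre-downloading both models with `huggingface_hub` is shown below; the repo IDs are assumptions, so substitute the exact checkpoints you use and point `load_weight` / `model_path` at the returned local directories as described above:

```python
from huggingface_hub import snapshot_download

# Repo IDs are assumptions; replace them with the checkpoints you actually use.
qwen_path = snapshot_download(repo_id="Qwen/Qwen2.5-7B-Instruct")
sd_path = snapshot_download(repo_id="stable-diffusion-v1-5/stable-diffusion-v1-5")

print(qwen_path)  # local directory for the Qwen LLM
print(sd_path)    # local directory for the Stable-Diffusion model
```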
For random selection of cross-task samples, run:
```bash
cd X-ICM
bash eval_scripts/eval_XICM.sh "0,99" 25 Qwen2.5.7B.instruct 18 0,1 "random"
```
After testing, you can use `gather_score.py` to collect and analyze the results.
We provide the testing results of our X-ICM (7B) and X-ICM (72B) models under the sub-folder `logs/`:
- X-ICM (7B) achieves a 23.5% average success rate and X-ICM (72B) achieves a 30.1% average success rate; both versions outperform all existing VLA models.
- X-ICM (7B) fails on only two tasks, while X-ICM (72B) succeeds on all tasks.
Due to the embodiment gap, existing VLA models need to be fine-tuned on RLBench data.
Please follow your VLA model's fine-tuning guidelines to fine-tune your models on our 18 seen tasks, and then test the models on our 23 unseen tasks.
Modify the `custom_agent.py` file (a sketch of both functions is given after the commands below):

- Load your VLA model in the `load_weights` function;
- Implement VLA model inference in the `_inference` function, including input construction and output format conversion;
- Run the evaluation:

```bash
bash eval_scripts/eval_CustomModel.sh seeds episodes gpu_ids
```

Example:

```bash
bash eval_scripts/eval_CustomModel.sh "0,99" 25 0,1
```
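A minimal sketch of what the two `custom_agent.py` edits might look like is shown below; only `load_weights` and `_inference` are named in these instructions, so the class name, observation keys, and action format are assumptions and the repo's actual base class/API may differ:

```python
import numpy as np


class CustomAgent:  # assumption: the repo's agent class may have a different name/base class
    def load_weights(self):
        # Load your fine-tuned VLA model here, e.g. from a local checkpoint directory.
        # self.model = MyVLAModel.from_pretrained("/path/to/finetuned_checkpoint")  # hypothetical
        self.model = None

    def _inference(self, observation, instruction):
        # 1) Build the model input from the observation and the language instruction.
        rgb = observation["front_rgb"]  # assumption: dict-style observation with a front-camera image
        # 2) Run the VLA model and convert its output to the action format expected
        #    by the evaluator (e.g. end-effector pose + gripper state).
        # action = self.model.predict(rgb, instruction)  # hypothetical call
        action = np.zeros(8)  # placeholder action; replace with your model's converted output
        return action
```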
💡 Note: Different VLA models may require different input image sizes (the default is 256x256).
💡 Please modify `IMAGE_SIZE` in `main_custom.py` accordingly.
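For example, if your VLA model expects 224x224 inputs (a hypothetical value), the edit in `main_custom.py` would be along these lines; the exact form of the constant in the repo may differ:

```python
# main_custom.py (sketch)
IMAGE_SIZE = 224  # hypothetical: match your VLA model's input resolution (default corresponds to 256x256)
```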
This repository is built upon RoboPrompt. Some resources from RVT and RLBench are used in this work.
If you find our work helpful to your research, please kindly give us a star and cite our paper.
```bibtex
@article{zhou2025exploring,
  title={Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization},
  author={Zhou, Jiaming and Ye, Ke and Liu, Jiayi and Ma, Teli and Wang, Zifan and Qiu, Ronghe and Lin, Kun-Yu and Zhao, Zhilin and Liang, Junwei},
  journal={arXiv preprint arXiv:2505.15660},
  year={2025}
}
```