gen2seg: Generative Models Enable Generalizable Instance Segmentation
Om Khangaonkar,
Hamed Pirsiavash
UC Davis
Stable Diffusion 2 (SD): https://huggingface.co/reachomk/gen2seg-sd
ImageNet-1K-pretrained Masked Autoencoder-Huge (MAE-H): https://huggingface.co/reachomk/gen2seg-mae-h
If you want any of our other models, send me an email. If there is sufficient demand, I will also release them publicly.
Please set up the environment by running
conda env create -f environment.yml
and then
conda activate gen2seg
Currently, we have released inference code for our SD and MAE models. To run them, edit the `image_path` variable (pointing to your input image) in each file, then run `python inference_sd.py` or `python inference_mae.py`.
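The scripts produce a feature image for your input (this is the image used for prompting below). For a quick visual check, a small snippet like the following (the paths are placeholders, not files from the repo) puts your input and the generated feature image side by side:

```python
from PIL import Image

# Placeholder paths: your input image and the feature image saved by the inference script.
inp = Image.open("your_image.png").convert("RGB")
feat = Image.open("feature_image.png").convert("RGB").resize(inp.size)

# Paste them next to each other for quick inspection.
canvas = Image.new("RGB", (inp.width * 2, inp.height))
canvas.paste(inp, (0, 0))
canvas.paste(feat, (inp.width, 0))
canvas.save("side_by_side.png")
```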
You will need to have `transformers` and `diffusers` installed, along with standard machine learning packages such as `pytorch` and `numpy`. More details on our specific environment will be released with the training code.
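If you are unsure whether your environment has everything it needs, a quick sanity check like this (just a convenience snippet, not part of the repo) can help:

```python
# Verify that the core dependencies import and that a GPU is visible.
import numpy as np
import torch
import diffusers
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("numpy:", np.__version__)
```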
We have also released code for prompting. If you did not start from our conda environment, please run `pip install opencv-contrib-python` before running this script. Here is how to run it:
python prompting.py \
    --feature_image /path/to/your/feature_image.png \
    --prompt_x [prompt pixel x] \
    --prompt_y [prompt pixel y]
The feature image is the one generated by our model, NOT the original image.
We also support the following optional arguments:
--output_mask /path/to/save/output_mask.png
--sigma [value between 0 and 1]
--threshold [value between 0 and 255]
The `--sigma` and `--threshold` arguments control the amount of averaging for the query vector and the mask threshold, respectively; by default they are 0.01 and 3. See our paper for more details.
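For intuition, prompting boils down to treating the feature color at the prompted pixel as a query and keeping every pixel whose feature color lies close to it. The snippet below is a simplified, self-contained sketch of that idea rather than the actual `prompting.py` implementation (in particular, it uses a plain box average instead of the Gaussian weighting controlled by `--sigma`, and `radius` and `mask_threshold` are illustrative values, not the script's defaults):

```python
import numpy as np
from PIL import Image

def naive_prompt_mask(feature_image_path, prompt_x, prompt_y, radius=3, mask_threshold=30):
    """Toy prompting: average the feature colors around the prompted pixel to get a
    query vector, then keep every pixel whose feature color is close to that query."""
    feats = np.asarray(Image.open(feature_image_path).convert("RGB")).astype(np.float32)
    h, w, _ = feats.shape

    # Query vector: mean feature color in a small window around (prompt_x, prompt_y).
    y0, y1 = max(0, prompt_y - radius), min(h, prompt_y + radius + 1)
    x0, x1 = max(0, prompt_x - radius), min(w, prompt_x + radius + 1)
    query = feats[y0:y1, x0:x1].reshape(-1, 3).mean(axis=0)

    # Mask: pixels whose feature color is within `mask_threshold` of the query.
    dist = np.linalg.norm(feats - query, axis=-1)
    return Image.fromarray(((dist < mask_threshold) * 255).astype(np.uint8))
```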
We have also provided our inference script for SAM, to enable qualitative comparison. Please make sure you download the SAM checkpoint and set its path in the script. You should also edit the `image_path` variable (pointing to your input image).
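If you just want to see SAM's automatic masks outside of our script, a minimal example with the official `segment_anything` package looks roughly like this (the checkpoint file name and model type are placeholders for whichever SAM checkpoint you downloaded; our script's exact settings may differ):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Placeholder checkpoint/model type; match them to the checkpoint you downloaded.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects an HxWx3 uint8 RGB image.
image = cv2.cvtColor(cv2.imread("your_image.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts with "segmentation", "area", ...
print(f"SAM produced {len(masks)} masks")
```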
You will probably need a 48 GB GPU to train our SD model; the MAE model will train on a 24 GB GPU.
We use two datasets: Hypersim and Virtual KITTI 2.
You can download Virtual KITTI 2 directly from this link: https://europe.naverlabs.com/proxy-virtual-worlds-vkitti-2/
Please download the rgb and instanceSegmentation tars. To work off-the-shelf with our current dataloader, extract both into the same directory, so that for a given scene the RGB frames and the segmentation maps end up under `frames/rgb` and `frames/instanceSegmentation`, respectively. See the `VirtualKITTI2._find_pairs` function in `training/dataloaders/load.py` for more details.
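For reference, pairing RGB frames with their segmentation maps under this layout can be done with a simple recursive glob. This is only an illustrative sketch of the expected directory structure, not the dataloader's actual `_find_pairs` logic, and the file-name patterns (`rgb_XXXXX.jpg`, `instancegt_XXXXX.png`) are assumptions based on the standard Virtual KITTI 2 release:

```python
import os
from glob import glob

def find_vkitti2_pairs(root):
    """Pair each RGB frame with its instance-segmentation map (illustrative only)."""
    pairs = []
    pattern = os.path.join(root, "**", "frames", "rgb", "**", "rgb_*.jpg")
    for rgb_path in sorted(glob(pattern, recursive=True)):
        # Swap the rgb directory and file naming for the instanceSegmentation ones.
        seg_path = rgb_path.replace(
            os.path.join("frames", "rgb"), os.path.join("frames", "instanceSegmentation"), 1
        ).replace("rgb_", "instancegt_").replace(".jpg", ".png")
        if os.path.exists(seg_path):
            pairs.append((rgb_path, seg_path))
    return pairs
```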
For Hypersim, I recommend downloading the data with this script: https://github.com/apple/ml-hypersim/tree/main/contrib/99991
Assuming you have a root folder `root`, download the RGB frames (`scene_cam_00_final_preview/*.color.jpg`) into `root/rgb`. You will also need to download the segmentation annotations (`scene_cam_03_geometry_hdf5/*.semantic_instance.hdf5`) and convert them into RGB annotations by coloring the background black and giving each instance mask a unique color (that is not black or white). Please delete all frames that do not have any annotations; keeping them will degrade performance. I also found that deleting scenes with fewer than 10 annotated objects helped. Place the colored annotations into `root/instance-rgb`.
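A rough sketch of this conversion step is shown below. It assumes the instance IDs live under the `dataset` key of each HDF5 file, with negative IDs marking the background (as in the standard Hypersim release); the color assignment is arbitrary, as long as every instance gets a unique color that is neither black nor white:

```python
import h5py
import numpy as np
from PIL import Image

def convert_hypersim_instances(hdf5_path, out_path, min_objects=1):
    """Convert a *.semantic_instance.hdf5 map into an RGB annotation image.

    Background pixels stay black and each instance gets a unique random color whose
    channels lie in [1, 254], so it can never be pure black or white. Returns False
    (and writes nothing) if the frame has too few annotated instances.
    """
    with h5py.File(hdf5_path, "r") as f:
        ids = np.array(f["dataset"])  # assumed key; integer instance IDs, negatives = background

    instance_ids = [int(i) for i in np.unique(ids) if i >= 0]
    if len(instance_ids) < min_objects:
        return False  # caller should delete/skip frames without annotations

    rgb = np.zeros((*ids.shape, 3), dtype=np.uint8)  # background stays black
    rng = np.random.default_rng(0)
    used = set()
    for inst in instance_ids:
        color = tuple(int(c) for c in rng.integers(1, 255, size=3))
        while color in used:  # re-draw on the (unlikely) collision
            color = tuple(int(c) for c in rng.integers(1, 255, size=3))
        used.add(color)
        rgb[ids == inst] = color

    Image.fromarray(rgb).save(out_path)
    return True
```

Frames for which this returns False, along with their RGB counterparts in `root/rgb`, are the ones to delete as described above.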
You will need to specify the path to each dataset at line 360 in `training/train.py`, or line 274 in `training/train_mae_full.py`.
Before beginning, please set the `num_processes` variable in `training/scripts/multi_gpu.yaml` to the number of GPUs you want to parallelize over (for example, `num_processes: 4` for four GPUs).
To train our models, please run the following scripts. Descriptions of the arguments are available in the respective training scripts.
Stable Diffusion:
./training/scripts/train_stable_diffusion_e2e_ft_instance.sh
MAE:
./training/scripts/train_mae_full_e2e_ft_instance.sh
Please let me know if you want more details or have any questions.
Please cite our paper if it was helpful or you liked it.
@article{khangaonkar2025gen2seg,
title={gen2seg: Generative Models Enable Generalizable Instance Segmentation},
author={Om Khangaonkar and Hamed Pirsiavash},
year={2025},
journal={arXiv preprint arXiv:2505.15263}
}