
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

Tianming Liang¹   Kun-Yu Lin¹   Chaolei Tan¹   Jianguo Zhang²   Wei-Shi Zheng¹   Jian-Fang Hu¹*

¹Sun Yat-sen University   ²Southern University of Science and Technology

ICCV 2025


📢 News

  • 2025.8.09 Our demo is available on HuggingFace Space! Try here!
  • 2025.8.09 Demo script is available.
  • 2025.6.28 Swin-B checkpoints are released.
  • 2025.6.27 All training and inference code for ReferDINO is released.
  • 2025.6.25 ReferDINO is accepted to ICCV 2025! 🎉
  • 2025.3.28 Our ReferDINO-Plus, an ensemble of ReferDINO and SAM2, achieved 2nd place in the PVUW Challenge RVOS Track at CVPR 2025! 🎉 See our report for details!

👨‍💻 TODO

  • Release demo code and online demo.
  • Release model weights.
  • Release training and inference code.

🔎 Framework

(Framework overview figure)

Environment Setup

We have tested our code with PyTorch 1.11 and 2.5.1; either version should work.

# Clone the repo
git clone https://github.com/iSEE-Laboratory/ReferDINO.git
cd ReferDINO

# [Optional] Create a clean Conda environment
conda create -n referdino python=3.10 -y
conda activate referdino

# PyTorch
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=11.8 -c pytorch -c nvidia

# MultiScaleDeformableAttention
cd models/GroundingDINO/ops
python setup.py build install
python test.py
cd ../../..

# Other dependencies
pip install -r requirements.txt 
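
Optionally, run a quick sanity check to confirm the environment is working (a minimal check, independent of ReferDINO itself):

# Print the PyTorch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"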

Download the pretrained GroundingDINO weights as follows and put them in the directory pretrained.

wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth
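
For example, assuming you ran the wget commands above from the repository root, one way to move the files into place is:

# Create the pretrained directory and move the downloaded weights into it
mkdir -p pretrained
mv groundingdino_swint_ogc.pth groundingdino_swinb_cogcoor.pth pretrained/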

Try ReferDINO on Your Video

We provide a script to quickly apply ReferDINO to a given video and text description.

python demo_video.py <video_path> --text "a description for the target" -ckpt ckpt/ryt_mevis_swinb.pth
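
For instance (the video path and description below are placeholders; substitute your own):

python demo_video.py /path/to/your_video.mp4 --text "the black dog jumping over the fence" -ckpt ckpt/ryt_mevis_swinb.pth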

Data Preparation

Please refer to DATA.md for data preparation.

The directory structure is organized as follows.

ReferDINO/
├── configs
├── data
│   ├── coco
│   ├── a2d_sentences
│   ├── jhmdb_sentences
│   ├── mevis
│   ├── ref_davis
│   └── ref_youtube_vos
├── datasets
├── models
├── eval
├── tools
├── util
├── pretrained
├── ckpt
├── misc.py
├── pretrainer.py
├── trainer.py
└── main.py
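
After preparing the data, you can quickly confirm the layout matches the tree above (check only the datasets you plan to use; the example below assumes Ref-YouTube-VOS and MeViS):

# List the dataset folders to verify they are in place
ls data/ref_youtube_vos data/mevis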

Get Started

The results will be saved in output/{dataset}/{version}/. If you encounter OOM errors, please reduce the batch_size or the num_frames in the config file.
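
Alternatively, since the commands below already pass -bs, a quick OOM workaround is to lower the batch size on the command line (a sketch; pick a value that fits your GPU memory):

# Fine-tune with a smaller per-GPU batch size
python main.py -c configs/ytvos_swinb.yaml -rm train -bs 1 -ng 8 --version swinb -pw ckpt/coco_swinb.pth --eval_off
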

  • Pretrain Swin-B on the COCO dataset with 6 GPUs. Alternatively, you can specify the GPU indices with --gids 0 1 2 3 (see the example after this list).
python main.py -c configs/coco_swinb.yaml -rm pretrain -bs 12 -ng 6 --epochs 20 --version swinb --eval_off
  • Fine-tune on Refer-YouTube-VOS with the pretrained checkpoint.
python main.py -c configs/ytvos_swinb.yaml -rm train -bs 2 -ng 8 --version swinb -pw ckpt/coco_swinb.pth --eval_off
  • Inference on Refer-YouTube-VOS.
PYTHONPATH=. python eval/inference_ytvos.py -c configs/ytvos_swinb.yaml -ng 8 -ckpt ckpt/ryt_swinb.pth --version swinb
  • Inference on MeViS Valid Set.
PYTHONPATH=. python eval/inference_mevis.py --split valid -c configs/mevis_swinb.yaml -ng 8 -ckpt ckpt/mevis_swinb.pth --version swinb
  • Inference on A2D-Sentences (or JHMDB-Sentences).
PYTHONPATH=. python main.py -c configs/a2d_swinb.yaml -rm train -ng 8 --version swinb -ckpt ckpt/a2d_swinb.pth --eval
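
As mentioned above, you can pin a run to specific GPUs with --gids. For example (the indices are illustrative, and we assume -ng should match the number of indices given):

# Fine-tune on Refer-YouTube-VOS using only GPUs 0-3
python main.py -c configs/ytvos_swinb.yaml -rm train -bs 2 -ng 4 --gids 0 1 2 3 --version swinb -pw ckpt/coco_swinb.pth --eval_off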

Model Zoo

We have released the following model weights on HuggingFace. Please download them and put them in the directory ckpt.

Train Set                      Backbone   Checkpoint
coco                           Swin-B     coco_swinb.pth
coco, ref-youtube-vos          Swin-B     ryt_swinb.pth
coco, a2d-sentences            Swin-B     a2d_swinb.pth
mevis                          Swin-B     mevis_swinb.pth
coco, ref-youtube-vos, mevis   Swin-B     ryt_mevis_swinb.pth
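
As a sketch, one way to fetch a checkpoint from the command line is the huggingface_hub CLI (the repository ID below is a placeholder; replace it with the actual HuggingFace repo hosting the weights):

# Download a checkpoint into ckpt/ (requires: pip install huggingface_hub)
mkdir -p ckpt
huggingface-cli download <hf-repo-id> ryt_swinb.pth --local-dir ckpt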

Acknowledgements

Our code is built upon ReferFormer, SOC and GroundingDINO. We sincerely appreciate these efforts.

Citation

If you find our work helpful for your research, please consider citing our paper.

@inproceedings{liang2025referdino,
    title={ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations},
    author={Liang, Tianming and Lin, Kun-Yu and Tan, Chaolei and Zhang, Jianguo and Zheng, Wei-Shi and Hu, Jian-Fang},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
    year={2025}
}
