Tianming Liang¹ Kun-Yu Lin¹ Chaolei Tan¹ Jianguo Zhang² Wei-Shi Zheng¹ Jian-Fang Hu¹*
¹Sun Yat-sen University ²Southern University of Science and Technology
ICCV 2025
- 2025.8.09: Our demo is available on HuggingFace Space! Try it here!
- 2025.6.28: Demo script is available.
- 2025.6.27: Swin-B checkpoints are released.
- 2025.6.25: All training and inference code for ReferDINO is released.
- 2025.3.28: ReferDINO is accepted to ICCV 2025! 🎉
- Our ReferDINO-Plus, an ensemble of ReferDINO and SAM2, achieved 2nd place in the PVUW Challenge RVOS Track at CVPR 2025! 🎉 See our report for details!
- Release demo code and online demo.
- Release model weights.
- Release training and inference code.
We have tested our code with PyTorch 1.11 and 2.5.1, so either version should work.
# Clone the repo
git clone https://github.com/iSEE-Laboratory/ReferDINO.git
cd ReferDINO
# [Optional] Create a clean Conda environment
conda create -n referdino python=3.10 -y
conda activate referdino
# Pytorch
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=11.8 -c pytorch -c nvidia
# MultiScaleDeformableAttention
cd models/GroundingDINO/ops
python setup.py build install
python test.py
cd ../../..
# Other dependencies
pip install -r requirements.txt
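After installation, you may want a quick sanity check that PyTorch, CUDA and the compiled deformable-attention op are usable. The sketch below assumes the ops build above exposes an extension named MultiScaleDeformableAttention (the usual name in GroundingDINO builds); adjust if your build output differs.

```python
# sanity_check.py -- verify PyTorch, CUDA and the compiled deformable-attention op (sketch).
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")

try:
    # The ops setup script above installs a CUDA extension; in GroundingDINO it is
    # typically exposed as `MultiScaleDeformableAttention` (assumption -- check your build log).
    import MultiScaleDeformableAttention  # noqa: F401
    print("MultiScaleDeformableAttention op: OK")
except ImportError as err:
    print(f"Deformable-attention op not found: {err}")
```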
Download the pretrained GroundingDINO weights as follows and put them in the directory pretrained.
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth
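To confirm the downloads are intact, a minimal sketch like the one below just checks that the files deserialize; it assumes the checkpoints live under pretrained/ as described above and that, as in the official GroundingDINO releases, the weights sit under a "model" key (with a fallback otherwise).

```python
# check_pretrained.py -- verify the GroundingDINO checkpoints deserialize (sketch).
import torch

for name in ["groundingdino_swint_ogc.pth", "groundingdino_swinb_cogcoor.pth"]:
    state = torch.load(f"pretrained/{name}", map_location="cpu")
    # Official GroundingDINO checkpoints store weights under "model"; fall back if not.
    weights = state.get("model", state)
    print(f"{name}: {len(weights)} tensors")
```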
We provide a script to quickly apply ReferDINO to a given video and text description.
python demo_video.py <video_path> --text "a description for the target" -ckpt ckpt/ryt_mevis_swinb.pth
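If you want to process a folder of videos, a thin wrapper around demo_video.py is enough. This is only a sketch: it reuses the flags documented above (--text, -ckpt), and the input folder and file extension are assumptions.

```python
# run_demo_batch.py -- call demo_video.py on every .mp4 in a folder (sketch).
import subprocess
from pathlib import Path

VIDEO_DIR = Path("my_videos")          # hypothetical input folder
TEXT = "a description for the target"  # replace with your own query
CKPT = "ckpt/ryt_mevis_swinb.pth"

for video in sorted(VIDEO_DIR.glob("*.mp4")):
    # Mirrors the single-video command documented above.
    subprocess.run(
        ["python", "demo_video.py", str(video), "--text", TEXT, "-ckpt", CKPT],
        check=True,
    )
```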
Please refer to DATA.md for data preparation.
The directory structure is organized as follows.
ReferDINO/
├── configs
├── data
│   ├── coco
│   ├── a2d_sentences
│   ├── jhmdb_sentences
│   ├── mevis
│   ├── ref_davis
│   └── ref_youtube_vos
├── datasets
├── models
├── eval
├── tools
├── util
├── pretrained
├── ckpt
├── misc.py
├── pretrainer.py
├── trainer.py
└── main.py
The results will be saved in output/{dataset}/{version}/. If you encounter OOM errors, please reduce the batch_size (e.g., via -bs) or the num_frames in the config file.
- Pretrain Swin-B on the coco datasets with 8 GPUs. You can either specify the number of GPUs with -ng or the GPU indices with --gids 0 1 2 3.
python main.py -c configs/coco_swinb.yaml -rm pretrain -bs 12 -ng 6 --epochs 20 --version swinb --eval_off
- Finetune on Refer-YouTube-VOS with the pretrained checkpoint.
python main.py -c configs/ytvos_swinb.yaml -rm train -bs 2 -ng 8 --version swinb -pw ckpt/coco_swinb.pth --eval_off
- Inference on Refer-YouTube-VOS.
PYTHONPATH=. python eval/inference_ytvos.py -c configs/ytvos_swinb.yaml -ng 8 -ckpt ckpt/ryt_swinb.pth --version swinb
- Inference on MeViS Valid Set.
PYTHONPATH=. python eval/inference_mevis.py --split valid -c configs/mevis_swinb.yaml -ng 8 -ckpt ckpt/mevis_swinb.pth --version swinb
- Inference on A2D-Sentences (or JHMDB-Sentences).
PYTHONPATH=. python main.py -c configs/a2d_swinb.yaml -rm train -ng 8 --version swinb -ckpt ckpt/a2d_swinb.pth --eval
We have released the following model weights on HuggingFace. Please download them and put them in the directory ckpt.
| Train Set | Backbone | Checkpoint |
|---|---|---|
| coco | Swin-B | coco_swinb.pth |
| coco, ref-youtube-vos | Swin-B | ryt_swinb.pth |
| coco, a2d-sentences | Swin-B | a2d_swinb.pth |
| mevis | Swin-B | mevis_swinb.pth |
| coco, ref-youtube-vos, mevis | Swin-B | ryt_mevis_swinb.pth |
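If you prefer scripting the download, huggingface_hub can fetch the checkpoints directly. This is a sketch: the repository id below is a placeholder, so substitute the actual HuggingFace repo linked above.

```python
# download_ckpt.py -- fetch the released checkpoints into ckpt/ (sketch).
from huggingface_hub import hf_hub_download

REPO_ID = "your-org/ReferDINO-checkpoints"  # placeholder: use the real repo id
FILES = [
    "coco_swinb.pth",
    "ryt_swinb.pth",
    "a2d_swinb.pth",
    "mevis_swinb.pth",
    "ryt_mevis_swinb.pth",
]

for name in FILES:
    path = hf_hub_download(repo_id=REPO_ID, filename=name, local_dir="ckpt")
    print(f"Downloaded {name} -> {path}")
```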
Our code is built upon ReferFormer, SOC and GroundingDINO. We sincerely appreciate these efforts.
If you find our work helpful for your research, please consider citing our paper.
@inproceedings{liang2025referdino,
title={ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations},
author={Liang, Tianming and Lin, Kun-Yu and Tan, Chaolei and Zhang, Jianguo and Zheng, Wei-Shi and Hu, Jian-Fang},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}