GENIUS

This repo contains the codebase for the CVPR 2025 paper "GENIUS: A Generative Framework for Universal Multimodal Search"

Notice (July 5th Update)

Several latent bugs introduced during earlier refactoring have been fixed, including a critical dimension mismatch issue in retriever.py during the decoder training phase. Specifically, mismatched token embeddings caused by misaligned special tokens have been resolved.

In addition, we have:

Updated and explicitly specified library versions in the conda environment to improve reproducibility
Fixed minor path inconsistencies and typos in training scripts
Other minor bugs and inconsistencies that may have caused silent failures or instability

🔁 Please make sure to git pull the latest version and re-create your conda environment accordingly.

Introduction

We propose GENIUS, a universal generative retrieval framework that supports diverse tasks across multiple modalities. By learning discrete, modality‐decoupled IDs via semantic quantization, GENIUS encodes multimodal data into compact identifiers and performs constant‐time retrieval with competitive accuracy.

✨ Key Advantages

Universal Retrieval
Single model handles various retrieval tasks including image‐to‐image, text‐to‐text, image‐to‐text, text‐to‐image, and their combinations.
Fast Retrieval
Constant‐time lookup via discrete ID matching, independent of candidate pool size.
Competitive Accuracy
Comparable to—and sometimes better than—embedding‐based methods, while significatnly reducing inference cost.

Overview

GENIUS consists of three key components that work together in a three-stage training pipeline:

Multimodal Encoder (CLIP-SF)
Encodes joint image/text features using a shared backbone. We leverage UniIR's score‐fusion CLIP model to learn cross‐modal relations without extra pretraining, with pretrained checkpoints available on Hugging Face.
Modality-Decoupled Quantizer
Quantize continuous embeddings into discrete, layered ID codes including modality and semantic information. Through residual quantization training, it learns to encode both images and text into layered, discrete IDs:
- First code: modality indicator (0 = image, 1 = text, 2 = image‐text)
- Subsequent codes: semantic features (objects → attributes → context)

ID Generator
Sequence‐to‐sequence decoder that generates target IDs based on query embeddings. During training, it learns to predict discrete IDs from various input queries (images, text, or pairs with instruction). We employ query augmentation (query-target mixing) to improve generalization and incorporate constrained beam search during decoding to enforce valid ID sequences.

Installation

Clone the repository and create the Conda environment:

git clone https://github.com/sung-yeon-kim/GENIUS-CVPR25.git
cd GENIUS-CVPR25
conda env create -f genius_env.yml

M-BEIR

We utilize M-BEIR (Multimodal BEnchmark for Instructed Retrieval) for training and evaluating our universal multimodal retrieval models. This large-scale benchmark enables comprehensive assessment of model performance across various retrieval tasks.

Downloading M-BEIR

The M-BEIR dataset is available on Hugging Face. To download and prepare the dataset:

Set up Git LFS (Large File Storage):

git lfs install

Clone the dataset repository:

git clone https://huggingface.co/datasets/TIGER-Lab/M-BEIR

The dataset will be used for both training and evaluation phases of GENIUS.

Usage

Multimodal Encoder Feature Extraction (Stage 0)

We utilize UniIR's score fusion model as a replacement for the encoder pretraining stage.

1. Download Pretrained CLIP-SF

mkdir -p checkpoint/CLIP_SF
wget https://huggingface.co/TIGER-Lab/UniIR/resolve/main/checkpoint/CLIP_SF/clip_sf_large.pth -O checkpoint/CLIP_SF/clip_sf_large.pth

2. Feature Extraction

Training Data

Extracts CLIP features for training set → extracted_embed/CLIP_SF/train.

# Navigate to feature extraction directory
cd feature_extraction

# Run feature extraction for training data
bash run_feature_extraction_train.sh

Candidate Pool

Extracts CLIP features for the retrieval candidate pool → extracted_embed/CLIP_SF/cand.

# Run feature extraction for candidate pool
bash run_feature_extraction_cand.sh

Residual Quantization (Stage 1)

cd models/residual_quantization
vim configs_scripts/large/train/inbatch/inbatch.yaml  # Edit config like data path, batch size, etc.
bash configs_scripts/large/train/inbatch/run_inbatch.sh

Generator Training (Stage 2)

cd models/generative_retriever
vim configs_scripts/large/train/inbatch/inbatch.yaml  # Edit config like data path, batch size, etc.
bash configs_scripts/large/train/inbatch/run_inbatch.sh

Inference

Extract CLIP features for candidate pool (if not already done):

cd feature_extraction
bash run_feature_extraction_cand.sh

Compile trie_cpp (recommended for faster inference):

cd models/generative_retriever
c++ -O3 -Wall -shared -std=c++17 -fPIC \
    $(python3 -m pybind11 --includes) \
    trie_cpp.cpp -o trie_cpp$(python3-config --extension-suffix)

Run evaluation:

cd models/generative_retriever
bash configs_scripts/large/eval/inbatch/run_eval.sh

For inference, you can choose between three trie implementations: trie_cpp (fastest), trie (Python), marisa (alternative).

Model Checkpoints

We provide model checkpoints for GENIUS in the 🤗 Hugging Face:

How to Download

# Download the CLIP-SF model (Stage 0)
wget https://huggingface.co/TIGER-Lab/UniIR/resolve/main/checkpoint/CLIP_SF/clip_sf_large.pth -O checkpoint/CLIP_SF/clip_sf_large.pth

# Clone the GENIUS checkpoints (Stage 1 and 2)
git clone https://huggingface.co/Sungyeon/GENIUS

Each Component Checkpoints

CLIP-SF Model (Stage 0): clip_sf_large.pth
Residual Quantization Model (Stage 1): rq_clip_large.pth
Generator Model (Stage 2): GENIUS_t5small.pth

Note: All three models are required for full functionality.

📈 Performance

The results in parentheses denote scores from our reimplemented checkpoints, as the originals were lost during server migration. While close to the paper, slight variations may occur due to retraining randomness.

Universal Information Retrieval

Task	Dataset	CLIP_SF	BLIP_FF	GENIUS (checkpoint)	GENIUSᴿ (checkpoint)
T→I	VisualNews	42.6	23.0	18.5 (18.5)	27.3 (27.3)
	MSCOCO	77.9	75.6	55.1 (55.3)	68.0 (68.0)
	Fashion200K	17.8	25.4	13.7 (14.0)	16.2 (15.9)
T→T	WebQA	84.7	79.5	31.1 (31.9)	42.9 (43.6)
T→(I,T)	EDIS	59.4	50.3	36.6 (37.0)	44.1 (44.1)
	WebQA	78.8	79.7	49.0 (49.0)	59.7 (59.3)
I→T	VisualNews	42.8	21.1	18.4 (18.2)	26.8 (26.8)
	MSCOCO	92.3	88.8	82.7 (83.0)	90.6 (90.7)
	Fashion200K	17.9	27.6	12.8 (12.9)	16.2 (16.6)
I→I	NIGHTS	33.4	33.0	8.1 (8.1)	30.2 (30.0)
	OVEN	39.2	34.7	34.6 (34.5)	38.0 (38.0)
(I,T)→T	InfoSeek	24.0	19.7	10.4 (10.5)	18.0 (18.0)
(I,T)→I	FashionIQ	26.2	28.5	13.1 (13.1)	19.2 (19.3)
	CIRR	43.0	51.4	20.1 (20.1)	38.3 (38.1)
(I,T)→(I,T)	OVEN	60.2	57.8	36.5 (36.6)	48.6 (48.3)
	InfoSeek	44.6	27.7	14.2 (14.3)	28.6 (28.7)

⚡ Efficiency

When the candidate pool grows, embedding‐based retrieval (e.g., CLIP + nearest neighbors) slows down dramatically. In contrast, GENIUS's discrete ID generation is nearly constant time. Empirically, GENIUS is roughly 4× faster than competing generative methods like GRACE.

Citation

If you find this work useful, please cite:

@inproceedings{kim2025genius,
   title={GENIUS: A Generative Framework for Universal Multimodal Search},
   author={Kim, Sungyeon and Zhu, Xinliang and Lin, Xiaofan and Bastan, Muhammet and Gray, Douglas and Kwak, Suha },
   booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
   year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
checkpoint		checkpoint
extracted_embed/CLIP_SF		extracted_embed/CLIP_SF
misc		misc
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
genius_env.yml		genius_env.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GENIUS

Notice (July 5th Update)

Introduction

✨ Key Advantages

Overview

Installation

M-BEIR

Downloading M-BEIR

Usage

Multimodal Encoder Feature Extraction (Stage 0)

1. Download Pretrained CLIP-SF

2. Feature Extraction

Training Data

Candidate Pool

Residual Quantization (Stage 1)

Generator Training (Stage 2)

Inference

Model Checkpoints

How to Download

Each Component Checkpoints

📈 Performance

Universal Information Retrieval

⚡ Efficiency

Citation

License

Acknowledgments & References

Codebases

Related Papers

About

Uh oh!

Releases

Packages

Languages

License

sung-yeon-kim/GENIUS-CVPR25

Folders and files

Latest commit

History

Repository files navigation

GENIUS

Notice (July 5th Update)

Introduction

✨ Key Advantages

Overview

Installation

M-BEIR

Downloading M-BEIR

Usage

Multimodal Encoder Feature Extraction (Stage 0)

1. Download Pretrained CLIP-SF

2. Feature Extraction

Training Data

Candidate Pool

Residual Quantization (Stage 1)

Generator Training (Stage 2)

Inference

Model Checkpoints

How to Download

Each Component Checkpoints

📈 Performance

Universal Information Retrieval

⚡ Efficiency

Citation

License

Acknowledgments & References

Codebases

Related Papers

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages