Official implementation of the paper "LangBridge: Interpreting Image as a Combination of Language Embeddings" accepted at ICCV 2025.
- [2025-06] LangBridge paper accepted at ICCV 2025!
- [2025-06] Code and models released!
We propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This design enables pretraining-free adapter transfer across different LLMs while maintaining competitive performance.
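To make the core idea concrete, below is a minimal sketch (not the repository's actual adapter implementation) of mapping visual tokens to linear combinations of LLM vocabulary embeddings; the class name, projection layer, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LangBridgeSketch(nn.Module):
    """Illustrative only: express each visual token as a weighted mixture of
    (frozen) LLM input embeddings restricted to a chosen vocabulary subset."""

    def __init__(self, vision_dim: int, vocab_embeddings: torch.Tensor):
        super().__init__()
        # vocab_embeddings: (V_sub, d_llm) rows of the LLM input embedding matrix
        self.register_buffer("vocab_embeddings", vocab_embeddings)
        self.to_vocab_logits = nn.Linear(vision_dim, vocab_embeddings.size(0))

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, vision_dim) from the vision encoder
        weights = self.to_vocab_logits(visual_tokens).softmax(dim=-1)  # (B, N, V_sub)
        # Each output token is a convex combination of vocabulary embeddings,
        # so it lives directly in the LLM's input embedding space.
        return weights @ self.vocab_embeddings                         # (B, N, d_llm)
```

In this framing, transferring the adapter to another LLM intuitively amounts to replacing `vocab_embeddings` with that LLM's rows for the same vocabulary subset, which is why the transfer can be pretraining-free.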
We use CUDA 11.8 (the FlashAttention wheel below is built for cu118).
git clone https://github.com/CurryX-001/LangBridge.git
cd LangBridge
# Create environment
conda create -n langbridge python=3.10 -y
conda activate langbridge
pip install --upgrade pip
# Install package
pip install -e .
# Install additional packages for training
pip install -e ".[train]"
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation --no-cache-dir
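A quick post-install sanity check (a convenience snippet, not part of the repository):

```python
# Confirm that PyTorch sees the GPU and that the FlashAttention wheel imported cleanly.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash_attn:", flash_attn.__version__)
```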
Download the annotation file for the final mixture of instruction-tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- LLaVA-Pretrain: images
- COCO: train2017
- GQA: images
- OCR-VQA: download script (we save all files as .jpg)
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in ./playground/data:
├── coco
│   └── train2017
├── LLaVA-Pretrain
│   └── images
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
Alternatively, you can download the training data with the provided script:
bash scripts/download_data.sh
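Whichever route you take, it is worth confirming that the layout above is in place before training; the short check below is a convenience sketch, not a script shipped with the repository:

```python
from pathlib import Path

# Expected image folders under ./playground/data (mirrors the tree above).
EXPECTED = [
    "coco/train2017",
    "LLaVA-Pretrain/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

root = Path("./playground/data")
for rel in EXPECTED:
    status = "ok" if (root / rel).is_dir() else "MISSING"
    print(f"{status:>7}  {root / rel}")
```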
The repository also provides a visualization script:
bash scripts/vis.sh
Extract input embeddings from pretrained models and process them with vocabulary mappings:
# Extract embeddings for Llama3-8B with 19200 vocab
python scripts/get_input_embeddings.py \
--model_name "meta-llama/Meta-Llama-3-8B-Instruct" \
--vocab_path "vocab/19200_llama3_sub_llava_share_intersect_llama_qwen.json" \
--output_dir "./embeddings"
# Extract embeddings for Qwen2-7B with 19200 vocab
python scripts/get_input_embeddings.py \
--model_name "Qwen/Qwen2-7B-Instruct" \
--vocab_path "vocab/19200_Qwen_sub_llava_share_intersect_llama_qwen.json" \
--output_dir "./embeddings"
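Conceptually, this step slices the LLM's input embedding matrix down to the rows listed in the vocabulary file. The sketch below illustrates the idea with Hugging Face Transformers; the JSON layout (a flat list of token ids) and the output filename are assumptions rather than the script's actual interface:

```python
import json
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM

# Assumed layout of the vocab file: a flat JSON list of token ids to keep.
with open("vocab/19200_llama3_sub_llava_share_intersect_llama_qwen.json") as f:
    token_ids = json.load(f)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
embeddings = model.get_input_embeddings().weight                # (V, d_llm)
subset = embeddings[torch.tensor(token_ids)].detach().clone()   # (V_sub, d_llm)

Path("./embeddings").mkdir(exist_ok=True)
torch.save(subset, "./embeddings/llama3_19200_input_embeddings.pt")  # assumed name
```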
To generate vocabulary subsets of different sizes, use:
# Create vocabulary subsets with different sizes
python scripts/create_vocab_subset.py --vocab_size 19200 --model_name llama3
python scripts/create_vocab_subset.py --vocab_size 25600 --model_name llama3
python scripts/create_vocab_subset.py --vocab_size 32000 --model_name llama3
python scripts/create_vocab_subset.py --vocab_size 19200 --model_name Qwen
python scripts/create_vocab_subset.py --vocab_size 25600 --model_name Qwen
python scripts/create_vocab_subset.py --vocab_size 32000 --model_name Qwen
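The file names above suggest that each subset is restricted to tokens shared between the Llama-3 and Qwen2 tokenizers (the actual selection criterion is defined in scripts/create_vocab_subset.py). Purely as an illustration of one way such a subset could be assembled, the sketch below intersects the two vocabularies by token string and truncates to the target size; it is not the repository's selection logic:

```python
import json
from transformers import AutoTokenizer

VOCAB_SIZE = 19200  # target subset size, matching the commands above

llama_vocab = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct").get_vocab()
qwen_vocab = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct").get_vocab()

# Illustrative criterion only: tokens whose surface string exists in both vocabularies.
shared_tokens = sorted(set(llama_vocab) & set(qwen_vocab))[:VOCAB_SIZE]

# Save the corresponding Llama-3 token ids (assumed JSON layout: a flat list of ids).
with open("vocab/example_llama3_subset.json", "w") as f:
    json.dump([llama_vocab[t] for t in shared_tokens], f)
```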
Train the LangBridge model using the provided training scripts:
# For Llama3-based models
bash scripts/examples/llama3/train_langbridge.sh
# Example training configurations available:
# - scripts/examples/llama3/pretrain.sh
# - scripts/examples/llama3/finetune.sh
# - scripts/examples/llama3/multimodel_training.sh
For detailed training configurations and advanced options, refer to the example scripts in scripts/examples/llama3/.
Evaluate trained models across multiple benchmarks:
bash scripts/evaluate_all.sh
For LLaVA-NeXT-specific training and evaluation protocols, refer to ./LLaVA-NeXT/Instruction.md.
Pre-trained models are available for download:
| LLM | Connector | Model Type | Download |
|---|---|---|---|
| Qwen2-7B | Qwen2-7B-Pretrain-MLP | LLaVA-Next | ModelScope |
| Qwen2-7B | Qwen2-0.5B-Pretrain-LangBridge | LLaVA-Next | ModelScope |
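If you prefer to script the download, the modelscope Python package can fetch a checkpoint repository as shown below; the model ID is a placeholder, so substitute the one behind the ModelScope link in the table:

```python
from modelscope import snapshot_download

# Placeholder model ID; use the repository linked in the Download column above.
local_dir = snapshot_download(
    "your-namespace/LangBridge-Qwen2-7B-LLaVA-Next",
    cache_dir="./checkpoints",
)
print("Checkpoint downloaded to:", local_dir)
```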
@article{liao2025langbridge,
title={LangBridge: Interpreting Image as a Combination of Language Embeddings},
author={Liao, Jiaqi and Niu, Yuwei and Meng, Fanqing and Li, Hao and Tian, Changyao and Du, Yinuo and Xiong, Yuwen and Li, Dianqi and Zhu, Xizhou and Yuan, Li and others},
journal={arXiv preprint arXiv:2503.19404},
year={2025}
}
For questions, please open an issue or contact: godubnation7@gmail.com
LangBridge is built on LLaVA, LLaVA-NeXT, and lmms-eval. We thank the authors for their excellent work and open-source contributions.