Building on the foundation of the Ovis series, Ovis-U1 is a 3-billion-parameter unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.
The overall architecture of Ovis-U1 (cf. Fig.2 in our report).
- Unified Capabilities: A single model excels at three core tasks: understanding complex scenes, generating images from text, and performing precise edits based on instructions.
- Advanced Architecture: Ovis-U1 features a powerful diffusion-based visual decoder (MMDiT) and a bidirectional token refiner, enabling high-fidelity image synthesis and enhanced interaction between text and vision.
- Synergistic Unified Training: Unlike models trained on single tasks, Ovis-U1 is trained on a diverse mix of understanding, generation, and editing data simultaneously. Our findings show that this approach achieves improved generalization, seamlessly handling real-world multimodal challenges with high accuracy.
- State-of-the-Art Performance: Ovis-U1 achieves leading scores on multiple academic benchmarks, surpassing strong contemporary models in multimodal understanding (69.6 on OpenCompass), generation (83.72 on DPG-Bench), and editing (4.00 on ImgEdit-Bench).
Here are some examples demonstrating the capabilities of Ovis-U1.
Ovis-U1 has been tested with Python 3.10, Torch 2.4.0, Transformers 4.51.3, and DeepSpeed 0.15.4. For a full list of package dependencies, please see requirements.txt
.
git clone git@github.com:AIDC-AI/Ovis-U1.git
conda create -n ovis-u1 python=3.10 -y
conda activate ovis-u1
cd Ovis-U1
pip install -r requirements.txt
pip install -e .
We provide simple scripts to test the different capabilities of Ovis-U1.
For single image understanding, please run
python test_img_to_txt.py
For multi-image understanding, please run
python test_multi_img_to_txt.py
For text-to-image, please run
python test_txt_to_img.py \
--height 1024 \
--width 1024 \
--steps 50 \
--seed 42 \
--txt_cfg 5
For image editing, please run
python test_img_edit.py \
--steps 50 \
--img_cfg 1.5 \
--txt_cfg 6
Alternatively, you can try Ovis-U1 directly in your browser on
Model | Avg | MMB | MMS | MMMU | MathVista | Hallusion | AI2D | OCRBench | MMVet |
---|---|---|---|---|---|---|---|---|---|
GPT-4o | 75.4 | 86 | 70.2 | 72.9 | 71.6 | 57 | 86.3 | 82.2 | 76.9 |
InternVL2.5-2B | 59.9 | 70.9 | 54.3 | 43.2 | 51.1 | 42.3 | 74.9 | 80.2 | 62.6 |
SAIL-VL-2B | 61 | 73.7 | 56.5 | 44.1 | 62.8 | 45.9 | 77.4 | 83.1 | 44.2 |
InternVL3-2B | 61.1 | 78 | 61.1 | 48.7 | 57.6 | 41.9 | 78.6 | 83.1 | 67 |
Qwen2.5-VL-3B | 64.5 | 76.8 | 56.3 | 51.2 | 61.2 | 46.6 | 81.4 | 82.8 | 60 |
Ovis2-2B | 65.2 | 76.9 | 56.7 | 45.6 | 64.1 | 50.2 | 82.7 | 87.3 | 58.3 |
SAIL-VL-1.5-2B | 67 | 78.5 | 62.6 | 46.4 | 67 | 50 | 83.7 | 89.1 | 58.8 |
Ristretto-3B | 67.7 | 80.2 | 62.8 | 51.3 | 67.6 | 50.2 | 84.2 | 84.7 | 60.7 |
Ovis-U1 | 69.6 | 77.8 | 61.3 | 51.1 | 69.4 | 56.3 | 85.6 | 88.3 | 66.7 |
Model | Overall | Single object | Two object | Counting | Colors | Position | Attribute binding |
---|---|---|---|---|---|---|---|
GPT-4o | 0.84 | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 |
BAGEL | 0.82 | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 |
BAGEL π | 0.88 | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 |
UniWorld-V1 | 0.80 | 0.99 | 0.93 | 0.79 | 0.89 | 0.49 | 0.70 |
UniWorld-V1 π | 0.84 | 0.98 | 0.93 | 0.81 | 0.89 | 0.74 | 0.71 |
OmniGen | 0.68 | 0.98 | 0.84 | 0.66 | 0.74 | 0.40 | 0.43 |
OmniGen2 | 0.80 | 1 | 0.95 | 0.64 | 0.88 | 0.55 | 0.76 |
OmniGen2 π | 0.86 | 0.99 | 0.96 | 0.74 | 0.98 | 0.71 | 0.75 |
Ovis-U1 | 0.89 | 0.98 | 0.98 | 0.90 | 0.92 | 0.79 | 0.75 |
π denotes using the rewritten prompts
Model | Overall | Global | Entity | Attribute | Relation | Other |
---|---|---|---|---|---|---|
BAGEL | 85.07 | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 |
UniWorld-V1 | 81.38 | 83.64 | 88.39 | 88.44 | 89.27 | 87.22 |
OmniGen | 81.16 | 87.90 | 88.97 | 88.47 | 87.95 | 83.56 |
OmniGen2 | 83.57 | 88.81 | 88.83 | 90.18 | 89.37 | 90.27 |
Ovis-U1 | 83.72 | 82.37 | 90.08 | 88.68 | 93.35 | 85.20 |
Model | Overall | Add | Adjust | Extract | Replace | Remove | Background | Style | Hybrid | Action |
---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | 4.2 | 4.61 | 4.33 | 2.9 | 4.35 | 3.66 | 4.57 | 4.93 | 3.96 | 4.89 |
MagicBrush | 1.90 | 2.84 | 1.58 | 1.51 | 1.97 | 1.58 | 1.75 | 2.38 | 1.62 | 1.22 |
Instruct-P2P | 1.88 | 2.45 | 1.83 | 1.44 | 2.01 | 1.50 | 1.44 | 3.55 | 1.2 | 1.46 |
AnyEdit | 2.45 | 3.18 | 2.95 | 1.88 | 2.47 | 2.23 | 2.24 | 2.85 | 1.56 | 2.65 |
UltraEdit | 2.7 | 3.44 | 2.81 | 2.13 | 2.96 | 1.45 | 2.83 | 3.76 | 1.91 | 2.98 |
OmniGen | 2.96 | 3.47 | 3.04 | 1.71 | 2.94 | 2.43 | 3.21 | 4.19 | 2.24 | 3.38 |
Step1X-Edit | 3.06 | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 |
ICEdit | 3.05 | 3.58 | 3.39 | 1.73 | 3.15 | 2.93 | 3.08 | 3.84 | 2.04 | 3.68 |
BAGEL | 3.2 | 3.56 | 3.31 | 1.7 | 3.3 | 2.62 | 3.24 | 4.49 | 2.38 | 4.17 |
UniWorld-V1 | 3.26 | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 |
OmniGen2 | 3.44 | 3.57 | 3.06 | 1.77 | 3.74 | 3.2 | 3.57 | 4.81 | 2.52 | 4.68 |
Ovis-U1 | 4.00 | 4.13 | 3.62 | 2.98 | 4.45 | 4.06 | 4.22 | 4.69 | 3.45 | 4.61 |
Model | Avg | Background Change | Color Alteration | Material Modification | Motion Change | Portrait Beautification | Style Transfer | Subject Addition | Subject Removal | Subject Replacement | Text Modification | Tone Transformation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | 7.534 | 7.205 | 6.491 | 6.607 | 8.096 | 7.768 | 6.961 | 7.622 | 8.331 | 8.067 | 7.427 | 8.301 |
AnyEdit | 3.212 | 4.663 | 4.260 | 2.537 | 2.024 | 3.479 | 2.032 | 3.995 | 3.089 | 3.180 | 0.922 | 5.151 |
Instruct-Pix2Pix | 3.684 | 3.825 | 5.182 | 3.688 | 3.509 | 4.339 | 4.560 | 3.461 | 2.031 | 4.237 | 0.955 | 4.733 |
MagicBrush | 4.518 | 5.637 | 5.136 | 5.078 | 4.513 | 4.487 | 4.439 | 5.252 | 3.704 | 4.941 | 1.384 | 5.130 |
OmniGen | 5.062 | 5.281 | 6.003 | 5.308 | 2.916 | 3.087 | 4.903 | 6.628 | 6.352 | 5.616 | 4.519 | 5.064 |
Gemini | 6.315 | 6.781 | 6.369 | 6.040 | 6.938 | 5.591 | 4.676 | 7.501 | 6.447 | 7.003 | 5.765 | 6.350 |
Step1X-Edit | 6.701 | 6.547 | 6.545 | 6.204 | 6.483 | 6.787 | 7.221 | 6.975 | 6.512 | 7.068 | 6.921 | 6.448 |
Doubao | 6.754 | 7.430 | 7.095 | 6.339 | 6.973 | 6.972 | 6.767 | 7.674 | 6.748 | 7.447 | 3.471 | 7.383 |
BAGEL | 6.519 | 7.324 | 6.909 | 6.381 | 4.753 | 4.573 | 6.150 | 7.896 | 7.164 | 7.021 | 7.320 | 6.218 |
Ovis-U1 | 6.420 | 7.486 | 6.879 | 6.208 | 4.790 | 5.981 | 6.463 | 7.491 | 7.254 | 7.266 | 4.482 | 6.314 |
- Note that the leaderboard has been updated by this commit. The results shown here are from an earlier version.
If you find Ovis-U1 useful for your research or applications, please cite our technical report:
@article{wang2025ovisu1,
title={Ovis-U1 Technical Report},
author={Wang, Guo-Hua and Zhao, Shanshan and Zhang, Xinjie and Cao, Liangfu and Zhan, Pengxin and Duan, Lunhao and Lu, Shiyin and Fu, Minghao and Zhao, Jianshan and Li, Yang and Chen, Qing-Guo},
journal={arXiv preprint arXiv:2506.23044},
year={2025}
}
The code is built upon Ovis and FLUX. We thank their authors for open-sourcing their great work.
This project is released under Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0, SPDX-License-identifier: Apache-2.0).
We used compliance checking algorithms during the training process, to ensure the compliance of the trained model to the best of our ability. Due to complex data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.
We are looking for both interns and full-time researchers to join our team, focusing on multimodal understanding, generation, reasoning, AI agents, and unified multimodal models. If you are interested in exploring these exciting areas, please reach out to us at qingguo.cqg@alibaba-inc.com.