UI-Venus leverages Reinforcement Fine-Tuning (RFT) to achieve state-of-the-art performance in GUI understanding and action prediction across mobile, desktop, and web interfaces.
UI-Venus Benchmark Performance

Figure: Performance of UI-Venus across multiple benchmark datasets. UI-Venus achieves state-of-the-art (SOTA) results on key UI understanding and interaction benchmarks, including ScreenSpot-Pro, ScreenSpot-v2, OSWorld-G, UI-Vision, and AndroidWorld. The results demonstrate its superior capability in visual grounding, UI navigation, cross-platform generalization, and complex task reasoning.
We introduce UI-Venus, a state-of-the-art UI agent that pushes the boundaries of visual grounding and interactive navigation. Below are the core contributions and innovations:
We develop and open-source UI-Venus, a state-of-the-art UI agent trained on 350K high-quality, professionally annotated samples.
In addition to model checkpoints (7B & 72B), we release:

- Full evaluation pipeline
- Inference scripts

GitHub Repository | Hugging Face Model
Beyond grounding, we design fine-grained, action-wise reward functions for reinforcement learning in GUI navigation tasks.
Our approach achieves:
- Competitive performance on AndroidWorld
- Better credit assignment in long-horizon tasks
- End-to-end learnable action prediction with policy gradients
This marks a significant step toward autonomous UI interaction with precise feedback signals.
We conduct a systematic analysis of UI interaction data quality, identifying key challenges in real-world traces.
To improve training signal fidelity, we propose a three-stage data refinement pipeline:
- Prompt Rewrite – Clarify ambiguous user intentions
- Trace Editing – Fix incorrect or redundant actions
- Trace Generation – Synthesize missing steps via LLM-augmented modeling
Result: RL + higher-quality data → more robust and generalizable agents.
Manually install the dependencies:

```bash
pip install -r requirements.txt
```
Use the shell scripts to launch the grounding evaluation. The evaluation setup follows the same protocol as ScreenSpot, including data format, annotation structure, and metric calculation.
- For the 7B model:

  ```bash
  bash scripts/run_gd_7b.sh
  ```

- For the 72B model:

  ```bash
  bash scripts/run_gd_72b.sh
  ```
Configuration required. Set the following paths in your script:

- `screenspot_imgs`: folder containing your screenshots (e.g., Screenspot-pro/images)
- `screenspot_test`: folder containing all annotation JSON files (e.g., Screenspot-pro/annotations)
- `model_name_or_path`: model checkpoint path (e.g., /root/ckpt/huggingface/)
- `log_path`: output folder
We provide example templates for the input and output formats:

- Input format (grounding annotations): `examples/grounding_meta_format.json` defines the standard structure of annotation files in `screenspot_test`.
- Output format (model predictions): `examples/grounding_result_format.json` shows the recommended format for model prediction results.

Please refer to these examples when preparing your data or parsing the evaluation outputs.
Use the shell scripts to launch the navigation evaluation:

- For the 7B model:

  ```bash
  bash scripts/run_navi_7b.sh
  ```

- For the 72B model:

  ```bash
  bash scripts/run_navi_72b.sh
  ```
Configuration required. Set the following paths in your script:

- `model_path`: path to the model checkpoint (e.g., inclusionAI/UI-Venus-Navi-7B)
- `input_file`: input JSON file containing navigation tasks (str, default: examples/trace.json). Each entry must include:
  - `task`: the user instruction (string)
  - `image_path`: screenshot file path
- `output_file`: path to save the agent's execution history (JSON format)
| Model | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
|---|---|---|---|---|---|---|---|
| UI-TARS-1.5 | - | - | - | - | - | - | 94.2 |
| Seed-1.5-VL | - | - | - | - | - | - | 95.2 |
| GPT-4o | 26.6 | 24.2 | 24.2 | 19.3 | 12.8 | 11.8 | 20.1 |
| Qwen2.5-VL-7B | 97.6 | 87.2 | 90.2 | 74.2 | 93.2 | 81.3 | 88.8 |
| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
| LPO | 97.9 | 82.9 | 95.9 | 86.4 | 95.6 | 84.2 | 90.5 |
| UI-Venus-Ground-7B (Ours) | 99.0 | 90.0 | 97.0 | 90.7 | 96.2 | 88.7 | 94.1 |
| UI-Venus-Ground-72B (Ours) | 99.7 | 93.8 | 95.9 | 90.0 | 96.2 | 92.6 | 95.3 |

Table: Performance comparison on ScreenSpot-v2. Scores are accuracy (%).
Performance comparison of GUI agent models across six task categories on ScreenSpot-Pro. Scores are in percentage (%). T = Text, I = Icon. *: reproduced; †: trained from UI-TARS-1.5-7B.
| Model | CAD (T/I) | Dev (T/I) | Creative (T/I) | Scientific (T/I) | Office (T/I) | OS (T/I) | Avg T | Avg I | Overall | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 2.0 / 0.0 | 1.3 / 0.0 | 1.0 / 0.0 | 2.1 / 0.0 | 1.1 / 0.0 | 0.0 / 0.0 | 1.3 | 0.0 | 0.8 | Closed |
| Claude Computer Use | 14.5 / 3.7 | 22.0 / 3.9 | 25.9 / 3.4 | 33.9 / 15.8 | 30.1 / 16.3 | 11.0 / 4.5 | 23.4 | 7.1 | 17.1 | Closed |
| UI-TARS-1.5 | - / - | - / - | - / - | - / - | - / - | - / - | - | - | 61.6 | Closed |
| Seed1.5-VL | - / - | - / - | - / - | - / - | - / - | - / - | - | - | 60.9 | Closed |
| Qwen2.5-VL-7B* | 16.8 / 1.6 | 46.8 / 4.1 | 35.9 / 7.7 | 49.3 / 7.3 | 52.5 / 20.8 | 37.4 / 6.7 | 38.9 | 7.1 | 26.8 | SFT |
| Qwen2.5-VL-72B* | 54.8 / 15.6 | 65.6 / 16.6 | 63.1 / 19.6 | 78.5 / 34.5 | 79.1 / 47.2 | 66.4 / 29.2 | 67.3 | 25.0 | 51.2 | SFT |
| UI-TARS-7B | 20.8 / 9.4 | 58.4 / 12.4 | 50.0 / 9.1 | 63.9 / 31.8 | 63.3 / 20.8 | 30.8 / 16.9 | 47.8 | 16.2 | 35.7 | SFT |
| UI-TARS-72B | 18.8 / 12.5 | 62.9 / 17.2 | 57.1 / 15.4 | 64.6 / 20.9 | 63.3 / 26.4 | 42.1 / 15.7 | 50.9 | 17.6 | 38.1 | SFT |
| Phi-Ground-7B | 26.9 / 17.2 | 70.8 / 16.7 | 56.6 / 13.3 | 58.0 / 29.1 | 76.4 / 44.0 | 55.1 / 25.8 | 56.4 | 21.8 | 43.2 | RL |
| UI-TARS-1.5-7B | - / - | - / - | - / - | - / - | - / - | - / - | - | - | 49.6 | RL |
| GTA1-7B† | 53.3 / 17.2 | 66.9 / 20.7 | 62.6 / 18.2 | 76.4 / 31.8 | 82.5 / 50.9 | 48.6 / 25.9 | 65.5 | 25.2 | 50.1 | RL |
| GTA1-72B | 56.9 / 28.1 | 79.9 / 33.1 | 73.2 / 20.3 | 81.9 / 38.2 | 85.3 / 49.1 | 73.8 / 39.1 | 74.5 | 32.5 | 58.4 | RL |
| UI-Venus-Ground-7B | 60.4 / 21.9 | 74.7 / 24.1 | 63.1 / 14.7 | 76.4 / 31.8 | 75.7 / 41.5 | 49.5 / 22.5 | 67.1 | 24.3 | 50.8 | Ours (RL) |
| UI-Venus-Ground-72B | 66.5 / 29.7 | 84.4 / 33.1 | 73.2 / 30.8 | 84.7 / 42.7 | 83.1 / 60.4 | 75.7 / 36.0 | 77.4 | 36.8 | 61.9 | Ours (RL) |
Experimental results show that UI-Venus-Ground-72B achieves state-of-the-art performance on ScreenSpot-Pro with an average score of 61.9, while also setting new benchmarks on ScreenSpot-v2 (95.3), OSWorld-G (70.4), AgentCPM (85), and UI-Vision (36.8), highlighting its effectiveness in complex visual grounding and action prediction tasks.
We also release a compressed package of validation trajectories for AndroidWorld, including execution logs and navigation paths.

Download: UI-Venus-androidworld.zip
| Models | With Planner | A11y Tree | Screenshot | Success Rate (pass@1) |
|---|---|---|---|---|
| Closed-source Models | | | | |
| GPT-4o | ❌ | ✅ | ❌ | 30.6 |
| ScaleTrack | ❌ | ✅ | ❌ | 44.0 |
| SeedVL-1.5 | ❌ | ✅ | ✅ | 62.1 |
| UI-TARS-1.5 | ❌ | ❌ | ✅ | 64.2 |
| Open-source Models | | | | |
| GUI-Critic-R1-7B | ❌ | ✅ | ✅ | 27.6 |
| Qwen2.5-VL-72B* | ❌ | ❌ | ✅ | 35.0 |
| UGround | ✅ | ❌ | ✅ | 44.0 |
| Aria-UI | ✅ | ❌ | ✅ | 44.8 |
| UI-TARS-72B | ❌ | ❌ | ✅ | 46.6 |
| GLM-4.5v | ❌ | ❌ | ✅ | 57.0 |
| Ours | | | | |
| UI-Venus-Navi-7B | ❌ | ❌ | ✅ | 49.1 |
| UI-Venus-Navi-72B | ❌ | ❌ | ✅ | 65.9 |
Table: Performance comparison on AndroidWorld for end-to-end models. Our UI-Venus-Navi-72B achieves state-of-the-art performance, outperforming all baseline methods across different settings.
| Models | AndroidControl-Low Type Acc. | AndroidControl-Low Step SR | AndroidControl-High Type Acc. | AndroidControl-High Step SR | GUI-Odyssey Type Acc. | GUI-Odyssey Step SR |
|---|---|---|---|---|---|---|
| Closed-source Models | | | | | | |
| GPT-4o | 74.3 | 19.4 | 66.3 | 20.8 | 34.3 | 3.3 |
| Open-source Models | | | | | | |
| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 59.5 | 46.3 |
| SeeClick | 93.0 | 75.0 | 82.9 | 59.1 | 71.0 | 53.9 |
| OS-Atlas-7B | 93.6 | 85.2 | 85.2 | 71.2 | 84.5 | 62.0 |
| Aguvis-7B | - | 80.5 | - | 61.5 | - | - |
| Aguvis-72B | - | 84.4 | - | 66.4 | - | - |
| OS-Genesis-7B | 90.7 | 74.2 | 66.2 | 44.5 | - | - |
| UI-TARS-7B | 98.0 | 90.8 | 83.7 | 72.5 | 94.6 | 87.0 |
| UI-TARS-72B | 98.1 | 91.3 | 85.2 | 74.7 | 95.4 | 88.6 |
| GUI-R1-7B | 85.2 | 66.5 | 71.6 | 51.7 | 65.5 | 38.8 |
| NaviMaster-7B | 85.6 | 69.9 | 72.9 | 54.0 | - | - |
| UI-AGILE-7B | 87.7 | 77.6 | 80.1 | 60.6 | - | - |
| AgentCPM-GUI | 94.4 | 90.2 | 77.7 | 69.2 | 90.0 | 75.0 |
| Ours | | | | | | |
| UI-Venus-Navi-7B | 97.1 | 92.4 | 86.5 | 76.1 | 87.3 | 71.5 |
| UI-Venus-Navi-72B | 96.7 | 92.9 | 85.9 | 77.2 | 87.2 | 72.4 |
Table: Performance comparison on offline UI navigation datasets including AndroidControl and GUI-Odyssey. Note that models with * are reproduced.
If you find our work useful, please consider citing:

```bibtex
@misc{gu2025uivenustechnicalreportbuilding,
  title={UI-Venus Technical Report: Building High-performance UI Agents with RFT},
  author={Zhangxuan Gu and Zhengwen Zeng and Zhenyu Xu and Xingran Zhou and Shuheng Shen and Yunfei Liu and Beitong Zhou and Changhua Meng and Tianyu Xia and Weizhi Chen and Yue Wen and Jingya Dou and Fei Tang and Jinzhen Lin and Yulin Liu and Zhenlin Guo and Yichen Gong and Heng Jia and Changlong Gao and Yuan Guo and Yong Deng and Zhenyu Guo and Liang Chen and Weiqiang Wang},
  year={2025},
  eprint={2508.10833},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.10833},
}
```