AutoSteer is a plug-and-play safety steering framework for multimodal large language models (MLLMs), designed to reduce harmful outputs during inference through steer matrix training, prober evaluation, and model output adjustment.
Chameleon
Create the Conda environment from the provided configuration:
conda env create -f chameleon_environment.yml
conda activate ANOLE
pip install -r chameleon_requirements.txt
Download the model: Anole or Chameleon
git lfs install
git clone https://huggingface.co/GAIR/Anole-7b-v0.1
or
huggingface-cli download --resume-download GAIR/Anole-7b-v0.1 --local-dir Anole-7b-v0.1 --local-dir-use-symlinks False
Llava-OneVision
Create the Conda environment from the provided configuration:
conda env create -f llava_environment.yml
conda activate llava
pip install -r llava_requirements.txt
Use the following code snippet to download the model:
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration, LlavaOnevisionConfig
import torch

# Local directory where the downloaded weights will be cached.
cache_path = "your/model/path/llava-next-8b"

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", cache_dir=cache_path)
config = LlavaOnevisionConfig.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", cache_dir=cache_path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-7b-ov-hf",
    config=config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda:1",  # adjust to the GPU you want to use
    cache_dir=cache_path,
)
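To confirm the model loads and generates correctly, you can run a minimal sanity check such as the one below (a sketch using the standard Hugging Face chat-template interface; the image URL and prompt are only examples):

# Minimal sanity check (illustrative): one round of generation with the model loaded above.
from PIL import Image
import requests

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))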
You can download the pretrained Steer Matrix and Prober checkpoints used in our experiments, or train them from scratch yourself; instructions for both are given in the pipelines below.
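Once downloaded, the checkpoints can be inspected with plain PyTorch before wiring them into the configs; the file names below are placeholders, not the actual names in the release:

# Illustrative only: substitute the actual file names from the released checkpoints.
import torch

steer_matrix = torch.load("steer_para/steer_matrix.pt", map_location="cpu")  # hypothetical file name
prober_ckpt = torch.load("prober/prober.pt", map_location="cpu")             # hypothetical file name
print(type(steer_matrix), type(prober_ckpt))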
- COCO train2017: download and place in AutoSteer_final/dataset/COCO
- COCO2014: download from the linked source and place in AutoSteer_final/dataset/COCO2014
- NSFW-test-porn: download the NSFW Image Classification dataset from Kaggle, take the "porn" class from the test set, and use it to replace the empty directory AutoSteer_final/dataset/ToViLaG/porn
- UCLA-protest: request the dataset released with Won et al., Protest Activity Detection and Perceived Violence Estimation from Social Media Images, ACM Multimedia 2017. Use the "protest" class from both the train and test sets, then rename and place it into AutoSteer_final/dataset/ToViLaG/protest
- Bloody Images: contact the ToViLaG authors to obtain the bloody images and use them to replace AutoSteer_final/dataset/ToViLaG/bloody
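After all datasets are in place, a quick file count per directory helps catch incomplete downloads. This helper is not part of the repository, and the root path should be adjusted to your checkout:

# Not part of the repository: report how many entries each dataset directory contains.
from pathlib import Path

root = Path("AutoSteer_final/dataset")  # adjust to your checkout
for sub in ["COCO", "COCO2014", "ToViLaG/porn", "ToViLaG/protest", "ToViLaG/bloody"]:
    d = root / sub
    if d.exists():
        print(f"{d}: {sum(1 for _ in d.rglob('*'))} entries")
    else:
        print(f"{d}: missing")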
Edit AutoSteer_final/source/steer/SteerChameleon/constants.py:
ckpt_path = "ANOLE/Anole-7b-v0.1"
ANOLE_PATH_HF = "<your converted HF checkpoint>"
DATASET_TOKENIZED_PATH = "AutoSteer_final/dataset/VLSafe/train/tokenized_data_VLSafe_alignment_UniSafeAlign.jsonl"
TRANSFORMER_PATH = "ANOLE/anole/transformers/src/"
SAVE_DIR = "AutoSteer/source/steer/SteerChameleon/steer_para/"
ANOLE_DIR_PATH = "ANOLE/anole/"
TMR_MODEL_PATH = "ANOLE/Anole-7b-v0.1/models/7b"
STEER_MATRIX_PATH = "<path to inference-time steer matrix>"
All of the above paths should be absolute.
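For example, if the repositories are checked out under /home/user, the file might look like the following (every value, including the steer-matrix file name, is illustrative and should be replaced with your own paths):

# Illustrative values only -- substitute the absolute paths on your machine.
ckpt_path = "/home/user/ANOLE/Anole-7b-v0.1"
ANOLE_PATH_HF = "/home/user/ANOLE/Anole-7b-v0.1-hf"  # your converted HF checkpoint
DATASET_TOKENIZED_PATH = "/home/user/AutoSteer_final/dataset/VLSafe/train/tokenized_data_VLSafe_alignment_UniSafeAlign.jsonl"
TRANSFORMER_PATH = "/home/user/ANOLE/anole/transformers/src/"
SAVE_DIR = "/home/user/AutoSteer/source/steer/SteerChameleon/steer_para/"
ANOLE_DIR_PATH = "/home/user/ANOLE/anole/"
TMR_MODEL_PATH = "/home/user/ANOLE/Anole-7b-v0.1/models/7b"
STEER_MATRIX_PATH = "/home/user/AutoSteer/source/steer/SteerChameleon/steer_para/steer_matrix.pt"  # hypothetical file name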
Edit AutoSteer_final/source/LayerSelect/Chameleon/constants.py:
ANOLE_DIR_PATH = "ANOLE/anole/"
ckpt_path = "ANOLE/Anole-7b-v0.1"
All of the above paths should be absolute.
Edit AutoSteer_final/source/steer/SteerLlava/constants.py:
model_path = "your/model/path/llava-next-8b"
SAVE_DIR = "AutoSteer/source/steer/SteerLlava/steer_para/"
toxic_PIC_dataset_pth = "AutoSteer_final/dataset/ToViLaG/Mono_NontoxicText_ToxicImg_1000Samples_porn_bloody_train.jsonl"
STEER_MATRIX_PATH = "<path to inference-time steer matrix>"
All of the above paths should be absolute.
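Before launching the pipelines, a quick check that every configured path is absolute and exists can save a failed run. The snippet below is not part of the repository, and the listed entries are examples to be replaced with the values you set in the constants.py files:

# Not part of the repository: verify configured paths are absolute and exist.
import os

configured_paths = [
    "/home/user/ANOLE/Anole-7b-v0.1",                            # ckpt_path (example value)
    "/home/user/your/model/path/llava-next-8b",                  # model_path (example value)
    "/home/user/AutoSteer/source/steer/SteerLlava/steer_para/",  # SAVE_DIR (example value)
]

for p in configured_paths:
    if os.path.isabs(p) and os.path.exists(p):
        print(f"{p}: ok")
    else:
        print(f"{p}: missing or not absolute")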
The following pipelines cover the full process, including steer matrix training, layer selection, prober training, and evaluation.
You may customize the execution based on your needs.
If you plan to use our pretrained steer matrix and prober checkpoints,
you can skip the following training-related scripts:
- For Chameleon:
bash train_chameleon_steer.bash
bash chameleon_prober_pipeline.bash
- For Llava-OneVision:
bash train_llava_steer.bash
bash llava_prober_pipeline.bash
You can directly run the test_*.bash scripts to evaluate detoxification and general capabilities.
For Chameleon:
conda activate ANOLE
cd AutoSteer_final/source
bash train_chameleon_steer.bash
bash select_layer_chameleon.bash
bash chameleon_prober_pipeline.bash
bash test_chameleon.bash
For Llava-OneVision:
conda activate llava
cd AutoSteer_final/source
bash train_llava_steer.bash
bash select_layer_llava.bash
bash llava_prober_pipeline.bash
bash test_llava.bash
AutoSteer_final/
├── dataset/
│   ├── COCO/
│   ├── COCO2014/
│   ├── MMMU/
│   ├── RQA/
│   ├── VLSafe/
│   │   └── train/
│   └── ToViLaG/
│       ├── porn/
│       ├── protest/
│       └── bloody/
├── source/
│   ├── steer/
│   │   ├── SteerChameleon/
│   │   ├── SteerLlava/
│   │   └── ...
│   ├── LayerSelect/
│   │   ├── Chameleon/
│   │   ├── Llava/
│   │   └── ...
│   └── logs/
│       └── ...
└── model_output/
    └── ...
For any questions or issues, feel free to open an issue or contact us via lyuchengwu@zju.edu.cn / lyuchengwu@u.nus.edu or shumin@nus.edu.sg / 231sm@zju.edu.cn.
Thank you very much for your interest in our work. If you use or extend our work, please cite the following paper:
@inproceedings{EMNLP2025_AutoSteer,
author = {Lyucheng Wu and
Mengru Wang and
Ziwen Xu and
Tri Cao and
Nay Oo and
Bryan Hooi and
Shumin Deng},
title = {Automating Steering for Safe Multimodal Large Language Models},
booktitle = {{EMNLP} {(1)}},
publisher = {Association for Computational Linguistics},
year = {2025},
url = {https://arxiv.org/abs/2507.13255}
}
We sincerely thank Xinpeng Wang and Donghyeon Won for providing parts of the datasets used in this project, which originate from their works: ToViLaG and protest-detection-violence-estimation.
We also gratefully acknowledge Dr. Peixuan Han for the open-source prober implementation from his work SafeSwitch, which served as a valuable reference during our development.
This project is released under the MIT License. See LICENSE for details.