AI Vision Companion, distributed as an F5-TTS fork for convenience.
This repository features an AI vision companion/assistant that merges visual input capture with audio transcription and synthesis through various APIs and libraries. The script detects microphone input, transcribes it, processes vision input from a specified window, creates a highly detailed caption with Florence-2, and produces responses using a Large Language Model (OpenAI API) and Text-To-Speech (F5-TTS).
- Near real-time interaction.
- Multiple monitor support.
- Captures and processes vision locally from a specified window.
- Transcribes audio input locally using the Whisper large-v3 model (via faster-whisper).
- Synthesizes responses locally using the F5-TTS text-to-speech model.
- Support for GPU acceleration using CUDA.
- Windows OS
- Python 3.10 or higher
- CUDA-compatible GPU (NVIDIA RTX 3090/4090 recommended for faster processing)
- Microphone set as the default input device in the system settings.
The following environment configuration was used for testing: Windows 10 Pro x64, Python 3.10.11 64-bit, and CUDA 11.8.
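Once the audio dependencies are installed, you can quickly check which microphone Windows treats as the default input device. A minimal sketch, assuming the `sounddevice` package (the script itself may use a different audio backend):

```python
# List all audio devices and show the current default input device
# (assumes: pip install sounddevice).
import sounddevice as sd

print(sd.query_devices())              # all devices; the default input is marked with ">"
print(sd.query_devices(kind="input"))  # details of the default input device
```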
Create a Python 3.10 conda environment (you could also use virtualenv):
```
conda create -n visioncompanion python=3.10
conda activate visioncompanion
```
First, make sure you have git-lfs installed (https://git-lfs.com):

```
git lfs install
```
Install F5-TTS using the commands below. You can install it in a different directory by first changing into that directory with the `cd` command (e.g., `cd c:\`).
```
git clone https://github.com/Vinventive/F5-TTS-AIVisionCompanion.git
cd F5-TTS-AIVisionCompanion
pip install -e .
```
Download the F5 `model_1200000.safetensors` model checkpoint from https://huggingface.co/SWivid/F5-TTS and place it inside the `ckpts` folder (e.g., `C:/F5-TTS-AIVisionCompanion/ckpts/F5TTS_Base/model_1200000.safetensors`).
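Alternatively, you can fetch the checkpoint programmatically; a sketch assuming the `huggingface_hub` package, with the repo ID and filename taken from the link and path above:

```python
# Download the checkpoint into ./ckpts so the path matches the example above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="SWivid/F5-TTS",
    filename="F5TTS_Base/model_1200000.safetensors",
    local_dir="./ckpts",
)
print(path)  # e.g., ckpts/F5TTS_Base/model_1200000.safetensors
```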
Install torch matching your CUDA version, e.g.:

```
pip install torch==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install torchvision==0.18.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
```
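To confirm the CUDA build of torch is active before going further, a quick sanity check:

```python
# Verify torch sees the GPU; the version string should carry a "+cu118" suffix.
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```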
Install the required libraries using `pip`:

```
pip install -r requirements.txt
```
Rename the `.env.example` file to `.env` and keep it in the root directory of the project. Fill in the `OPENAI_API_KEY=xxxxxxxx...` variable with your own API key; the voice sample and voice sample transcription can be left unchanged. Use a 2-15 second voice sample in .wav format. If you swap the voice sample, make sure to update the filename in `F5_TTS_SAMPLE` and write an audio transcription of the new sample in `F5_TTS_REF_TEXT`.
```
OPENAI_API_KEY=your_openai_api_key
F5_TTS_PATH="./..."
F5_TTS_SAMPLE="./sample.wav"
F5_TTS_REF_TEXT="How introverts make friends?"
VISION_KEYWORDS=scene,sight,video,frame,activity,happen,going
```
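If you swap in your own voice sample, you can verify that it falls within the 2-15 second range before editing `.env`. A minimal sketch, assuming the `soundfile` package and the default `./sample.wav` filename:

```python
# Check the duration of the reference voice sample.
import soundfile as sf

info = sf.info("./sample.wav")  # path should match F5_TTS_SAMPLE
print(f"{info.duration:.1f} s at {info.samplerate} Hz")
assert 2 <= info.duration <= 15, "voice sample should be 2-15 seconds long"
```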
Run the script:

```
python visioncompanion.py
```
When running the script for the first time, it might take a while to download the speech recognition model `faster-whisper-large-v3` as well as the image captioning model `florence-2-base-ft` for local use.
```
Using cuda:0 device
Loading Florence-2 model...
Florence-2 model loaded.
Do you want to launch the application with the preview window? (y/n): y
Enter the title of the window:
```
The script will prompt you to enter the title of the window you want to capture.
You can specify a window by typing a simple keyword like `calculator` or `minecraft`. The search is not case-sensitive and will match windows whose titles contain the provided keyword, even if the keyword is an incomplete word, prefix, or suffix. If you want to capture the view from your web browser's window and switch between different tabs, you can use simple keywords like `chrome` or `firefox`. If you have multiple instances of an app open on multiple displays and want to target a specific tab, use keywords like `youtube` or `twitch`.
Make sure the captured window is always in the foreground and active, not minimized or in the background. (If the window is minimized, the script will attempt to maximize it.)
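For illustration, a keyword-based window lookup along the lines described above could look like the following; this is a sketch assuming the `pygetwindow` package, not necessarily the script's actual implementation:

```python
# Case-insensitive substring match over all window titles (Windows only).
import pygetwindow as gw

def find_window(keyword: str):
    keyword = keyword.lower()
    for title in gw.getAllTitles():
        if title and keyword in title.lower():
            return gw.getWindowsWithTitle(title)[0]
    return None

win = find_window("youtube")
if win and win.isMinimized:
    win.maximize()  # mirrors the maximize-if-minimized behavior noted above
```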
3. Wait until the vision capture completes collecting the initial sequence of frames and speech recognition becomes active.
```
Enter the title of the window: youtube
Window title set to: youtube
Starting continuous audio recording...
```
4. Start a conversation with the assistant.
5. Seamlessly talk about your view by naturally using vision-activating keywords during the conversation.
```
User: Hi there, how are you doing?
Assistant: Just hanging out, enjoying the vibes.
Converting audio...
User: Can you describe what you can see on the screen?
Assistant: There's an anime girl with long blonde hair, looking stylish in a red dress.
Converting audio...
User: How can I say it in Spanish?
Assistant: "Chica con cabello rubio y vestido rojo." Simple and stylish!
Converting audio...
```
The keywords trigger an analysis of the frame sequence from the last 10 seconds: `scene`, `sight`, `video`, `frame`, `activity`, `happen`, `going`. You can add or remove keywords in the `.env` file.
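For illustration, the keyword trigger could be as simple as a substring check against the transcribed speech; a sketch under that assumption (the script's actual logic may differ):

```python
# Read VISION_KEYWORDS from the environment and test a transcript against it.
import os

KEYWORDS = [kw.strip() for kw in os.getenv("VISION_KEYWORDS", "").split(",") if kw.strip()]

def needs_vision(transcript: str) -> bool:
    text = transcript.lower()
    return any(kw in text for kw in KEYWORDS)

print(needs_vision("Can you describe what is happening in the video?"))  # True ("happen", "video")
```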
If you receive the response `Assistant: Error in generating response`, make sure to update your OpenAI API key inside `.env`. Your API key might be incorrect or missing; this variable cannot be empty.
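To test your key outside the application, a minimal sketch assuming the `openai` Python package (v1+) with `OPENAI_API_KEY` exported in your shell environment:

```python
# A successful model listing confirms the key is valid and non-empty.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
print(client.models.list().data[0].id)
```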
- F5-TTS: original repository for F5-TTS (source code)
- AIVisionCompanion: original repository for AIVisionCompanion
- New Test Voice Sample was generated with AI (feel free to use for testing)
- E2-TTS brilliant work, simple and effective
- Emilia, WenetSpeech4TTS valuable datasets
- lucidrains' initial CFM structure, with bfs18 for discussion
- SD3 & Hugging Face diffusers DiT and MMDiT code structure
- torchdiffeq as ODE solver, Vocos as vocoder
- FunASR, faster-whisper, UniSpeech for evaluation tools
- ctc-forced-aligner for speech edit test
- mrfakename huggingface space demo ~
- f5-tts-mlx Implementation with MLX framework by Lucas Newman
- F5-TTS-ONNX ONNX Runtime version by DakeQQ
If our work and codebase are useful to you, please cite:
```bibtex
@article{chen-etal-2024-f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024},
}
```
Our code is released under the MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.