About • Installation • How To Use • Credits • License
This repository contains an implementation of an intelligent voice assistant. The solution combines Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Large Language Model (LLM) systems.
The assistant is activated by a Keyword-Spotting (KWS) system with `sheila` as the target word. The user then says a query, and an ASR model converts the spoken query into text. The text query is given as input to an LLM, and its response is converted back to audio by a TTS system. After the audio playback finishes, the user can continue the dialogue; the LLM preserves the chat history.
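At a high level, the interaction loop looks roughly like the sketch below. Every helper here is a hypothetical stand-in for the repository's actual components:

```python
# A rough sketch of the assistant's loop. All helpers are hypothetical
# stand-ins; see the repository code for the real implementation.

def wait_for_keyword(word: str) -> None: ...   # KWS: block until wake word
def record_query() -> bytes: ...               # capture the spoken query
def asr(audio: bytes) -> str: ...              # ASR: speech -> text
def llm(history: list[dict]) -> str: ...       # LLM: answer given history
def tts(text: str) -> bytes: ...               # TTS: text -> speech
def play(audio: bytes) -> None: ...            # loudspeaker playback

def run_assistant() -> None:
    history: list[dict] = []  # the LLM preserves the chat history
    try:
        while True:
            wait_for_keyword("sheila")
            query = asr(record_query())
            history.append({"role": "user", "content": query})
            answer = llm(history)
            history.append({"role": "assistant", "content": answer})
            play(tts(answer))  # user continues once playback finishes
    except KeyboardInterrupt:
        pass  # Ctrl+C stops the assistant
```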
The version with the default choice of models works fast even on CPU! For better transcription quality, consider using a different ASR model from HuggingFace (e.g. `openai/whisper-large-v2` with a GPU instead of a CPU to keep it fast enough).
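As a sketch of the HuggingFace side of such a swap (how the model id is wired into this repository depends on its configs, so this only illustrates the `transformers` call; `query.wav` is a placeholder file):

```python
import torch
from transformers import pipeline

# Load a larger ASR model; a GPU is strongly recommended for this size.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    device=0 if torch.cuda.is_available() else -1,
)
print(asr("query.wav")["text"])  # transcribe an audio file
```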
See the LauzHack Workshop for a discussion of how to create intelligent voice assistants and of this repository (see also the Slides).
To install the assistant, follow these steps:
- (Optional) Create and activate a new environment using `conda` or `venv` (+ `pyenv`).

  a. `conda` version:

     ```bash
     # create env
     conda create -n project_env python=PYTHON_VERSION

     # activate env
     conda activate project_env
     ```

  b. `venv` (+ `pyenv`) version:

     ```bash
     # create env
     ~/.pyenv/versions/PYTHON_VERSION/bin/python3 -m venv project_env

     # alternatively, using the default python version
     python3 -m venv project_env

     # activate env
     source project_env/bin/activate
     ```

- Install all required packages:

  ```bash
  pip install -r requirements.txt
  ```

- (Optional) Install `pre-commit`:

  ```bash
  pre-commit install
  ```

- Create an API key in Groq. Create a new file named `.env` in the root directory and copy-paste your API key into it (a sketch of how the key is typically read follows this list).
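A minimal sketch of how such a key is typically loaded with `python-dotenv` and the `groq` client (the variable name `GROQ_API_KEY` is an assumption; check the repository code for the exact name it expects):

```python
import os

from dotenv import load_dotenv
from groq import Groq

load_dotenv()  # reads the .env file, e.g. GROQ_API_KEY=gsk_...
client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Quick smoke test of the key (model id as referenced in this README).
reply = client.chat.completions.create(
    model="llama-3-8b-8192",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(reply.choices[0].message.content)
```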
To record and play sound, you need to define your hardware settings. See the PyTorch documentation (the information about `ffmpeg` specifically) and this tutorial for more details. Usually, the format is `alsa` for Linux systems and `avfoundation` for Mac systems. For the reader `source` and the writer `dst`, the `default` option usually works (so it might be enough to change only the format in your case).
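To sanity-check your microphone settings before launching the bot, you can try a minimal `torchaudio` snippet along these lines (the `src`/`format` values below are Linux-style assumptions; on Mac try, e.g., `src=":0"` with `format="avfoundation"`):

```python
from torchaudio.io import StreamReader

# Open the microphone through ffmpeg and pull a few chunks of audio.
reader = StreamReader(src="default", format="alsa")
reader.add_basic_audio_stream(frames_per_chunk=16000, sample_rate=16000)

for i, (chunk,) in enumerate(reader.stream()):
    print(f"chunk {i}: {tuple(chunk.shape)}")  # (frames, channels)
    if i == 4:
        break  # five one-second chunks are enough for a check
```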
Once the hardware settings are known, you can start the AI AudioBot using this command:
```bash
python3 run.py stream_reader.source=YOUR_MICROPHONE \
    stream_reader.format=YOUR_FORMAT \
    stream_writer.dst=YOUR_LOUDSPEAKER \
    stream_writer.format=YOUR_FORMAT
```
You can also change other parameters via Hydra options; see `src/configs/audio_bot.yaml`. For example, you can change the LLM model and the maximum number of output tokens:
```bash
python3 run.py llm.model_id="mixtral-8x7b-32768" llm.max_tokens=256
```
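If you prefer to set overrides programmatically, Hydra's compose API offers an equivalent (a sketch assuming the config lives at `src/configs/audio_bot.yaml`, as referenced above; note that `config_path` is resolved relative to the calling file):

```python
from hydra import compose, initialize

# Compose the config with the same overrides as the CLI example above.
with initialize(version_base=None, config_path="src/configs"):
    cfg = compose(
        config_name="audio_bot",
        overrides=["llm.model_id=mixtral-8x7b-32768", "llm.max_tokens=256"],
    )
print(cfg.llm.model_id, cfg.llm.max_tokens)
```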
Use a keyboard interrupt (`Ctrl+C`) to stop the assistant.
HuggingFace models were used for ASR and TTS (a Spectrogram Generator and a Vocoder). The Groq API with the llama-3-8b-8192 model was used for the LLM. The KWS model is taken from the 2022 version of the HSE DLA Course.