About • Installation • How To Use • Credits • License
This repository contains an implementation of an intelligent voice assistant. The solution combines Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Large Language Model (LLM) systems.
The assistant is activated by a Keyword-Spotting (KWS) system with `sheila` as the target word. The user then says a query, and an ASR model converts the spoken query into text. The text query is given as input to an LLM, and its response is converted back to audio by a TTS system. After the audio playback finishes, the user can continue the dialogue; the LLM preserves the chat history.
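At a high level, the interaction loop looks roughly like the sketch below. Every helper here is a hypothetical stand-in for the repository's actual components:

```python
# A rough sketch of the assistant's loop. All helpers are hypothetical
# stand-ins; see the repository code for the real implementation.

def wait_for_keyword(word: str) -> None: ...   # KWS: block until wake word
def record_query() -> bytes: ...               # capture the spoken query
def asr(audio: bytes) -> str: ...              # ASR: speech -> text
def llm(history: list[dict]) -> str: ...       # LLM: answer given history
def tts(text: str) -> bytes: ...               # TTS: text -> speech
def play(audio: bytes) -> None: ...            # loudspeaker playback

def run_assistant() -> None:
    history: list[dict] = []  # the LLM preserves the chat history
    try:
        while True:
            wait_for_keyword("sheila")
            query = asr(record_query())
            history.append({"role": "user", "content": query})
            answer = llm(history)
            history.append({"role": "assistant", "content": answer})
            play(tts(answer))  # user continues once playback finishes
    except KeyboardInterrupt:
        pass  # Ctrl+C stops the assistant
```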
The version with the default choice of models works fast even on CPU! For better transcription quality, consider using a different ASR model from HuggingFace (e.g. `openai/whisper-large-v2` with a GPU instead of a CPU to keep it fast enough).
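As a sketch of the HuggingFace side of such a swap (how the model id is wired into this repository depends on its configs, so this only illustrates the `transformers` call; `query.wav` is a placeholder file):

```python
import torch
from transformers import pipeline

# Load a larger ASR model; a GPU is strongly recommended for this size.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    device=0 if torch.cuda.is_available() else -1,
)
print(asr("query.wav")["text"])  # transcribe an audio file
```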
See the LauzHack Workshop for a discussion of how to create intelligent voice assistants and of this repository (see also the Slides).
To install the assistant, follow these steps:
- (Optional) Create and activate a new environment using `conda` or `venv` (+ `pyenv`).

  a. `conda` version:

     ```bash
     # create env
     conda create -n project_env python=PYTHON_VERSION

     # activate env
     conda activate project_env
     ```

  b. `venv` (+ `pyenv`) version:

     ```bash
     # create env
     ~/.pyenv/versions/PYTHON_VERSION/bin/python3 -m venv project_env

     # alternatively, using the default python version
     python3 -m venv project_env

     # activate env
     source project_env/bin/activate
     ```

- Install all required packages:

  ```bash
  pip install -r requirements.txt
  ```

- (Optional) Install `pre-commit`:

  ```bash
  pre-commit install
  ```

- Create an API key in Groq. Create a new file named `.env` in the root directory and copy-paste your API key into it (a sketch of how the key is typically read follows this list).
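A minimal sketch of how such a key is typically loaded with `python-dotenv` and the `groq` client (the variable name `GROQ_API_KEY` is an assumption; check the repository code for the exact name it expects):

```python
import os

from dotenv import load_dotenv
from groq import Groq

load_dotenv()  # reads the .env file, e.g. GROQ_API_KEY=gsk_...
client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Quick smoke test of the key (model id as referenced in this README).
reply = client.chat.completions.create(
    model="llama-3-8b-8192",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(reply.choices[0].message.content)
```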
To record and play sound, you need to define your hardware settings. See the PyTorch documentation (the information about `ffmpeg` specifically) and this tutorial for more details. Usually, the format is `alsa` for Linux systems and `avfoundation` for Mac systems. For the reader `source` and the writer `dst`, the `default` option usually works (so it might be enough to change only the format in your case).
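To sanity-check your microphone settings before launching the bot, you can try a minimal `torchaudio` snippet along these lines (the `src`/`format` values below are Linux-style assumptions; on Mac try, e.g., `src=":0"` with `format="avfoundation"`):

```python
from torchaudio.io import StreamReader

# Open the microphone through ffmpeg and pull a few chunks of audio.
reader = StreamReader(src="default", format="alsa")
reader.add_basic_audio_stream(frames_per_chunk=16000, sample_rate=16000)

for i, (chunk,) in enumerate(reader.stream()):
    print(f"chunk {i}: {tuple(chunk.shape)}")  # (frames, channels)
    if i == 4:
        break  # five one-second chunks are enough for a check
```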
Once the hardware settings are known, you can start the AI AudioBot using this command:
```bash
python3 run.py stream_reader.source=YOUR_MICROPHONE \
    stream_reader.format=YOUR_FORMAT \
    stream_writer.dst=YOUR_LOUDSPEAKER \
    stream_writer.format=YOUR_FORMAT
```
You can also change other parameters via Hydra options; see `src/configs/audio_bot.yaml`. For example, you can change the LLM model and the maximum number of output tokens:
```bash
python3 run.py llm.model_id="mixtral-8x7b-32768" llm.max_tokens=256
```
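If you prefer to set overrides programmatically, Hydra's compose API offers an equivalent (a sketch assuming the config lives at `src/configs/audio_bot.yaml`, as referenced above; note that `config_path` is resolved relative to the calling file):

```python
from hydra import compose, initialize

# Compose the config with the same overrides as the CLI example above.
with initialize(version_base=None, config_path="src/configs"):
    cfg = compose(
        config_name="audio_bot",
        overrides=["llm.model_id=mixtral-8x7b-32768", "llm.max_tokens=256"],
    )
print(cfg.llm.model_id, cfg.llm.max_tokens)
```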
Use a keyboard interrupt (`Ctrl+C`) to stop the assistant.
HuggingFace models were used for ASR and TTS (a Spectrogram Generator and a Vocoder). The Groq API with the llama-3-8b-8192 model was used for the LLM. The KWS model is taken from the 2022 version of the HSE DLA Course.