
ESP32-based voice device for chatting with multiple custom AI bots. Recording questions with I2S microphone, transcribing via ElevenLabs or Deepgram STT, creating response with Groq or Open AI LLM. TTS audio output with custom AI voices via I2S & speaker. Supporting ongoing dialogues, calling bots ‘by name’, real-time web search via keyword.


kaloprojects/KALO-ESP32-Voice-Chat-AI-Friends


Summary

ESP32-based voice chat dialog device, successor of the earlier project KALO-ESP32-Voice-ChatGPT. With the latest August 2025 update, the device supports multiple custom chatbots/FRIENDS (similar to Open AI's Custom GPTs or Google's Gems). Just call any FRIEND by name: the device activates that AI personality (custom system prompt) and answers with the friend's assigned voice.

Users ask questions and continue the conversation via microphone (pressing a button or touch pin for as long as they speak, no length limit, dynamic duration). The code supports ongoing dialog sessions, keeping and sending the complete chat history. The 'Chat Completions' workflow allows human-like ongoing dialogs with chat history and follow-up questions. Example: Q1: "Who was Albert Einstein?" and later (after the LLM response) Q2: "Was he also a musician and did he have kids?".

The device is multi-lingual by default, i.e. each chatbot/FRIEND automatically understands and speaks multiple languages; no changes in code (or system prompts) are needed. Mixed usage is also supported (changing the language within the same dialog session). Supported languages (Aug. 2025): 99 languages for STT, 57 languages for TTS.

New since August 2025: support for 1-N chatbots/FRIENDS with user-defined personalities (system prompts); custom TTS voice parameters allow a different voice to be assigned to each friend. LLM response latency has improved significantly (about 2x faster than before) by using GroqCloud API services. The Groq server API also allows LLM models from different sources (e.g. Meta, DeepSeek, Open AI). The project name changed from KALO-ESP32-Voice-ChatGPT (supporting Open AI only) to KALO-ESP32-Voice-AI_Friends (multiple models, multiple custom chatbots).

The included chatbot friends serve as templates for your own custom chatbots. Coded examples: ONYX (role of a 'good old friend'), FRED (a constantly annoyed guy), GlaDOS (the aggressive, egocentric bot), or VEGGI (best friend of vegan and healthy food). You could start a virtual conversation e.g. with a human warm-up question: "Hi my friend, tell me, how was your week, any exciting stories?", or wake up another friend with a statement like "Hi FRED, are you online?"

Since June 2025: Live information requests (real-time web searches) are supported. A user-defined keyword (e.g. GOOGLE) switches the LLM to a web-search model as part of the memorized chat dialog. Examples: "Will it rain in my region tomorrow? Please ask Google!", or "Please check with Google, what are the latest projections for the elections tomorrow?". Mixed usage of both models is supported, allowing follow-up requests (e.g. "Please summarize the search in a few sentences, skip any boring details!") also on previous web searches. The keyword GOOGLE works with all AI chatbots/FRIENDS.
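Conceptually, the keyword toggle is just a check on the STT transcription before the LLM request is sent. A minimal illustration (the variable names are placeholders, not the repo's identifiers; the two model names are the project defaults mentioned further below):

```cpp
// Illustrative keyword routing only (placeholder variable names)
String lower = transcription;
lower.toLowerCase();

String model = "llama-3.1-8b-instant";            // default CHAT model (GroqCloud)
if (lower.indexOf("google") >= 0) {               // user-defined web-search keyword
  model = "gpt-4o-mini-search-preview";           // Open AI web-search model
}
```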

Architecture: Everything is coded in native C++ (no server-based components, no Node.js or Python scripts, no websockets); audio recording and transcription are coded natively in C++ for I2S devices (microphone and speaker). The WiFi-connected ESP32 chat device can be used stand-alone (no Serial Monitor, keyboard or connected computer needed). The sketch can also be used as a text chat device (no voice recording, STT or TTS needed), entering requests via a Terminal App (e.g. PuTTY) or the Serial Monitor.

Major update summaries: The August 2025 update added multiple custom chatbots/FRIENDS and 2x faster LLM responses, using the Groq server with multiple models. The June 2025 update added PSRAM support (as an alternative to SD card), ESP32-S3 support, ElevenLabs STT for 5-10x faster SpeechToText transcription, and support for the additional Elato AI ESP32-S3 hardware. The April 2025 update added the Open AI Web Search LLM model, bringing current and location-related live information capabilities (real-time web searches) into chat dialogs. The March 2025 update added hardware support for the Techiesms ESP32 Portable AI Voice Assistant.

Workflow

Explore the details in the .ino libraries; in a nutshell (a minimal loop sketch follows the list):

  • Recording user voice with variable length (hold a button while speaking), storing it as a .wav file (with 44-byte header) in PSRAM or on SD card
  • Users can also enter LLM requests as text via the Serial Monitor input line or COM: terminal apps (e.g. PuTTY)
  • Sending the recorded WAV file to the STT (SpeechToText) server, using the fast ElevenLabs API (or the slower Deepgram)
  • Sending the transcription to the Open AI or GroqCloud server (with user-specified LLM models) for CHAT and WEB SEARCH
  • Receiving the AI response, printing it in the Serial Monitor, and speaking it with a human-like (multi-lingual) Open AI voice
  • RGB LED indicating status: GREEN=Ready -> RED=Recording -> CYAN=STT -> BLUE=LLM AI CHAT -> PINK=Open AI WEB -> YELLOW=Audio pending -> PINK=TTS Speaking. Short WHITE flashes indicate success, RED flashes indicate keyword detection. New: double RED flashes on waking up another FRIEND
  • Button: PRESS & HOLD for recording; a short PRESS interrupts TTS/audio playback OR repeats the last answer (when silent)
  • Pressing the button again proceeds in the loop for an ongoing chat.
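In code terms, the whole chain runs as one blocking loop on the ESP32. The snippet below is a minimal structural sketch for orientation only: Recording_Stop() and audio_play.openai_speech() are names actually used in this project (see the other sections), while PIN_RECORD_BTN, OPENAI_KEY, Recording_Start(), SpeechToText() and LLM_Chat() are simplified placeholders for the functions spread across the lib_*.ino tabs. The openai_speech() parameter list shown matches older AUDIO.H versions and may differ on yours.

```cpp
// Minimal structural sketch of the chat loop (placeholder names, not the repo's exact code)
#include "Audio.h"                    // ESP32-audioI2S, used for TTS playback

extern Audio audio_play;              // I2S speaker object, configured in setup()

void loop() {
  if (digitalRead(PIN_RECORD_BTN) == LOW) {            // PRESS & HOLD to record (active low assumed)
    Recording_Start();                                 // I2S mic -> PSRAM or SD card (.wav)
    while (digitalRead(PIN_RECORD_BTN) == LOW) delay(10);
    Recording_Stop();                                  // finalize the 44-byte WAV header

    String question = SpeechToText();                  // ElevenLabs (or Deepgram) STT
    String answer   = LLM_Chat(question);              // Groq / Open AI, full chat history sent
    Serial.println(answer);

    // Speak the answer via Open AI TTS (parameter count depends on AUDIO.H version)
    audio_play.openai_speech(OPENAI_KEY, "tts-1", answer.c_str(), "onyx", "mp3", "1");
  }
  audio_play.loop();                                   // keep the TTS audio stream running
}
```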

Hardware requirements

  • Recommended: ESP32 with (!) PSRAM (tested with ESP32-WROVER and ESP32-S3), no SD card needed
  • Alternatively: ESP32 (e.g. ESP32-WROOM-32) with a connected Micro SD card (VSPI default pins 5, 18, 19, 23)
  • I2S digital microphone, e.g. INMP441 [I2S pins 22, 33, 35]
  • I2S audio amplifier, e.g. MAX98357A [I2S pins 25, 26, 27] with speaker (see the illustrative pin map after this list)
  • RGB status LED and optionally (recommended) an analog potentiometer (for audio volume)
  • Ready-to-go devices (examples) with ESP32 & SD card reader: Techiesms Portable AI Voice Assistant
  • Ready-to-go devices (examples) with ESP32-S3 (PSRAM): Elato AI DIY, Elato AI devices.
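For orientation, the pins listed above could map to #define lines like the following. This is an illustration only: the macro names are invented here, and the exact assignment of the three I2S microphone lines to GPIO 22/33/35 is defined in the sketch headers (lib_audio_recording.ino and the main sketch) and may differ.

```cpp
// Illustrative pin map only; actual macro names and I2S line assignments
// are defined in the sketch headers and may differ.
#define I2S_MIC_SCK    22   // I2S microphone (e.g. INMP441), pins 22/33/35
#define I2S_MIC_WS     33
#define I2S_MIC_SD     35   // GPIO 35 is input-only, fine for mic data in

#define I2S_AMP_BCLK   25   // I2S amplifier (e.g. MAX98357A), pins 25/26/27
#define I2S_AMP_LRC    26
#define I2S_AMP_DOUT   27

#define SD_CS           5   // Micro SD card, VSPI default pins (boards without PSRAM)
#define SD_SCK         18
#define SD_MISO        19
#define SD_MOSI        23
```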

API Keys (Registration needed)

  • STT (fast): ElevenLabs API KEY, link: ElevenLabs (free STT includes 2.5 h/month).
    Alternative (slower STT): Deepgram API KEY, link: Deepgram ($200 free)
  • LLM & TTS: Open AI API KEY needed (the same API KEY is used for LLM & TTS), registration: Open AI account ($5 free)
  • GroqCloud LLM (fast): GroqCloud API KEY needed, registration: GroqCloud (free account usable, token limited).

Library Dependencies

  • KALO-ESP32-Voice-Chat-AI-Friends does not require any 3rd-party library zip files to be installed (except AUDIO.H!); all functions in the lib_xy.ino files are self-coded (WiFiClientSecure.h / i2s_std.h / SD.h are part of the esp32 core libraries). AUDIO.H is needed only for TTS audio playback (not for recording & transcription); no AUDIO.H is needed in the 'lib_xy.ino' libraries
  • ESP32 core library (Arduino IDE): use the latest arduino-esp32, e.g. 3.2.0 (based on ESP-IDF 5.4.1) or later
  • AUDIO.H library / ESP32 with PSRAM: install the latest ESP32-audioI2S zip, version 3.3.0 or later
  • AUDIO.H library / ESP32 without PSRAM: IMPORTANT! Current AUDIO.H releases require PSRAM; ESP32 boards without PSRAM are no longer supported. You therefore need to install the last version that did not require PSRAM; the recommended version is 3.0.11g (from July 18, 2024). Mirror link to the 3.0.11g version here
  • Last but not least: a big THANK YOU shout-out to @Schreibfaul1 for his great AUDIO.H library and his support!

Installation & Customizing

  • Libraries: see above. Use the latest esp32 core for the Arduino IDE: arduino-esp32. AUDIO.H: download the correct library zip file (with PSRAM here, without PSRAM 3.0.11g here) and add it in the Arduino IDE via Sketch -> Include Library -> Add .ZIP Library
  • Copy all .ino files into the same folder (it is one sketch, split into multiple Arduino IDE tabs)
  • Insert your credentials (SSID, password) and the 3 API KEYS in the header of the main sketch KALO_ESP32_Voice_AI_Friends.ino (a sketch of these settings follows this list)
  • Update your hardware pin assignments (PCB template) in the main sketch KALO_ESP32_Voice_AI_Friends.ino
  • Update your microphone pins and audio storage settings (PSRAM and/or SD card) in lib_audio_recording.ino
  • Create your own 1-N 'AI Friends' characters in the header of the new lib_OpenAI_Chat.ino
  • Optional: review the default settings in the header of each .ino (e.g. the DEBUG toggle in main.ino, recording parameters in 'lib_audio_recording.ino')
  • Optional: copy the audio file 'Welcome.wav' to the ESP32 SD card; it is played on power-on ('gong' sound)
  • In case of a COMPILER ERROR on audio_play.openai_speech(): check/update the last line of code in the main sketch. Background: the number of openai_speech() parameters changed with the latest AUDIO.H versions.
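A hypothetical example of those header settings (the macro names below are invented for illustration; check the header of KALO_ESP32_Voice_AI_Friends.ino for the real setting names):

```cpp
// Hypothetical example only; the real setting names are in the header of
// KALO_ESP32_Voice_AI_Friends.ino.
#define WIFI_SSID        "your_ssid"
#define WIFI_PASSWORD    "your_password"

#define ELEVENLABS_KEY   "..."   // STT (or use a Deepgram key instead)
#define OPENAI_KEY       "..."   // Open AI: TTS and web-search LLM
#define GROQ_KEY         "..."   // GroqCloud: chat LLM
```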

Known issues

  • ESP32 boards without PSRAM are limited (because the older AUDIO.H stresses the heap). Well-known limitations: Open AI TTS voice instructions are not supported, LED response is delayed, audio streaming (radio) does not always work, and Open AI TTS audio output is sometimes missed (workaround for missed TTS: a short press on the record button repeats the TTS).

New features since August 2025

  • Supporting multiple custom chatbots/FRIENDS, activating any friend by calling his/her name
  • Each chatbot can be assigned different TTS parameters (voice characteristics) in the FRIENDS[] agent structure (a sketch follows this list)
  • Faster LLM responses since supporting the fast GroqCloud server API websockets (~2x faster than Open AI)
  • The GroqCloud server API allows LLM models from various providers (e.g. Meta, OpenAI, DeepSeek, PlayAI, Alibaba etc.), more details here: models. Posted code (default settings): using Meta "llama-3.1-8b-instant" as CHAT model (low cost, high performance), and Open AI 'gpt-4o..search' models for WEBSEARCH
  • New commands, e.g. "DEBUG ON|OFF" to toggle printed details, or speaking "HASHTAG" to trigger the "#" command
  • Several minor bug fixes, e.g. sending the LLM payload in chunks, keeping websockets open only on ESP32 with PSRAM
  • Cleaned up user-specific settings in the headers of the .ino files
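A rough sketch of what such a FRIENDS[] agent structure can look like. The field names and example values below are invented for illustration; the real definition lives in the header of lib_OpenAI_Chat.ino and may use different fields.

```cpp
// Hypothetical FRIENDS[] agent structure, for illustration only.
struct AIFriend {
  const char* name;             // wake name, e.g. "FRED" (spoken: "Hi FRED ...")
  const char* system_prompt;    // personality / role of the chatbot
  const char* tts_voice;        // Open AI TTS voice, e.g. "onyx"
  const char* tts_instruction;  // optional voice character (PSRAM + gpt-4o-mini-tts)
};

AIFriend FRIENDS[] = {
  { "ONYX",  "You are a good old friend of the user ...",              "onyx", "" },
  { "FRED",  "You are Fred, a constantly annoyed guy ...",             "ash",  "You sound annoyed and grumpy." },
  { "VEGGI", "You are Veggi, best friend of vegan and healthy food ...", "nova", "" },
};
```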

New features since June 2025

  • PSRAM supported for audio recording and transcription. An SD card is no longer needed for ESP32 with PSRAM (tested with ESP32-WROVER and ESP32-S3). User #define settings for audio processing (#define RECORD_PSRAM / SDCARD)
  • Additional parameters added to the recording and transcription functions 'Recording_Stop()' and 'SpeechToText_xy()'
  • Additional STT added: supporting the ElevenLabs Scribe v1 SpeechToText API (as an alternative to Deepgram STT), with multilingual support (mixed languages in the same recording are also supported); country codes are no longer needed. Registration for an API KEY is needed (link), the cost-free account is supported (a request sketch follows this list)
  • SpeechToText speed significantly improved (using ElevenLabs STT), in particular on long sentences! Example: short user voice recordings (e.g. 5 secs) are transcribed in ~0.5-2 secs (compared to ~5 secs with Deepgram); long user recordings (e.g. 20 secs!) are transcribed in ~3 secs (compared to ~15 secs with Deepgram)
  • SpeechToText is multi-lingual, detecting your spoken language automatically (~100 languages supported). The longer your spoken sentence (audio), the better the language detection
  • New keyword 'GOOGLE': activates the Open AI web-search model for live information requests
  • New keyword 'VOICE': enables the new voice instruction parameter of Open AI TTS (the user can force the TTS character, e.g. "you are whispering"). PSRAM needed
  • 'isRunning()' bug fix: correct audio end detection (previously the LED still indicated playing 1-2 secs after the audio finished), solved with the latest AUDIO.H version. PSRAM needed
  • Updated to the latest models. Open AI LLM: 'gpt-4.1-nano' & 'gpt-4o-mini-search-preview'. Deepgram STT: new MULTI-lingual model 'nova-3-general' added, 'nova-2-general' still used for MONO-lingual. ElevenLabs STT: 'scribe_v1'. Open AI TTS: 'gpt-4o-mini-tts' (for voice instructions) and 'tts-1' (default)
  • Open AI TTS improved: audio streaming quality improved (no more clicking artefacts at the beginning; letter cut-offs at the end resolved). PSRAM needed
  • Minor bugs resolved, more detailed comments added to the sketch, code cleaned up.
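To illustrate what an ElevenLabs Scribe v1 request looks like on the wire, here is a simplified, standalone upload sketch. It is not this repo's STT code: no timeout or error handling, and certificate validation is disabled for brevity. The endpoint, the xi-api-key header and the multipart fields "file" / "model_id" are the ElevenLabs SpeechToText API as documented by ElevenLabs; the function name and structure here are illustrative.

```cpp
// Simplified, standalone illustration of an ElevenLabs Scribe v1 STT upload
// (not this repo's STT code; no timeout/error handling, TLS check disabled).
#include <WiFiClientSecure.h>

String elevenlabs_stt(const uint8_t* wav, size_t wavLen, const char* apiKey) {
  WiFiClientSecure client;
  client.setInsecure();                                  // demo only
  if (!client.connect("api.elevenlabs.io", 443)) return "";

  String boundary = "----esp32form";
  String head = String("--") + boundary + "\r\n" +
                "Content-Disposition: form-data; name=\"model_id\"\r\n\r\n" +
                "scribe_v1\r\n" +
                "--" + boundary + "\r\n" +
                "Content-Disposition: form-data; name=\"file\"; filename=\"rec.wav\"\r\n" +
                "Content-Type: audio/wav\r\n\r\n";
  String tail = String("\r\n--") + boundary + "--\r\n";

  String reqHead =
      String("POST /v1/speech-to-text HTTP/1.1\r\n") +
      "Host: api.elevenlabs.io\r\n" +
      "xi-api-key: " + apiKey + "\r\n" +
      "Content-Type: multipart/form-data; boundary=" + boundary + "\r\n" +
      "Content-Length: " + String(head.length() + wavLen + tail.length()) + "\r\n\r\n";

  client.print(reqHead);
  client.print(head);
  client.write(wav, wavLen);                             // complete WAV (44-byte header + PCM)
  client.print(tail);

  String response;                                       // raw HTTP response; the JSON
  while (client.connected() || client.available()) {     // body carries the transcription
    if (client.available()) response += (char)client.read();   // in its "text" field
  }
  return response;
}
```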

GitHub Updates

  • 2025-08-11: Major update (see above). Supporting custom chatbots/FRIENDS, LLM AI response 2x faster (GroqCloud)
  • 2025-06-28: Hardware pin assignments cleaned up (3 PCB templates), Techiesms Portable AI Voice Assistant supported by default (no longer #define TECHIESMS_PCB true needed). Added audio VOL_BTN to all devices without POTI
  • 2025-06-19: Supporting ESP32-S3 I2S audio recording; supporting Elato AI DIY pcb and Elato AI products
  • 2025-06-05: Major update, details see above (PSRAM support, ElevenLabs Scribe v1 STT support added, faster STT)
  • 2025-04-04: Live information request capabilities added (supporting the new Open AI web search features). Mixed support of the chat model and the web-search model: user queries with a user-defined keyword initiate a web search and embed the result in the ongoing chat. Minor changes: all user-specific credentials moved to the header of the main .ino sketch (KALO_ESP32_Voice_ChatGPT_20250404.ino), additional parameters added to the functions Open_AI(..) and SpeechToText_Deepgram(..). Code further cleaned up, detailed comments added in 'lib_OpenAI_Chat.ino'
  • 2025-03-14: Major enhancements: supporting the Techiesms hardware/PCB Portable AI Voice Assistant. Code insights: the new toggle '#define TECHIESMS_PCB true' assigns all specific pins automatically (no user code changes needed). Minor enhancements: welcome voice (Open AI) added, RGB LED colors updated, code clean-up done
  • 2025-01-26: First drop, already working, not finally cleaned up (just posted this drop on some folks' request).

Next steps

  • Next upgrade will add some additional AI features: done with the latest August 11 release.

. . .

Demo Videos

Video 01 (Jan. 27, 2025) – 1st drop:

Video - KALO-ESP32-Voice-ChatGPT

5-minute example chat, using the code default settings. System prompt: role of a 'good old friend'.

  • Deepgram STT language: English (en-US)
  • TTS voice: Multilingual (Open AI default), used voice in video: 'onyx'
  • Overlay window: Serial Monitor I/O in real time (using Terminal App PuTTY, just for demo purposes)
  • Details of interest (m:ss): 1:35 (BTN stops Audio), 2:05 (Radio gimmick), 3:15 (multi-lingual capabilities)

Video 02 (June 19, 2025) – Faster STT (SpeechToText via ElevenLabs) & Open AI Realtime websearch:

2025-06-19 Video - KALO-ESP32-Voice-ChatGPT

3:30-minute dialog, using the latest code with default settings. Same system prompt (role of a 'good old friend').
STT: faster (multilingual) ElevenLabs Scribe v1; TTS: Open AI voice 'onyx'; Open AI LLM: web search included.

  • Hardware: using the battery-powered Techiesms Portable AI Voice Assistant with ESP32 & SD card, mounted on a laser-cut acrylic chassis (3.7V Li-Po battery placed inside the double bottom), with a metal control (fixed steel button) added as TOUCH button (connected to GPIO-12)
  • Details of interest: playing the gong audio (welcome.wav on SD card), then the real-time generated (German) TTS welcome voice. Multilingual: asking Open AI to switch from (my) default German to English (min:sec 0:25). Recording long sentences, STT transcription still < 2 secs (confirmed by short white LED flashes). Live information request: the keyword 'Google' (min:sec 1:55) activates the Open AI web-search model once (weather forecast 'today', June 19), embedded into the follow-up dialog until the end of the session. Detail (min:sec 2:32): a short record touch 'interrupts' TTS (a 2nd touch would 'repeat' the last TTS).

NEW Video 03 (August 15, 2025) – AI Friends update (multiple chatbots) & LLM AI speed improved (Groq API):

2025-08-15 Video - KALO-ESP32-Voice-AI-Friends (Elato ESP32S3)

6:13-minute dialog, latest August code with multiple AI friends on an ESP32-S3 Elato device (default code settings).

  • Hardware: using a battery-powered older version of the Elato AI device, ESP32-S3 with PSRAM (no SD card). Hint: a tiny metal screw is mounted as TOUCH record button (the original side button is used for audio volume)
  • Demo: chatting with 3 of my friends (Onyx, Veggi, Fred), waking up the buddies by calling them by name, web search included. Using PSRAM and the latest AUDIO.H allows voices with emotions
  • Details of interest: jumping from the 'good old default friend' ONYX to the 'food specialist' VEGGI (min:sec 0:28), interrupting VEGGI on demand (1:57), waking up FRED, the annoyed and aggressive buddy with an aggressive voice (2:00); listen to the emotion in his voice, e.g. at 3:30 (hint: PSRAM is mandatory), waking up my earlier friendly ONYX again (fails at 3:50, no big deal: just calling him again at 4:09), and finally starting a Google search on request (4:18) to fetch the latest data from TODAY (August 15!). Also of interest: the Google web search is embedded into the chat (ONYX himself speaks about Google at 5:30).
  • RGB color (recap): STT (cyan) -> LLM AI (blue) -> TTS (yellow>pink, looks white in video). WHITE flash = Success. NEW: Double RED flash = waking up another FRIEND.

. . .

Links of interest, featuring friends' projects:
