DIY ESP32-S3 Voice Assistant V0.2: Upgrade for Real Voice Input (INMP441 Mic + 16MB Memory)

by circuitsmiles in Circuits > Microcontrollers

Welcome to the V0.2 Major Upgrade of our ESP32 AI Chat Bot!

In the previous version, our "assistant" could only respond to predefined text prompts selected by a button. This was great for efficiency and privacy, but let's face it: we want to talk to our devices!

This guide walks you through the steps that turn the project into a conversational Voice Assistant that takes spoken input. There are two key changes: a hardware upgrade to a board with enough memory to buffer audio, and the addition of an INMP441 digital microphone.

Supplies

Hardware Components:

Microcontroller: ESP32-S3-N16R8 (16MB flash, 8MB PSRAM)
Display: 0.96" OLED Display (SSD1306, I2C interface)
Audio Output: MAX98357A I2S Class-D Amplifier + Small 8-ohm Speaker
Audio Input: INMP441 I2S Microphone
User Input: 2x Tactile Buttons
Visual Cues: Onboard RGB LED
Miscellaneous: Breadboard, Jumper Wires (male-to-male), USB Power Supply (at least 1A)

Software & Accounts:

Visual Studio Code with PlatformIO
Python 3 (for the server)
A Google API Key (for Gemini API access)

The Wiring - Connecting Everything Up

This is where the physical build comes together. Take your time, double-check connections, and ensure your ESP32 is powered off while wiring. All GND pins from components should connect to a common ground rail on your breadboard.

Firmware Flash - Programming the ESP32

We are using PlatformIO for easy management of the large firmware and the ESP32-S3's unique memory configuration.

Download Project: Clone the V0.2 code base from GitHub.
PlatformIO Setup: Open the project in VS Code with the PlatformIO extension installed.
Configuration: Update the platformio.ini file so that it targets the ESP32-S3-N16R8 with matching memory and partition settings.
Server Address: Update the server address in the firmware. You can get the address once the server is running (see the next section).
Compile & Upload: Use the PlatformIO Build and Upload buttons to flash the firmware onto your ESP32-S3 board.
WiFi: If your WiFi credentials are not already set up, you will be prompted to enter them after the first upload.
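
If you are unsure which address to enter in the firmware, the short Python sketch below prints the likely LAN address of the computer that will run the server. It is only a helper for this step and is not part of the project code; the function name guess_lan_ip is made up for illustration, and the port your server listens on still has to come from the server itself.

```python
import ipaddress
import socket


def guess_lan_ip() -> str:
    """Return this machine's likely LAN IPv4 address, or 127.0.0.1 as a fallback."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only asks the OS
        # which local interface it would use to reach the given address.
        sock.connect(("192.0.2.1", 80))  # documentation-only address, never contacted
        return sock.getsockname()[0]
    except OSError:
        # No usable network route (for example, an offline machine).
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    ip = guess_lan_ip()
    ipaddress.IPv4Address(ip)  # raises ValueError if this is not a valid IPv4 address
    print(f"Likely server address for the firmware: {ip}")
```

If the script prints 127.0.0.1, the computer has no network route and the ESP32 will not be able to reach it; connect it to the same network as the ESP32 and run the script again.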

The AI Server - Python & Gemini

The server now needs to handle raw audio data (or transcribed text) coming from the ESP32, which is captured by the INMP441. The server code is available on GitHub.
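
To give a feel for what "handling raw audio" can involve, here is a minimal sketch that wraps raw PCM samples in a WAV container using only the Python standard library. It is not taken from the project's server.py. The sample rate and 16-bit sample width are assumptions, and pcm_to_wav is a hypothetical helper name; check the firmware for the real capture settings.

```python
import io
import wave

SAMPLE_RATE_HZ = 16_000   # assumption: capture rate, not confirmed by the project
SAMPLE_WIDTH_BYTES = 2    # assumption: 16-bit samples
CHANNELS = 1              # the INMP441 is a mono microphone


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES)
    wav_bytes = pcm_to_wav(one_second_of_silence)
    print(f"WAV size: {len(wav_bytes)} bytes")
```

A WAV file carries its own sample rate and width in the header, which makes the recording easier to save for debugging or to pass to a tool that expects a standard audio file.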

Install Python: Ensure you have Python 3 installed.
Virtual Environment (Recommended):
python3 -m venv venv
source venv/bin/activate (macOS/Linux) or venv\Scripts\activate (Windows)
Install Dependencies: Run pip install -r requirements.txt
Get Gemini API Key: Go to the Google AI Studio to get your GEMINI_API_KEY.
Create .env file: In the same directory as your server.py file, create a new file named .env and add: GEMINI_API_KEY="YOUR_API_KEY_HERE"
Run the Server: Open a terminal in your server's directory and run python server.py there. The server will then be running and waiting for requests from your ESP32!
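
Before launching the server, it can help to confirm that the .env file is in place and that the placeholder has been replaced. The sketch below is a hypothetical pre-flight check written with the standard library only. It assumes the simple KEY="value" format shown above, and it is not how server.py itself loads the key.

```python
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY="value" lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def api_key_is_set(path: str = ".env") -> bool:
    """Return True when GEMINI_API_KEY exists and is not the placeholder text."""
    if not Path(path).is_file():
        return False
    key = read_env_file(path).get("GEMINI_API_KEY", "")
    return bool(key) and key != "YOUR_API_KEY_HERE"


if __name__ == "__main__":
    if api_key_is_set():
        print("GEMINI_API_KEY found in .env")
    else:
        print("GEMINI_API_KEY is missing or still set to the placeholder")
```

Run it from the same directory as server.py so that the relative .env path resolves correctly.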

Testing the Voice Assistant

Operation: Your AI Assistant Is Ready!

Status Check: The OLED should display "Ready" and the onboard RGB LED should show the "idle" color.
Start Recording: Press Button 1. The onboard RGB LED should change color (e.g., to green) to indicate that the device is listening.
Speak: Ask your question!
Stop/Timeout: Recording will stop after 6 seconds or when you press Button 2.
AI Response: The OLED will show "Thinking..." and then the response audio will play via the speaker.
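
The 6-second recording limit also explains why the memory upgrade matters. The arithmetic below is a rough estimate of the buffer one full recording needs. The 16 kHz sample rate and 16-bit mono format are assumptions for illustration, not values confirmed by the firmware.

```python
RECORD_SECONDS = 6        # recording limit stated in this guide
SAMPLE_RATE_HZ = 16_000   # assumption: a common rate for speech capture
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono samples


def recording_bytes(seconds: int, sample_rate_hz: int, bytes_per_sample: int) -> int:
    """Return the size in bytes of an uncompressed mono recording."""
    return seconds * sample_rate_hz * bytes_per_sample


size = recording_bytes(RECORD_SECONDS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)
print(f"{size} bytes ({size / 1024:.1f} KiB)")  # 192000 bytes (187.5 KiB)
```

Under these assumptions a single recording takes about 187.5 KiB. That is a large share of the ESP32-S3's 512 KB of internal SRAM, which is also used by the WiFi stack and the rest of the firmware, but only a small fraction of the 8MB of PSRAM on the N16R8 module.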

Conclusion: The Power of the Upgrade

The ESP32 Voice Assistant V0.2 represents a massive leap forward. By making the strategic switch to the memory-rich ESP32-S3-N16R8 and integrating the INMP441 I2S Microphone, we successfully overcame the memory hurdles of V0.1. We have transformed a button-driven prompt machine into a conversational, voice-input-enabled AI device, all while keeping the build clean by using the onboard RGB LED for status cues. This project shows that a voice front end for a server-hosted AI model is well within reach of a modern microcontroller.

Future Enhancements (V0.3 Preview)

While the two-button system provides reliable, user-controlled recording, the ultimate goal for a voice assistant is completely hands-free interaction. For V0.3, our focus will be on removing the buttons entirely by implementing Wake Word Detection.

Planned V0.3 Enhancements:

Hands-Free Activation: Implement a wake word model (e.g., using TinyML or a platform like Edge Impulse) to allow the ESP32-S3 to constantly monitor audio from the INMP441 microphone.
Button Removal: The current Start and Stop buttons will be eliminated. Recording will automatically begin when the wake word is detected and end after a pause in speech (or a defined timeout).
Optimized Power: Explore deep sleep modes or highly optimized wake word libraries to ensure the always-listening state doesn't drain the power source excessively.
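
As an illustration of the "end after a pause in speech" idea, here is a minimal energy-threshold sketch in Python. It is not the planned V0.3 implementation and it does not cover wake word detection. The 16 kHz rate, frame size, and threshold values are all hypothetical and would need tuning against real INMP441 recordings.

```python
import array
import math
from typing import Optional

FRAME_SAMPLES = 320          # 20 ms per frame at an assumed 16 kHz sample rate
SILENCE_RMS = 500.0          # hypothetical loudness threshold for 16-bit samples
SILENCE_FRAMES_TO_STOP = 40  # 40 quiet frames = 0.8 s of pause


def frame_rms(frame: array.array) -> float:
    """Return the root-mean-square amplitude of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_end_of_speech(samples: array.array) -> Optional[int]:
    """Return the sample index where recording could stop, or None.

    Recording stops once speech has been heard and is then followed by
    SILENCE_FRAMES_TO_STOP consecutive quiet frames.
    """
    heard_speech = False
    quiet_run = 0
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        frame = samples[start:start + FRAME_SAMPLES]
        if frame_rms(frame) >= SILENCE_RMS:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= SILENCE_FRAMES_TO_STOP:
                return start + FRAME_SAMPLES
    return None
```

Leading silence is ignored on purpose, so the recording does not end before the user has started speaking. A defined timeout, like the current 6-second limit, would still be needed for the case where no pause is ever detected.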

Stay tuned for the next evolution of the project, where we finally achieve a fully seamless, voice-activated AI experience!