DIY ESP32-S3 Voice Assistant V0.2: Upgrade for Real Voice Input (INMP441 Mic + 16MB Memory)
by circuitsmiles in Circuits > Microcontrollers
Welcome to the V0.2 Major Upgrade of our ESP32 AI Chat Bot!
In the previous version, our "assistant" could only respond to predefined text prompts selected by a button. This was great for efficiency and privacy, but let's face it: we want to talk to our devices!
This guide will walk you through the essential steps to transform the project into a true, conversational Voice Assistant capable of real-time speech processing. The key changes involve a critical hardware upgrade to manage audio memory and the addition of a high-quality digital microphone.
Supplies
Hardware Components:
- Microcontroller: ESP32-S3 N16R8 (16 MB flash, 8 MB PSRAM)
- Display: 0.96" OLED Display (SSD1306, I2C interface)
- Audio Output: MAX98357A I2S Class-D Amplifier + Small 8-ohm Speaker
- Audio Input: INMP441 I2S Microphone
- User Input: 2x Tactile Buttons
- Visual Cues: Onboard RGB LED
- Miscellaneous: Breadboard, Jumper Wires (male-to-male), USB Power Supply (at least 1A)
Software & Accounts:
- Visual Studio Code with PlatformIO
- Python 3 (for the server)
- A Google API Key (for Gemini API access)
The Wiring - Connecting Everything Up
This is where the physical build comes together. Take your time, double-check connections, and ensure your ESP32 is powered off while wiring. All GND pins from components should connect to a common ground rail on your breadboard.
Firmware Flash - Programming the ESP32
We are using PlatformIO for easy management of the large firmware and the ESP32-S3's unique memory configuration.
- Download Project: Clone the V0.2 code base from GitHub.
- PlatformIO Setup: Open the project in VS Code with the PlatformIO extension installed.
- Configuration: Update the platformio.ini file to correctly specify the ESP32-S3-N16R8 with appropriate memory/partition settings.
- Modify: Update the server address in the firmware so it points at your Python server (you can find the address once the server is running, as set up in the next step).
- Compile & Upload: Use the PlatformIO Build and Upload buttons to flash the firmware onto your ESP32-S3 board.
- WiFi: If your Wi-Fi credentials are not already set up, you will be prompted to configure them after the first upload.
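For orientation, a platformio.ini environment for an N16R8 board often looks roughly like the fragment below. This is only a sketch: it assumes the Arduino framework and a DevKitC-1 style board definition, and the environment name is made up for illustration. The platformio.ini shipped in the project repository takes precedence over anything shown here.

```ini
; Hypothetical environment for an ESP32-S3 N16R8 board (assumes the Arduino framework)
[env:esp32-s3-n16r8]
platform = espressif32
board = esp32-s3-devkitc-1
framework = arduino
; 16 MB quad-SPI flash paired with 8 MB octal-SPI PSRAM
board_build.arduino.memory_type = qio_opi
board_build.flash_mode = qio
board_upload.flash_size = 16MB
; Partition table sized for 16 MB flash (ships with the Arduino core)
board_build.partitions = default_16MB.csv
; Enable PSRAM support in the Arduino core
build_flags = -DBOARD_HAS_PSRAM
monitor_speed = 115200
```

If the board boots but reports no PSRAM, the memory_type line is usually the first setting worth checking.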
The AI Server - Python & Gemini
The server now needs to handle raw audio data (or transcribed text) coming from the ESP32, which is captured by the INMP441. The server code is available in the project's GitHub repository.
- Install Python: Ensure you have Python 3 installed.
- Virtual Environment (Recommended): Create one with python3 -m venv venv, then activate it with source venv/bin/activate (macOS/Linux) or venv\Scripts\activate (Windows).
- Install Dependencies: Run pip install -r requirements.txt
- Get Gemini API Key: Go to Google AI Studio to get your GEMINI_API_KEY.
- Create .env file: In the same directory as your server.py file, create a new file named .env and add: GEMINI_API_KEY="YOUR_API_KEY_HERE"
- Run the Server: Open a terminal in your server's directory and run: python server.py. The server will now be running, waiting for requests from your ESP32!
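To give a feel for what "handling raw audio" can involve on the server side, here is a minimal sketch that wraps raw PCM bytes in a WAV container using only the Python standard library. It is not taken from the project's server.py; the function name and the audio format (16 kHz, 16-bit, mono) are assumptions for illustration.

```python
import io
import wave


def pcm_to_wav(pcm: bytes,
               sample_rate_hz: int = 16_000,   # assumed capture rate
               channels: int = 1,              # INMP441 is a mono microphone
               sample_width_bytes: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width_bytes)
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(pcm)
    # The wave module leaves a caller-supplied buffer open, so it can be read here.
    return buffer.getvalue()


if __name__ == "__main__":
    one_second_of_silence = b"\x00\x00" * 16_000
    wav_bytes = pcm_to_wav(one_second_of_silence)
    print(len(wav_bytes), "bytes including the WAV header")
```

A WAV file is convenient because it carries the sample rate and width with the data, so whatever consumes the audio next does not have to guess the format.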
Testing the Voice Assistant
Operation: Your AI Assistant Is Ready!
- Status Check: The OLED should display "Ready" and the onboard RGB LED should show the "idle" color.
- Start Recording: Press Button 1. The onboard RGB LED should change color (e.g., to green) to indicate that it is listening.
- Speak: Ask your question!
- Stop/Timeout: Recording will stop after 6 seconds or when you press Button 2.
- AI Response: The OLED will show "Thinking..." and then the response audio will play via the speaker.
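The steps above amount to a small state machine. The sketch below models it in Python purely to make the flow explicit; the real firmware is not written this way, and the state and event names are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    READY = auto()      # OLED shows "Ready", LED shows the idle color
    LISTENING = auto()  # recording from the microphone
    THINKING = auto()   # OLED shows "Thinking...", waiting for the server


# (current state, event) -> next state; event names are made up for this sketch
TRANSITIONS = {
    (State.READY, "button1"): State.LISTENING,
    (State.LISTENING, "button2"): State.THINKING,   # manual stop
    (State.LISTENING, "timeout"): State.THINKING,   # 6-second limit reached
    (State.THINKING, "playback_done"): State.READY,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, ignoring events that do not apply."""
    return TRANSITIONS.get((state, event), state)


if __name__ == "__main__":
    state = State.READY
    for event in ["button1", "timeout", "playback_done"]:
        state = next_state(state, event)
        print(event, "->", state.name)
```

Ignoring unknown events means a stray press of Button 2 while idle, for example, simply does nothing.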
Conclusion
Conclusion: The Power of the Upgrade
The ESP32 Voice Assistant V0.2 is a big step up from V0.1. By switching to the memory-rich ESP32-S3-N16R8 and integrating the INMP441 I2S Microphone, we overcame the memory hurdles of V0.1. We have transformed a button-driven prompt machine into a conversational, voice-input-enabled AI device, all while keeping the build clean by using the onboard RGB LED for status cues. This project shows that a capable voice front end for an AI assistant is well within reach of a modern, inexpensive microcontroller.
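A quick back-of-the-envelope calculation shows why memory matters here. The sketch below estimates the buffer needed for one 6-second recording; the 16 kHz sample rate and 16-bit sample width are assumptions, not values confirmed by the project code.

```python
def recording_bytes(seconds: int,
                    sample_rate_hz: int = 16_000,  # assumed capture rate
                    bits_per_sample: int = 16,     # assumed stored sample width
                    channels: int = 1) -> int:
    """Size in bytes of an uncompressed PCM recording."""
    bytes_per_sample = bits_per_sample // 8
    return seconds * sample_rate_hz * bytes_per_sample * channels


if __name__ == "__main__":
    size = recording_bytes(6)
    print(size)         # 192000
    print(size / 1024)  # 187.5 (KiB)
    # The INMP441 delivers 24-bit samples in 32-bit slots; storing them
    # unconverted doubles the buffer.
    print(recording_bytes(6, bits_per_sample=32))  # 384000
```

Under these assumptions a single recording takes a large share of the ESP32-S3's 512 KB of internal SRAM, before counting the response audio, Wi-Fi, and display buffers. The 8 MB of PSRAM on the N16R8 leaves plenty of headroom.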
Future Enhancements (V0.3 Preview)
While the two-button system provides reliable, user-controlled recording, the ultimate goal for a voice assistant is completely hands-free interaction. For V0.3, our focus will be on removing the buttons entirely by implementing Wake Word Detection.
Planned V0.3 Enhancements:
- Hands-Free Activation: Implement a wake word model (e.g., using TinyML or a platform like Edge Impulse) to allow the ESP32-S3 to constantly monitor audio from the INMP441 microphone.
- Button Removal: The current Start and Stop buttons will be eliminated. Recording will automatically begin when the wake word is detected and end after a pause in speech (or a defined timeout).
- Optimized Power: Explore deep sleep modes or highly optimized wake word libraries to ensure the always-listening state doesn't drain the power source excessively.
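Ending a recording "after a pause in speech (or a defined timeout)" could be approached with a simple energy threshold. The sketch below is one possible way to do it, written in Python for readability rather than as firmware code; the function names, threshold, and frame counts are all hypothetical and would need tuning against real microphone data.

```python
import math
from typing import Iterable, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def frames_until_stop(frames: Iterable[Sequence[int]],
                      silence_threshold: float = 500.0,  # assumed, needs tuning
                      max_silent_frames: int = 25,       # 0.5 s of 20 ms frames
                      max_frames: int = 300) -> int:     # 6 s of 20 ms frames
    """Count frames recorded before a pause or the timeout ends the recording."""
    silent_run = 0
    recorded = 0
    for frame in frames:
        recorded += 1
        if rms(frame) < silence_threshold:
            silent_run += 1
        else:
            silent_run = 0  # speech resumed, restart the pause counter
        if silent_run >= max_silent_frames or recorded >= max_frames:
            break
    return recorded


if __name__ == "__main__":
    loud = [1000] * 320   # 20 ms at 16 kHz
    quiet = [0] * 320
    print(frames_until_stop([loud] * 10 + [quiet] * 50))  # 35
```

A fixed threshold is fragile in noisy rooms, so a real implementation would likely adapt it to the background level or use a proper voice activity detector.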
Stay tuned for the next evolution of the project, where we finally achieve a fully seamless, voice-activated AI experience!