Create Your Own AI Voice Agent Using EchoKit, ESP32, and Rust
by Vivian Hu in Circuits > Gadgets
Have you ever wanted to build your very own voice AI Agent — one that actually talks back to you?
In this tutorial, we'll show you how to build a fun and interactive AI assistant using EchoKit, a powerful yet easy-to-build voice AI agent powered by ESP32.
As an open-source project, EchoKit not only lets you play with cutting-edge AI, but it also allows you to understand the underlying technology and modify it to suit your needs — perfect for classrooms, makerspaces, or personal AI projects.
In just a few minutes, you’ll have EchoKit talking back to you and understanding your commands — whether you’re a student, teacher, or maker passionate about exploring AI.
Supplies
You’ll need:
- EchoKit board (ESP32-based, open-source AI hardware, available at echokit.dev)
- USB-C cable
- Laptop / PC (Windows, macOS, or Linux)
- Wi-Fi connection
- API keys and endpoint URLs for the ASR (Whisper), LLM, and TTS models
Assemble the EchoKit Device
When you receive your EchoKit device, you’ll find four key components:
- The ESP32-S3 development board
- The extension board, which includes the audio and microphone module
- A mini speaker
- A 1.54" LCD screen
The assembly process is simple — just follow these steps:
- Insert the mini speaker into the audio module on the extension board.
- Attach the ESP32-S3 development board to the extension board.
- Plug the LCD screen into the designated slot at the top of the extension board.
- That’s it — your EchoKit is now ready for the next step!
Flash the EchoKit Device
Now that your EchoKit is assembled, it’s time to flash the firmware.
- Connect EchoKit to your computer using the included USB-C cable.
- Use the ESP32 launchpad to easily flash the firmware. Open the launchpad, follow the instructions to “Connect” and then “Flash.”
- After flashing, you should see a QR code and hear a welcoming voice — your EchoKit is ready to go!
Alternatively, you can flash the firmware from the command line with espflash. See the EchoKit documentation for details.
Set Up the Server
Now comes the exciting part: setting up the server to power your voice AI agent!
The EchoKit server is responsible for managing communication between your device and AI services like Whisper (ASR), LLM, and TTS. It’s fully customizable — giving you control over your AI’s responses, voice, and more.
- For a quick setup, you can use one of the servers provided by the EchoKit team and skip ahead to Step 4 (Connect the Server and Device).
- USA: ws://indie.echokit.dev/ws/
- Asia: ws://hk.echokit.dev/ws/
- But if you prefer full customization, I recommend setting up your own server. You’ll be able to tweak every aspect of your AI’s behavior, from response generation to voice synthesis. Make sure you have Rust installed.
To start, you’ll need to download the server code. Open a terminal and run the following command:
Once you’ve cloned the repository, open the config.toml file. In this file, you will configure the following:
- ASR (Automatic Speech Recognition): The service that receives your recorded audio and transcribes it into text.
- LLM (Large Language Model): The model used to generate AI responses.
- TTS (Text-to-Speech): The model that converts text responses into speech.
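To make the three settings above more concrete, here is a minimal Rust sketch of how they could be modeled and checked before starting the server. The type names, field names, and example.com URLs are assumptions made for illustration only; they are not EchoKit's actual config.toml schema.

```rust
// Hypothetical model of the three services the server needs to reach.
// Names are illustrative and do NOT mirror EchoKit's real config schema.
#[derive(Debug, Clone)]
struct ServiceConfig {
    url: String,
    api_key: String,
}

#[derive(Debug, Clone)]
struct PipelineConfig {
    asr: ServiceConfig,
    llm: ServiceConfig,
    tts: ServiceConfig,
}

impl PipelineConfig {
    /// Returns the names of services whose URL or API key is still empty.
    fn missing(&self) -> Vec<&'static str> {
        let services = [("asr", &self.asr), ("llm", &self.llm), ("tts", &self.tts)];
        services
            .iter()
            .filter(|(_, s)| s.url.trim().is_empty() || s.api_key.trim().is_empty())
            .map(|(name, _)| *name)
            .collect()
    }
}

/// Small helper to build a ServiceConfig from string slices.
fn service(url: &str, api_key: &str) -> ServiceConfig {
    ServiceConfig {
        url: url.to_string(),
        api_key: api_key.to_string(),
    }
}

fn main() {
    // Placeholder values only; substitute your provider's real endpoints and key.
    let config = PipelineConfig {
        asr: service("https://example.com/asr", "YOUR_API_KEY"),
        llm: service("https://example.com/llm", "YOUR_API_KEY"),
        tts: service("https://example.com/tts", ""),
    };
    println!("incomplete services: {:?}", config.missing());
}
```

A check like this catches a forgotten API key early, before the device ever tries to connect.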
For convenience, I recommend using Groq here. You most likely won't need to pay anything for the usage in this project.
Below is an example configuration for Groq. You only need to add your own API key.
Once you have configured these parameters, run the following command to start the server.
If everything goes well, you’ll see output like this in your terminal:
The server is now running. Next, let's connect the server and the device.
Connect the Server and Device
Before you begin, make sure you have the EchoKit server running on your local machine or on a remote server. If you're running the EchoKit server locally, follow the setup instructions to start the server before proceeding.
- Open https://echokit.dev/setup/ in a browser that supports Web Bluetooth, such as Chrome.
- Click the “Connect to EchoKit” button to start pairing your EchoKit device.
- Then, you’ll need to enter the following information:
- Wi-Fi Name: Enter the name of your 2.4 GHz Wi-Fi network.
- Wi-Fi Password: Enter the password for the Wi-Fi network.
- Server URL: For your own server, use the format ws://192.168.1.56:8080/ws, replacing 192.168.1.56 with the IP address of your server and 8080 with the port number where your EchoKit server is running. If you are using the EchoKit team's servers instead, enter one of these:
- USA: ws://indie.echokit.dev/ws/
- Asia: ws://hk.echokit.dev/ws/
- Apply the Settings
- Press the K0 button to apply these settings and establish the connection.
Once you've done this, you’ll see the progress on the EchoKit screen, which will show steps like "Restarting Device", "Connecting to Wi-Fi", and "Connecting to Server." When the process is complete, you’ll hear a welcome voice, and the screen will display "Hello Set."
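If you are running your own server, a typo in the Server URL is an easy mistake to make. The sketch below builds a URL in the ws://IP:PORT/ws format described above from a LAN IPv4 address and a port. The function name build_server_url is hypothetical, and the check only covers the self-hosted case (the hosted servers use hostnames rather than IP addresses).

```rust
use std::net::Ipv4Addr;

/// Builds a Server URL in the `ws://<ip>:<port>/ws` form.
/// Returns an error message if the IP is not valid IPv4 or the port is 0.
fn build_server_url(ip: &str, port: u16) -> Result<String, String> {
    let addr: Ipv4Addr = ip
        .trim()
        .parse()
        .map_err(|_| format!("'{}' is not a valid IPv4 address", ip))?;
    if port == 0 {
        return Err("port must be between 1 and 65535".to_string());
    }
    Ok(format!("ws://{}:{}/ws", addr, port))
}

fn main() {
    // Same example address and port as in the tutorial text.
    match build_server_url("192.168.1.56", 8080) {
        Ok(url) => println!("{}", url),
        Err(e) => eprintln!("error: {}", e),
    }
}
```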
Verify the Connection
If you're running the EchoKit server locally, you should also see the following message in the server log, confirming that the connection was successful:
Talk With the EchoKit
Now, let’s start interacting with your EchoKit voice AI agent!
- Press the K0 Button to enter chat mode.
- When you see “Listening” on the screen, you’re ready to talk to EchoKit.
Since we’re using the ASR-LLM-TTS system, here’s how it works:
- ASR (Automatic Speech Recognition) will first transcribe what you say into text.
- The LLM (Large Language Model) will generate a response based on your input and the custom prompt you’ve set up.
- Finally, the TTS (Text-to-Speech) model will read the generated response back to you.
Because EchoKit chains these three models together, it might take a few moments to respond, but Groq's fast inference keeps the delay short (usually only a few seconds).
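The three-stage flow described above can be sketched in Rust as a chain of traits. This is a toy illustration under stated assumptions, not the EchoKit server's actual code: the trait names, the run_turn function, and the stub implementations are all hypothetical, and the stubs only shuffle bytes and strings instead of calling real AI services.

```rust
/// Each stage is a trait so a real service client could replace the stub.
trait Asr {
    fn transcribe(&self, audio: &[u8]) -> String;
}
trait Llm {
    fn reply(&self, system_prompt: &str, user_text: &str) -> String;
}
trait Tts {
    fn synthesize(&self, text: &str) -> Vec<u8>;
}

struct StubAsr;
impl Asr for StubAsr {
    fn transcribe(&self, audio: &[u8]) -> String {
        // Stub: treats the bytes as text. A real client would call a speech-to-text service.
        String::from_utf8_lossy(audio).into_owned()
    }
}

struct StubLlm;
impl Llm for StubLlm {
    fn reply(&self, system_prompt: &str, user_text: &str) -> String {
        // Stub: echoes the input. A real client would send both strings to an LLM.
        format!("[{}] You said: {}", system_prompt, user_text)
    }
}

struct StubTts;
impl Tts for StubTts {
    fn synthesize(&self, text: &str) -> Vec<u8> {
        // Stub: returns the text bytes. A real client would return audio samples.
        text.as_bytes().to_vec()
    }
}

/// One conversation turn: audio in, transcribed, answered, spoken.
fn run_turn(asr: &dyn Asr, llm: &dyn Llm, tts: &dyn Tts, prompt: &str, audio: &[u8]) -> Vec<u8> {
    let text = asr.transcribe(audio);
    let answer = llm.reply(prompt, &text);
    tts.synthesize(&answer)
}

fn main() {
    let audio_out = run_turn(&StubAsr, &StubLlm, &StubTts, "friendly", b"hello");
    println!("{}", String::from_utf8_lossy(&audio_out));
}
```

Because each stage waits for the previous one to finish, the total response time is roughly the sum of the three calls, which is why fast model hosting matters.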
What's Next
Now that you’ve built your own voice AI agent, the possibilities are endless. If you’re looking to explore even more features, here are some options to take your EchoKit experience to the next level:
- Explore the End-to-End Model: If you want to simplify the pipeline, you can use an end-to-end model such as Gemini with EchoKit. It collapses the entire ASR-LLM-TTS pipeline into a single step, making it even easier to interact with your AI agent. The trade-off is that the modular approach gives you more flexibility and control over each step, so feel free to experiment with both!
- Add Custom Actions with MCP: EchoKit also supports MCP (Model Context Protocol), which allows you to add custom actions to your voice AI agent. With MCP, you can control external devices or trigger specific events based on voice commands, opening up endless possibilities for automation and smart systems. Whether it’s controlling a smart home device or creating interactive experiences, MCP offers a powerful way to extend EchoKit’s capabilities.
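To give a feel for what a "custom action" means, here is a small Rust sketch of the general idea: the model asks for an action by name, and the program looks it up and runs it. This is not the MCP wire protocol or any EchoKit API. MCP itself is a JSON-RPC-based protocol with its own SDKs, and the ActionRegistry type and the set_light action below are hypothetical names invented for this example.

```rust
use std::collections::HashMap;

/// A custom action takes one text argument and returns a text result.
type Action = Box<dyn Fn(&str) -> String>;

/// Hypothetical registry that maps action names to handlers.
struct ActionRegistry {
    actions: HashMap<String, Action>,
}

impl ActionRegistry {
    fn new() -> Self {
        ActionRegistry {
            actions: HashMap::new(),
        }
    }

    fn register(&mut self, name: &str, action: Action) {
        self.actions.insert(name.to_string(), action);
    }

    /// Runs the named action, or returns None if nothing is registered under that name.
    fn call(&self, name: &str, argument: &str) -> Option<String> {
        self.actions.get(name).map(|action| action(argument))
    }
}

/// Builds a registry with one made-up smart-home action.
fn demo_registry() -> ActionRegistry {
    let mut registry = ActionRegistry::new();
    registry.register(
        "set_light",
        Box::new(|state: &str| format!("light turned {}", state)),
    );
    registry
}

fn main() {
    let registry = demo_registry();
    // Imagine the model decided that "turn on the light" maps to this call.
    println!("{:?}", registry.call("set_light", "on"));
    println!("{:?}", registry.call("open_door", "now"));
}
```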
EchoKit is an open-source platform, so there’s always room to customize and explore. You can modify the behavior of your AI agent, integrate new AI models, or even contribute to the community. Check out the EchoKit website and resources for more advanced tutorials and examples.