Create Your Own AI Voice Agent Using EchoKit, ESP32, and Rust

Have you ever wanted to build your very own voice AI Agent — one that actually talks back to you?

In this tutorial, we'll show you how to build a fun and interactive AI assistant using EchoKit, a powerful yet easy-to-build voice AI agent powered by ESP32.

As an open-source project, EchoKit not only lets you play with cutting-edge AI, but it also allows you to understand the underlying technology and modify it to suit your needs — perfect for classrooms, makerspaces, or personal AI projects.

In just a few minutes, you’ll have EchoKit talking back to you and understanding your commands — whether you’re a student, teacher, or maker passionate about exploring AI.

Supplies

You’ll need:

EchoKit board (ESP32-based, open-source AI hardware, available at echokit.dev)
USB-C cable
Laptop / PC (Windows, macOS, or Linux)
Wi-Fi connection
API keys and endpoint URLs for the ASR (Whisper), LLM, and TTS models

Assemble the EchoKit Device

When you receive your EchoKit device, you’ll find four key components:

The ESP32-S3 development board
The extension board, which includes the audio and microphone module
A mini speaker
A 1.54" LCD screen

The assembly process is simple — just follow these steps:

Insert the mini speaker into the audio module on the extension board.
Attach the ESP32-S3 development board to the extension board.
Plug the LCD screen into the designated slot at the top of the extension board.
That’s it — your EchoKit is now ready for the next step!

Flash the EchoKit Device

Now that your EchoKit is assembled, it’s time to flash the firmware.

Connect EchoKit to your computer using the included USB-C cable.
Use the ESP32 launchpad to easily flash the firmware. Open the launchpad, follow the instructions to “Connect” and then “Flash.”
After flashing, you should see a QR code and hear a welcoming voice — your EchoKit is ready to go!

Alternatively, you can use the espflash command-line tool to flash the firmware. Check out the details here.

Set Up the Server

Now comes the exciting part: setting up the server to power your voice AI agent!

The EchoKit server is responsible for managing communication between your device and AI services like Whisper (ASR), LLM, and TTS. It’s fully customizable — giving you control over your AI’s responses, voice, and more.

For a quick setup, you can use one of the servers provided by the EchoKit team and skip ahead to step 4 (Connect the Server and Device):
USA: ws://indie.echokit.dev/ws/
Asia: ws://hk.echokit.dev/ws/
If you prefer full customization, I recommend setting up your own server. You’ll be able to tweak every aspect of your AI’s behavior, from response generation to voice synthesis. You’ll need Rust installed to build it.

To start, you’ll need to download the server code. Open a terminal and run the following command:

git clone https://github.com/second-state/echokit_server.git

Once you’ve cloned the repository, open its config.toml file. In this file, you will configure the following:

ASR (Automatic Speech Recognition): The service the server sends your recorded audio to for transcription into text.
LLM (Large Language Model): The model used to generate AI responses.
TTS (Text-to-Speech): The model that converts text responses into speech.

For convenience, I recommend using Groq for all three services. I don’t expect you’ll need to pay anything for the usage in this project.

Below is an example configuration for Groq. You only need to replace gsk_xxx with your own API key.

addr = "0.0.0.0:8080"
hello_wav = "hello.wav"

[asr]
url = "https://api.groq.com/openai/v1/audio/transcriptions"
lang = "en"
api_key = "gsk_xxx"
model = "whisper-large-v3-turbo"

[llm]
llm_chat_url = "https://api.groq.com/openai/v1/chat/completions"
api_key = "gsk_xxx"
model = "llama-3.3-70b-versatile"
history = 1

[tts]
platform = "Groq"
api_key = "gsk_xxx"
model = "playai-tts"
voice = "Aaliyah-PlayAI"

[[llm.sys_prompts]]
role = "system"
content = """
# input your prompt here.
"""

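Before starting the server, it can help to confirm that no api_key entry still holds the gsk_xxx placeholder. The sketch below is a hypothetical helper, not part of echokit_server. It scans the config text line by line using only the Rust standard library rather than a real TOML parser, and it assumes each key sits on its own line as in the example config. The function name find_placeholder_keys and the sample text are illustrative.

```rust
/// Returns the 1-based line numbers of `api_key` entries that are empty
/// or still set to the tutorial's placeholder value "gsk_xxx".
fn find_placeholder_keys(config: &str) -> Vec<usize> {
    config
        .lines()
        .enumerate()
        .filter_map(|(idx, line)| {
            // Split `key = "value"` at the first '='; skip lines without one.
            let (key, value) = line.split_once('=')?;
            if key.trim() != "api_key" {
                return None;
            }
            // Strip surrounding whitespace and double quotes from the value.
            let value = value.trim().trim_matches('"');
            if value.is_empty() || value == "gsk_xxx" {
                Some(idx + 1)
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    // Inline sample for illustration; in practice you might load the file
    // with std::fs::read_to_string("config.toml").
    let sample = "[asr]\napi_key = \"gsk_xxx\"\n\n[llm]\napi_key = \"gsk_abc123\"\n";
    let flagged = find_placeholder_keys(sample);
    if flagged.is_empty() {
        println!("All api_key entries look filled in.");
    } else {
        println!("Placeholder api_key on line(s): {:?}", flagged);
    }
}
```

A line scanner like this is only a quick sanity check. It does not validate the rest of the TOML syntax.
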
Once you have configured these parameters, run the following commands to build and start the server.

# Enter the cloned repository
cd echokit_server

# Build the project
cargo build --release

# Enable debug logging
export RUST_LOG=debug

# Run the EchoKit server
target/release/echokit_server

If everything goes well, you’ll see output like this in your terminal:

[2025-10-13T09:37:13Z INFO echokit_server] Hello WAV: hello.wav

The server is now running. Next, let’s connect the device to it.

Connect the Server and Device

Before you begin, make sure the EchoKit server is running, either on your local machine or on a remote server. If you’re running it locally, complete the server setup in the previous step before proceeding.

Open https://echokit.dev/setup/ in a browser that supports Bluetooth, such as Chrome.
Click the “Connect to EchoKit” button to start pairing your EchoKit device.
Then, you’ll need to enter the following information:
Wi-Fi Name: Enter the name of your 2.4 GHz Wi-Fi network.
Wi-Fi Password: Enter the password for the Wi-Fi network.
Server URL: Use the format ws://192.168.1.56:8080/ws, replacing 192.168.1.56 with the IP address of your server and 8080 with the port your EchoKit server listens on. To use one of the EchoKit team’s servers instead, enter its URL:
USA: ws://indie.echokit.dev/ws/
Asia: ws://hk.echokit.dev/ws/
Apply the Settings
Press the K0 button to apply these settings and establish the connection.

Once you've done this, you’ll see the progress on the EchoKit screen, which will show steps like "Restarting Device", "Connecting to Wi-Fi", and "Connecting to Server." When the process is complete, you’ll hear a welcome voice, and the screen will display "Hello Set."
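The Server URL entered during setup follows a fixed pattern: the ws:// scheme, the server’s IP address, the port, and the /ws path. As a minimal sketch of that pattern, the hypothetical helpers below validate an IPv4 address and a port and assemble the URL. The function names are illustrative and are not part of EchoKit. The sketch handles only IPv4 addresses, not hostnames such as indie.echokit.dev.

```rust
use std::net::Ipv4Addr;

/// Formats the WebSocket URL in the shape the setup page expects,
/// for example ws://192.168.1.56:8080/ws.
fn server_url(ip: Ipv4Addr, port: u16) -> String {
    format!("ws://{}:{}/ws", ip, port)
}

/// Validates raw text input before building the URL.
/// Returns an error message if the IP address or port is malformed.
fn server_url_from_input(ip_text: &str, port_text: &str) -> Result<String, String> {
    let ip: Ipv4Addr = ip_text
        .trim()
        .parse()
        .map_err(|_| format!("invalid IPv4 address: {}", ip_text))?;
    let port: u16 = port_text
        .trim()
        .parse()
        .map_err(|_| format!("invalid port: {}", port_text))?;
    Ok(server_url(ip, port))
}

fn main() {
    match server_url_from_input("192.168.1.56", "8080") {
        Ok(url) => println!("Server URL: {}", url),
        Err(e) => eprintln!("{}", e),
    }
}
```

The port should match the one in the addr setting of your config.toml (8080 in the example configuration).
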

Verify the Connection

If you're running the EchoKit server locally, you should also see the following message in the server log, confirming that the connection was successful:

echokit_server::services::ws] 98a316f0bcc5:b24de72669964a08b2bd4b2d47c14d76 connected.

Talk With the EchoKit

EchoKit Demo: Recommend BBQ in a Texan accent

Now, let’s start interacting with your EchoKit voice AI agent!

Press the K0 Button to enter chat mode.
When you see “Listening” on the screen, you’re ready to talk to EchoKit.

Since we’re using the ASR-LLM-TTS system, here’s how it works:

ASR (Automatic Speech Recognition) will first transcribe what you say into text.
The LLM (Large Language Model) will generate a response based on your input and the custom prompt you’ve set up.
Finally, the TTS (Text-to-Speech) model will read the generated response back to you.
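
The three stages above can be pictured as a chain of function calls. The following is a minimal sketch of that flow, not EchoKit’s actual server code: the traits Asr, Llm, and Tts, the run_turn function, and the Fake* stubs are all hypothetical names, and the stubs replace the real network calls so the example runs on its own.

```rust
/// Speech-to-text stage (hypothetical trait, not EchoKit's API).
trait Asr {
    fn transcribe(&self, audio: &[u8]) -> String;
}

/// Response generation stage; receives the custom system prompt and the user text.
trait Llm {
    fn reply(&self, system_prompt: &str, user_text: &str) -> String;
}

/// Text-to-speech stage; returns audio bytes.
trait Tts {
    fn synthesize(&self, text: &str) -> Vec<u8>;
}

/// Runs one conversation turn: audio in, audio out.
fn run_turn(asr: &dyn Asr, llm: &dyn Llm, tts: &dyn Tts, system_prompt: &str, audio: &[u8]) -> Vec<u8> {
    let text = asr.transcribe(audio); // 1. transcribe speech
    let answer = llm.reply(system_prompt, &text); // 2. generate a response
    tts.synthesize(&answer) // 3. convert the response to speech
}

// Stubs standing in for real services: they treat the "audio" bytes as UTF-8 text.
struct FakeAsr;
impl Asr for FakeAsr {
    fn transcribe(&self, audio: &[u8]) -> String {
        String::from_utf8_lossy(audio).into_owned()
    }
}

struct FakeLlm;
impl Llm for FakeLlm {
    fn reply(&self, system_prompt: &str, user_text: &str) -> String {
        format!("[{}] You said: {}", system_prompt, user_text)
    }
}

struct FakeTts;
impl Tts for FakeTts {
    fn synthesize(&self, text: &str) -> Vec<u8> {
        text.as_bytes().to_vec()
    }
}

fn main() {
    let audio_out = run_turn(&FakeAsr, &FakeLlm, &FakeTts, "friendly", b"hello");
    println!("{}", String::from_utf8_lossy(&audio_out));
}
```

Keeping each stage behind its own interface mirrors the separate [asr], [llm], and [tts] sections of the config: any one service can be swapped without touching the other two.
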

Because EchoKit chains three models, each turn takes a moment to complete, but Groq’s fast inference usually keeps the response time to a few seconds.
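
To see where those few seconds go, you could time each stage separately. The sketch below is a hypothetical helper built only on std::time. The sleeps stand in for the real ASR, LLM, and TTS network calls, and their durations are made up for illustration.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Runs `stage` and returns its result together with how long it took.
fn timed<T>(stage: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = stage();
    (out, start.elapsed())
}

fn main() {
    // Each closure is a placeholder for one remote call.
    let (_text, asr) = timed(|| {
        sleep(Duration::from_millis(30));
        "what should I eat?"
    });
    let (_reply, llm) = timed(|| {
        sleep(Duration::from_millis(50));
        "Try brisket."
    });
    let (_audio, tts) = timed(|| {
        sleep(Duration::from_millis(40));
        vec![0u8; 4]
    });
    println!("ASR {:?}, LLM {:?}, TTS {:?}, total {:?}", asr, llm, tts, asr + llm + tts);
}
```
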

What's Next

Now that you’ve built your own voice AI agent, the possibilities are endless. If you’re looking to explore even more features, here are some options to take your EchoKit experience to the next level:

Explore the End-to-End Model: To simplify the process, you can use an end-to-end model such as Gemini with EchoKit. It collapses the entire ASR-LLM-TTS pipeline into a single step, making it even easier to interact with your AI agent. However, the end-to-end approach gives you less flexibility and control over each step than the modular pipeline, so feel free to experiment with both!
Add Custom Actions with MCP: EchoKit also supports MCP (Model Context Protocol), which lets you add custom actions to your voice AI agent. With MCP, the agent can call external tools in response to voice commands, for example to control a smart home device or trigger an event in an interactive experience. It is a powerful way to extend EchoKit’s capabilities.
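
For a sense of what a custom action looks like on the wire: MCP tool invocations are JSON-RPC 2.0 messages using the tools/call method. The sketch below builds such a request by hand with the standard library. It is not EchoKit’s code, and the tool name set_light and its state argument are hypothetical. The escaping handles only backslashes and double quotes, so a real client would more likely use a JSON library such as serde_json.

```rust
/// Escapes backslashes and double quotes so `s` can sit inside a JSON string.
/// Control characters are not handled in this sketch.
fn json_escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Builds a JSON-RPC 2.0 `tools/call` request with string-valued arguments.
fn tools_call_request(id: u64, tool: &str, args: &[(&str, &str)]) -> String {
    let arguments: Vec<String> = args
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", json_escape(k), json_escape(v)))
        .collect();
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"tools/call","params":{{"name":"{}","arguments":{{{}}}}}}}"#,
        id,
        json_escape(tool),
        arguments.join(",")
    )
}

fn main() {
    // Hypothetical tool and argument, e.g. triggered by "turn on the light".
    let request = tools_call_request(1, "set_light", &[("state", "on")]);
    println!("{}", request);
}
```
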

EchoKit is an open-source platform, so there’s always room to customize and explore. You can modify the behavior of your AI agent, integrate new AI models, or even contribute to the community. Check out the EchoKit website and resources for more advanced tutorials and examples.