Building an Offline Text-to-Speech System With ESP32

by ElectroScope Archive in Circuits > Electronics



esp32-text-speech.jpg

I've been tinkering with ESP32 boards for a while now, and one thing that always bugged me was how most text-to-speech projects depend on cloud services. You need WiFi, you deal with lag, and if your internet goes down, your project just sits there like a brick. So I decided to build something that works completely offline using the Talkie library and an ESP32.

The result? A standalone system that converts text to speech without any internet connection. It's not going to sound like Siri, but it's clear enough for alerts, notifications, or any project where you need audio feedback. Plus, the whole setup costs maybe $15 if you're buying everything new.

Supplies

  1. ESP32 development board (any version with DAC pins; the original ESP32 has them, but the ESP32-S3 and ESP32-C3 do not)
  2. PAM8403 audio amplifier module
  3. Small speaker (4-8 ohms)
  4. Breadboard
  5. Jumper wires (mix of male-to-male and male-to-female)
  6. USB cable for programming

How This Actually Works

Working-Flow-ESP32-Offline-TTS.jpg

The Talkie library uses something called Linear Predictive Coding (LPC), which is the same compression technique used in those old Speak & Spell toys from the late 70s and 80s. It's not high-fidelity, but it's incredibly efficient. The library stores hundreds of words as tiny LPC arrays that live right in the ESP32's flash memory.
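
If you're curious what "predictive" means here, this is a minimal sketch of the textbook LPC idea: each output sample is an excitation value plus a weighted sum of the samples that came before it. This is not Talkie's actual code. As far as I can tell the library emulates the original speech chip with a lattice-style filter, and the function name and coefficient values below are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Textbook all-pole LPC synthesis (illustrative, not Talkie's internals):
//   out[n] = excitation[n] + sum over k of coeffs[k] * out[n - 1 - k]
std::vector<float> lpcSynthesize(const std::vector<float>& coeffs,
                                 const std::vector<float>& excitation) {
    std::vector<float> out(excitation.size(), 0.0f);
    for (std::size_t n = 0; n < excitation.size(); ++n) {
        float sample = excitation[n];
        // Add the weighted contribution of samples already generated.
        // The k < n guard keeps the index n - 1 - k from going negative.
        for (std::size_t k = 0; k < coeffs.size() && k < n; ++k) {
            sample += coeffs[k] * out[n - 1 - k];
        }
        out[n] = sample;
    }
    return out;
}
```

Feeding it a single pulse with one coefficient of 0.5 gives a decaying tail (1, 0.5, 0.25, ...). A handful of coefficients like that per short frame is all that needs to be stored, which is why the word arrays are so small.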

When you type a sentence into the Serial Monitor, the ESP32 breaks it into individual words, matches each word against its vocabulary, and outputs an analog audio signal through one of its DAC pins. That signal is far too weak to drive a speaker directly, which is where the PAM8403 amplifier comes in. It boosts the signal to something you can actually hear.


Wiring Everything Together

Circuit-Diagram-of-esp32-TTS.jpg
Breadboard-Connection-of-ESP32-TTS-Converter.jpg

The connections are straightforward. The ESP32 has two DAC pins - GPIO25 and GPIO26. I'm using GPIO25 for this build, but either one works fine.

ESP32 to PAM8403 connections:

  1. GPIO25 → R input on PAM8403 (the audio signal)
  2. 5V → VCC on PAM8403
  3. GND → GND on PAM8403

PAM8403 to Speaker:

  1. R+ → Speaker positive
  2. R- → Speaker negative

I built mine on a breadboard first to test everything. Once you verify it works, you could solder it to a protoboard if you want something more permanent. The PAM8403 modules usually come with a volume potentiometer already on the board, which is handy for adjusting the output level.

One thing I learned the hard way: make sure your speaker impedance matches what the PAM8403 expects. I tried using a random speaker I pulled from an old radio, and it sounded terrible until I checked the specs and found it was 16 ohms. Stick with 4-8 ohm speakers and you'll be fine.

Setting Up the Software

First, grab the Talkie library from GitHub: https://github.com/ArminJo/Talkie

Install it through the Arduino IDE library manager, or download it manually and drop it in your libraries folder. The library comes with a vocabulary file called Vocab_US_Large.h that has a bunch of pre-recorded words already encoded as LPC data.

Here's the basic structure of the code. I'll walk through the important parts:

#include <Talkie.h>
#include "Vocab_US_Large.h"

Talkie voice;

void setup() {
  Serial.begin(9600);  // must match the Serial Monitor baud rate
}

The voice object is what actually handles playing back the audio. Every time you call voice.say(), it processes LPC data and sends it out through the DAC pin.

The tricky part is mapping text to the LPC arrays. The library doesn't convert arbitrary text to phonemes - it only knows words you explicitly define. So I created a simple dictionary structure:

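struct WordMap {
  const char* text;          // word as typed in the Serial Monitor
  const unsigned char* lpc;  // matching LPC data from the vocabulary file
};

WordMap words[] = {
  {"ZERO", sp2_ZERO},
  {"ONE", sp2_ONE},
  {"TWO", sp2_TWO},
  // ... add more words here
};

// Number of entries in the table, used by the lookup loop
const int wordCount = sizeof(words) / sizeof(words[0]);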

When you type "ONE", the system looks up sp2_ONE from the vocabulary file and plays it. This means you're limited to words that have LPC data available, but the included vocabulary has most common words you'd need for basic projects.

Making It Speak

The actual speech function does a case-insensitive search through the dictionary:

void speakWord(const char* w) {
  for (int i = 0; i < wordCount; i++) {
    if (strcasecmp(w, words[i].text) == 0) {
      voice.say(words[i].lpc);
      return;
    }
  }
  Serial.print("Word not found: ");
  Serial.println(w);
}

If it finds a match, it plays the word. If not, it tells you in the Serial Monitor. This feedback is actually really useful when you're testing because you'll know immediately if a word isn't in your vocabulary.
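
If you want to check the lookup logic without any hardware attached, here's a sketch of the same idea with the search split out from the playback. The demoZero and demoOne arrays are placeholders I made up so it compiles without the Talkie headers; in the real project they would be the sp2_ arrays from the vocabulary file. The findWord name is also mine.

```cpp
#include <cstddef>
#include <cstdint>
#include <strings.h>  // strcasecmp (POSIX)

// Placeholder LPC data so this compiles without Talkie (NOT real speech data).
static const uint8_t demoZero[] = {0x00};
static const uint8_t demoOne[]  = {0x01};

struct WordMap {
    const char* text;
    const uint8_t* lpc;
};

static const WordMap words[] = {
    {"ZERO", demoZero},
    {"ONE",  demoOne},
};

// Let the compiler count the entries so the count never drifts from the table.
static const std::size_t wordCount = sizeof(words) / sizeof(words[0]);

// Returns the LPC data for a word (ignoring case), or nullptr if it is missing.
const uint8_t* findWord(const char* w) {
    for (std::size_t i = 0; i < wordCount; ++i) {
        if (strcasecmp(w, words[i].text) == 0) {
            return words[i].lpc;
        }
    }
    return nullptr;
}
```

With this split, the speaking function just calls findWord() and either passes the result to voice.say() or prints the "Word not found" message when it gets nullptr back.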

The main loop handles parsing sentences. It reads input from Serial Monitor, splits it into words, and speaks each one:

void loop() {
  if (Serial.available()) {
    String line = Serial.readStringUntil('\n');
    line.trim();
    line.toUpperCase();
    int len = line.length();
    int start = 0;
    for (int i = 0; i <= len; i++) {
      // A space or the end of the line closes the current word
      if (i == len || line[i] == ' ') {
        if (i > start) {  // skip empty tokens from repeated spaces
          String w = line.substring(start, i);
          speakWord(w.c_str());
        }
        start = i + 1;
      }
    }
  }
}

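I convert everything to uppercase because the dictionary entries are all caps. Strictly speaking, strcasecmp() already ignores case, so the conversion is redundant, but it keeps the "Word not found" messages consistent.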
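
The splitting step is easy to get subtly wrong (double spaces, stray carriage returns), so here's a standalone sketch of it in plain C++ that you can test on a PC. It uses std::string instead of the Arduino String class, and the splitWords name is my own. Treat it as one possible way to do it, not the exact code from this build.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Splits a line into upper-case words, skipping any run of whitespace.
std::vector<std::string> splitWords(const std::string& line) {
    std::vector<std::string> result;
    std::string current;
    for (char c : line) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            // Whitespace ends the current word, if there is one.
            if (!current.empty()) {
                result.push_back(current);
                current.clear();
            }
        } else {
            current += static_cast<char>(
                std::toupper(static_cast<unsigned char>(c)));
        }
    }
    // Flush the last word when the line does not end in whitespace.
    if (!current.empty()) {
        result.push_back(current);
    }
    return result;
}
```

Each returned word can then be handed to the lookup function one at a time.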

Testing Your Build

Text-to-Speech-Serial-Monitor-Display.jpg

Upload the code and open the Serial Monitor at 9600 baud. Type something like "START MACHINE" or "POWER ALERT" (assuming you've added those words to your dictionary) and hit enter. The ESP32 should speak each word through the speaker.

The first time I heard it work, the voice quality surprised me. It's definitely robotic - kind of like those automated phone systems from the 90s - but every word is clear and understandable. For alerts or status messages, it's perfect.

If a word doesn't work, you'll see "Word not found" in the Serial Monitor. That means either you need to add that word to your dictionary or find an alternative that's already there.

Common Problems I Ran Into

No sound at all: Double-check your GPIO25 connection. I had a loose wire the first time and couldn't figure out why nothing was happening. Also verify the PAM8403 is getting 5V power.

Distorted or crackling audio: This usually means either your speaker impedance is wrong or the volume is cranked too high. Try turning down the PAM8403's potentiometer. I keep mine around 50% and it sounds clean.

Words getting cut off: The Talkie library needs a tiny pause between words. If you're programmatically generating lots of speech, add a small delay after each word. I found that 100ms works well: delay(100);

Very quiet output: Make sure you're using the amplifier. The raw DAC output from GPIO25 isn't nearly loud enough on its own.

Expanding the Vocabulary

The included vocabulary is decent but you might want words it doesn't have. There are tools online to convert WAV files to LPC format, though I haven't tried them myself. The easier approach is to work around the available words. For example, instead of saying "TEMPERATURE HIGH", you could say "TEMP ALERT" or "HOT WARNING" using words you know exist.

You can also combine multiple words to create compound phrases. I made mine say "SYSTEM READY" when it boots up, and "ERROR" followed by a number for different fault codes.
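
For the fault code idea, one approach is to spell the number out digit by digit and feed the resulting phrase through the same word-by-word parser. This is a rough sketch with helper names I made up. It assumes ERROR and the digit words ZERO through NINE are all in your dictionary.

```cpp
#include <string>

// Spells out a non-negative number digit by digit, e.g. 42 -> "FOUR TWO".
std::string digitsToWords(unsigned int value) {
    static const char* const names[] = {"ZERO", "ONE", "TWO", "THREE", "FOUR",
                                        "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"};
    // Start with the last digit, then prepend the remaining ones.
    std::string phrase = names[value % 10];
    value /= 10;
    while (value > 0) {
        phrase = std::string(names[value % 10]) + " " + phrase;
        value /= 10;
    }
    return phrase;
}

// Builds the full announcement for a (hypothetical) numeric fault code.
std::string errorPhrase(unsigned int code) {
    return "ERROR " + digitsToWords(code);
}
```

Reading digits individually keeps the vocabulary requirement down to eleven words, at the cost of sounding a bit more mechanical than saying "forty-two".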

Practical Uses

I've built a couple of projects with this setup now. The first one announces when my 3D printer finishes - way better than checking every 10 minutes. The second is a shop safety monitor that calls out when measurements are out of spec.

The key is thinking about it as a notification system rather than a general-purpose voice interface. You're not going to have conversations with it, but for specific status updates or warnings, it beats looking at an LCD screen any day.

The fact that it works completely offline is huge for anything installed where WiFi is spotty or non-existent. I installed one in my garage workshop and don't have to worry about it dropping connection or dealing with cloud service outages.

Final Thoughts

This ESP32 text-to-speech project seemed more complicated at first than it turned out to be. Once you understand that the Talkie library is just playing back pre-recorded LPC data, it all makes sense. You're essentially building a very small, very specific voice synthesizer.

The hardware is dead simple - just three components plus the ESP32. The code is straightforward once you get the dictionary concept. And the end result is something genuinely useful that doesn't need internet access to work.

If you're building something that needs audio feedback, this is worth trying before going the WiFi + cloud TTS route. It's faster, more reliable, and honestly kind of fun hearing your ESP32 talk for the first time.