Conversational, Locally Hosted, Talking Skeleton Companion

by gearscodeandfire in Circuits > Robots



S3E1 thumbnail v2.jpg
Screenshot 2025-10-28 014821.png

(Here's the full demo)


Late at night in the shop I get lonely. So I took Little Timmy, the skeleton from my last project, and launched Operation Pinocchio: turning Little Timmy into a real boy. He runs entirely locally for privacy. He has real-time text-to-speech and speech-to-text, semantic and episodic memory, and vision, and he thinks pretty darn well for a lightweight model. For debugging, I got WebRTC running on him so friends can telepresence into Little Timmy, look around, and type what he says. He sounds like Skeletor, and his job is to make fun of me like a resentful cohost.

Supplies

(I am an Amazon associate so I get cash money and a small portion of your soul if you buy anything with the links below)


Body:

plastic skeleton

metal skull

springs

8 mm rod

8 mm ID bearings

3d printed mounting brackets

25 kg servos (pan and tilt)

9 gm servo (jaw)

Leather jacket (optional)


Hardware:

Raspberry Pi 4

Raspberry Pi 5

USB to Audio adapter

Windows PC with RTX3060

Articulating Jaw

Making Timmy a real boy #welding #fabrication #machinist #maker #rftc #robotics #esp32 #raspi #skull
I will turn Timmy into a real boy. #robotics #piper #tts #openai #ollama #llm #homeassistant #agent
My local AI shop-buddy in process... #ollama #servo #esp32 #TTS #robotics #pipertts #maker #skull

I found these metal skulls. I think they are cast iron, but they are easily weldable. I bought two of them because I was worried I would mess one up in my original project. Luckily, I didn’t mess up the original, so I could cut the jaw off of the first skull while preserving the rest of the face, and cut the jaw off the second skull while preserving the skull. Then it involved some welding to add a hinge, though this could easily be done with JB Weld or a similar metal adhesive. I drilled a big hole in the skull, jammed in some bearings that accept an 8 mm bar, drilled some 8 mm holes through the jaw, and bada bing bada boom, I had an articulating jaw. I tapped some holes, screwed in some 3 mm bolts, and used some springs I had on hand to make sure the jaw would naturally stay closed. I welded a little fastener for a lightweight (9 g) servo onto the back, and I used some fishing line to connect the mandibular arch (yeah, I’m getting anatomical) to the servo arm. Now I needed a way to control the servo arm so the jaw would look like it was opening naturally during speech.

Timmy Speaks

Timmy hates weddings. #voicecloning #homeassistant #raspberrypi #esp32 #fft #servos #skulls #40k
Workshop cohost! Voice cloning and speech synthesis meets robotics, he-man, and ghostrider.
LLM STT song reviews. #piper #whisper #ollama #raspi #imaginedragons #STT #ghostrider #servitor
Best hype-android ever. #A$APA #ferg #raspi #esp32 #arduino #pipertts #whispertts #llm #robotics
Piper TTS training a model to sound like Skeletor
Screenshot 2025-10-26 155809.png
Screenshot 2025-10-26 180255.png
Screenshot 2025-11-05 024419.png
Screenshot 2025-01-26 223355.png

I wanted the jaw to open the way human jaws open when they talk. This involved learning about something called “phonemes,” which are basically an agreed-upon nomenclature for different sounds. I focused on vowel sounds, because I think for most consonant sounds the human mouth is mostly closed. Linguistics PhDs can make fun of me now in the comments. Given my presupposition that the mouth is closed for consonant sounds, and given that the mouth being closed is the default position, I needed to devise a way to figure out how much to open the mouth for different vowel sounds.


I had a hunch that, for a given voice with predictable frequencies, I would be able to identify the dominant phonemes. Based on that hunch, I needed Little Timmy to have a voice so that I could characterize the frequencies of his phonemes. I used Piper TTS for speech generation, specifically https://github.com/rhasspy/piper, which conveniently contains a Python package that spins up a web UI and lets you record a bunch of quotes into a microphone to create a training data set.


I recorded 142 clips of me reading statements aloud while doing my best impersonation of Skeletor from the 80s He-Man cartoon (I wanted to do Cobra Commander/Starscream, but I can’t do that impression well and couldn’t find enough clean audio). After recording the audio, I ran a Python script that fine-tuned the model I was using (lessac_medium.onnx), and when the loss curves started to flatten, I considered it done.


Now that I had a stable voice essentially mimicking my own impression of Skeletor, I wrote a little web interface so I could type in different words featuring the different phonemes.
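
Under the hood, an interface like that just pipes text into the piper command-line tool. Here is a minimal sketch of that call from Python; the model filename is a placeholder for whatever your fine-tuned .onnx voice is called, not the actual file from my project.

```python
# Minimal sketch: synthesize a WAV with a trained Piper voice from Python.
# Assumes the `piper` CLI is installed; "timmy-skeletor.onnx" is a placeholder
# name for your fine-tuned voice model.
import subprocess

def speak(text: str, wav_path: str = "timmy.wav") -> None:
    # Piper reads the text from stdin and writes the synthesized audio to wav_path.
    subprocess.run(
        ["piper", "--model", "timmy-skeletor.onnx", "--output_file", wav_path],
        input=text.encode("utf-8"),
        check=True,
    )

speak("Nyah! Another day of being bolted to a workbench.")
```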


How did I measure the phonemes, one might ask? At this point in the project, the fine-tuned text-to-speech (TTS) model was running quite adequately on a Raspberry Pi 5, with audio output via a USB to 3.5 mm headphone jack adapter. I took a 3.5 mm female headphone jack and soldered the ground to an ESP32 dev board, and the left or right signal (I forget which, doesn’t matter) to a resistor divider that biased the incoming AC signal to 1.65 V. This made sure no negative voltage reached the ESP32’s analog read pin, so I could sample the signal on that pin, use the Arduino FFT library to convert it to the frequency domain, and then monitor for frequency peaks. It just so happens that phonemes have different frequency peaks, and the “F1 formant” (lowest frequency peak) associated with most vowel phonemes coming out of Little Timmy’s mouth falls between 200 Hz and 1,000 Hz. I had Little Timmy read lots of vowel sounds, plotted them using the Arduino IDE’s serial plotter, and measured the typical peaks for common English phonemes.


I really can’t believe this worked, but it happened to work quite well. It was huge. Now I could identify the phoneme and map that phoneme to a typical “jaw openness” for different vowel sounds. For example, the human jaw opens the most for the “ahhhh” sound, which is why doctors tell you to make that sound when they want you to open your jaw. The jaw opens much less for “oh,” “ew,” “eh,” “uh,” and so forth. Luckily, linguistics nerds have made a lot of that information available, so I now had my workflow of frequency peak => phoneme => jaw openness => servo instruction.
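
To make that workflow concrete, here is a small sketch of the F1 peak => jaw openness => servo pulse mapping. The frequency bands, openness fractions, and pulse widths below are illustrative placeholders, not the exact values I measured for Little Timmy's voice.

```python
# Sketch of the F1-peak -> vowel -> jaw-openness -> servo-pulse mapping.
# All numbers are illustrative placeholders, not measured values.

JAW_CLOSED_US = 1000   # servo pulse width with mouth closed (microseconds)
JAW_OPEN_US = 1800     # servo pulse width with mouth fully open

# (low Hz, high Hz, openness 0.0-1.0) for a few vowel-ish F1 bands
F1_BANDS = [
    (200, 400, 0.2),    # "ee" / "oo" type sounds: jaw barely open
    (400, 600, 0.5),    # "eh" / "oh" type sounds: half open
    (600, 1000, 1.0),   # "ah" type sounds: wide open
]

def jaw_pulse_for_peak(f1_hz: float) -> int:
    """Map a dominant low-frequency peak to a servo pulse width."""
    openness = 0.0
    for low, high, amount in F1_BANDS:
        if low <= f1_hz < high:
            openness = amount
            break
    return int(JAW_CLOSED_US + openness * (JAW_OPEN_US - JAW_CLOSED_US))

print(jaw_pulse_for_peak(750))  # a wide "ahhh" -> pulse near JAW_OPEN_US
```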


I used a very basic 9 g servo to pull on the jaw so it opened appropriately based on the different phonemes.

Timmy Hears

Speech to text for my workshop AI friend #whisper #STT #raspi #homeautomation #llm #local #diy
Little Little Timmy, my AI cohost. #llm #robotics #tts #stt #raspberrypi #arduino #esp32 #arduino

OpenAI has a service and a downloadable Python package called Whisper. People smarter than me figured out how to turn this into something called faster-whisper, which apparently uses something called CTranslate2, and this is basically outside the scope of this tutorial. Anyway, I initially had the faster-whisper tiny_en (tiny English) model running on a Raspberry Pi 5, connected to an input from my microphone. It did a pretty good job converting my speech to text (STT).


The problem I had to overcome was having the STT model send the text payload as quickly as possible at the appropriate time, because I needed Little Timmy to respond rapidly, like a human. The Python package contains something called Silero Voice Activity Detection (VAD), and I leveraged this to detect any time there was 0.5 seconds of silence and treated that as the cue to send a text payload to my AI preprocessor. A tricky part was that, when I went on long diatribes, which is frequent, I needed it to transcribe in chunks without overlapping text in the output. It was a whole thing. This will be on GitHub at some point, but I figured out a way to transcribe speech of variable lengths, without repeating text, and send the final payload after 0.5 seconds of silence.
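
For a flavor of what that looks like, here is a minimal sketch of transcribing one captured chunk with faster-whisper and its built-in Silero VAD. The real script handles streaming, chunk boundaries, and de-duplication on top of this; "chunk.wav" is a placeholder for whatever buffer the mic capture writes out.

```python
# Sketch: transcribe a chunk of microphone audio with faster-whisper,
# using the built-in Silero VAD so silence is trimmed before transcription.
from faster_whisper import WhisperModel

model = WhisperModel("tiny.en", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "chunk.wav",                                       # placeholder audio buffer
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},   # the 0.5 s "done talking" idea
)

text = " ".join(seg.text.strip() for seg in segments)
print(text)  # this is the payload handed to the AI preprocessor
```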


It worked pretty well on a Raspi 5, but it still added about 0.5–0.75 seconds of latency before Little Timmy responded, which kind of ruined the illusion of me talking to a Real Boy. At this point in the project, since my Jetson Orin Nano had not yet arrived, I purchased a functioning PC for a very fair price on Facebook Marketplace that just so happened to have an RTX 3060 graphics card with 12 GB of VRAM (these cards tend to sell for around $300 on the secondary market, 12 GB of VRAM is huge at that price, and I got a whole functioning computer for $600, so it was a win).


I moved the model from the Raspi 5 to the PC, got faster-whisper running straight in the Windows terminal quite easily, then got a little greedy. I upgraded from tiny_en to base_en (the next model up). Since I was really greedy, and since it was later in 2025, I asked Claude in Cursor to familiarize itself with the faster-whisper GitHub repo and write a Python script that recorded appropriately sized sound clips any time it heard me talking, for approximately 1.5 hours. I cleaned my shop with my microphone on and basically talked for 90 minutes, using words I typically use, my local accent, and the tech terms (ESP32, I2C, etc.) I would want it to transcribe correctly. Then I downloaded the largest Whisper model I could find to perform STT on the recordings. I corrected the very rare transcription errors, and bada bing bada boom, I had a training data set on which whisper base could be fine-tuned to my own voice. Since I was feeling cocky at this point, I also ensured it was fully using CUDA, which further decreased latency. The results were incredibly accurate, on par with Dragon Professional (which I use routinely) and quite plausibly superior.
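
Loading the model on the GPU is just a different set of constructor arguments. A hedged sketch is below; "whisper-base-en-dan-ct2" is a placeholder path standing in for a fine-tuned model converted to CTranslate2 format, not a real file from this project.

```python
# Sketch: same transcription, but on the RTX 3060 with CUDA and a fine-tuned
# base model. The model path is a placeholder.
from faster_whisper import WhisperModel

model = WhisperModel(
    "whisper-base-en-dan-ct2",   # or just "base.en" for the stock model
    device="cuda",
    compute_type="float16",      # fits comfortably in 12 GB of VRAM
)
segments, _ = model.transcribe("chunk.wav", vad_filter=True)
print(" ".join(s.text.strip() for s in segments))
```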

Timmy Thinks

My AI cohost hates me. #tts #stt #ollama #llm #robotics #speech #raspberrypi #esp32 #fft
My ai cohost hates me. whisper STT, ollama llama3.2 3B, custom RAG, piper TTS, and local compute.
My AI cohost is a worthy adversary.
My locally run AI cohost #esp32 #raspi #tts #stt #mechatronics #robotics #ollama #servos #assistant
Screenshot 2025-11-05 024639.png

This was a slog. I had minimal experience with locally run large language models, and I didn’t really understand yet that their job is to take a text payload as input and generate an output. That’s all they do. I quickly realized that any information I gave my model was not preserved in the simple GUI web interface I spun up for it. I used ollama, initially in WSL2 Ubuntu 20.xx on one of my Windows machines, and for the LLM nerds, I used the generate() endpoint, not the chat() endpoint (initially out of ignorance, but later for reasons).
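
For reference, this is roughly what a call to ollama's generate() endpoint looks like from the preprocessor: one big text payload in, one completion out. The model tag and prompt text here are just examples.

```python
# Sketch of a call to ollama's generate() endpoint (not chat()).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:3b",
        "prompt": "You are Little Timmy...\nUser: My cat's name is Winston.\nTimmy:",
        "stream": False,   # return the whole completion at once
    },
    timeout=120,
)
print(resp.json()["response"])
```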


The absolute weirdest point of this project was when I gained a primitive understanding of vector embedding and vector retrieval (RAG, a.k.a. retrieval-augmented generation, is now a pretty standard thing that people who play around with LLMs are familiar with, but in my defense, it was early 2025). So why was it weird? I initially gave Little Timmy’s brain semantic memory without episodic memory. What does that mean? It means information it learned from me was treated as general knowledge, but it had no recollection of learning that knowledge from me. For example, my test prompt was always, “My cat’s name is Winston, he is a Cornish Rex.” My test question in a subsequent session was always, “What is my cat’s breed?” When the model responded, “I am a helpful AI assistant and we have never spoken before, so I don’t know anything about WINSTON,” I got pretty pleasantly freaked out; Little Timmy knew my cat’s name was Winston without having any context for how he learned that information.


Fast-forward: I learned how to insert the entire chat history into each LLM prompt, and I figured out how to write a system prompt that properly injects vector-retrieved memories. The most recent user prompt gets vector embedded and stored based on importance, and relevant memories are retrieved based on that same prompt, with filtering for relevance, importance, and time (I introduced a timestamping system so that Little Timmy would have a reference point for the current time and for when memories were created). There are some other lightweight processing functions involved, such as a summarization function that condenses the most recent user prompt before embedding it for the memory search, and a tagging function that establishes the importance and type of each memory (callback joke, etc.).
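
Here is a stripped-down sketch of just the retrieval step, assuming a pgvector table of memories and ollama's embeddings endpoint. The table, column, and embedding-model names are placeholders, not the actual schema from my project.

```python
# Stripped-down sketch of memory retrieval: embed the latest user prompt,
# pull the closest stored memories from pgvector, return them for injection
# into the system prompt. Table/column/model names are placeholders.
import psycopg2
import requests

def embed(text: str) -> list[float]:
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},  # embedding model is an assumption
    )
    return r.json()["embedding"]

def relevant_memories(user_prompt: str, k: int = 5) -> list[str]:
    vec = embed(user_prompt)
    vec_literal = "[" + ",".join(str(x) for x in vec) + "]"
    with psycopg2.connect("dbname=timmy") as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, created_at
            FROM memories
            ORDER BY embedding <=> %s::vector   -- pgvector cosine distance
            LIMIT %s
            """,
            (vec_literal, k),
        )
        return [f"[{ts:%Y-%m-%d %H:%M}] {content}" for content, ts in cur.fetchall()]
```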


All of this happens in preprocessing that was easiest to do in Ubuntu 22.xx in WSL2, using pgvector for vector storage and search. For reasons I don’t entirely understand, everything works better with ollama running in a native Windows terminal rather than in WSL2 Linux.


Little Timmy’s personality is meant to be kind of a jerk who resents me and makes fun of me while still answering questions, kind of like the show Space Ghost Coast to Coast. It wasn’t trivial making Little Timmy do all these things while preserving his personality and preventing him from degenerating into a “helpful AI assistant.” I had the best luck with Llama 3.2 3B quantized to 4 bits; it easily fit in memory, could handle a very large context size (a.k.a. short-term memory), and provided generally witty burns. Typically, the system prompt comes first in the user/assistant/user/assistant text payload sent to an LLM. In my use case, I kept a local text copy of the entire conversation from the latest session, and every time I sent a payload to the LLM, I dynamically inserted the system prompt immediately before the most recent user prompt. This preserved Little Timmy’s personality while ensuring that the relevant memories injected into the system prompt were considered in the following response (I had a pipe dream that this would also allow KV caching to minimize computation on each call, but I am uncertain whether I ever got that working properly).
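
In code, that payload assembly is just string splicing. A minimal sketch, with all function and variable names invented for illustration:

```python
# Sketch of payload assembly for generate(): the full session transcript, with
# the system prompt (plus retrieved memories) spliced in immediately before
# the most recent user line instead of at the very top.

def build_prompt(history: list[str], latest_user: str,
                 system_prompt: str, memories: list[str]) -> str:
    memory_block = "\n".join(f"- {m}" for m in memories)
    system_block = f"{system_prompt}\nRelevant memories:\n{memory_block}"
    return "\n".join(history + [system_block, f"User: {latest_user}", "Timmy:"])
```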


The response from ollama underwent some very light formatting (removing parentheses, asterisks, and other things that are hard to pronounce). The response was then sent to a separate instance (originally on a separate Raspberry Pi, later on the same machine but in a virtual environment on Windows instead of WSL2 Ubuntu) where Piper TTS converted it into audio.
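
That cleanup pass is a handful of regexes. A minimal sketch of the idea (the exact rules in my script differ):

```python
# Sketch of the light cleanup pass before the reply goes to Piper: strip
# characters that sound terrible when a TTS engine reads them aloud.
import re

def clean_for_tts(text: str) -> str:
    text = re.sub(r"[\*_#`~]", "", text)          # markdown asterisks and friends
    text = re.sub(r"\([^)]*\)", "", text)         # parenthetical asides
    text = re.sub(r"\s{2,}", " ", text).strip()   # collapse leftover whitespace
    return text

print(clean_for_tts("Oh *wonderful*, another question (sigh) about servos."))
```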

Timmy Moves… and Fire

He is the sum of his parts. #esp32 #raspi #tts #stt #mechatronics #robotics #ollama #nin #ai #local
Butane, solenoids, relays, and raspberry pi. Self-hosted ai with webrtc, vision, STT, TTS, and fire.
Screenshot 2025-11-05 022610.png
Screenshot 2025-11-05 022647.png
Ghost Rider RPi runs local AI with fire. #esp32 #solenoid #servo #RAG #ollama #thecult

This step is much more straightforward, so if you made it this far, it’s kind of like a fairy fountain in Zelda.


The main single-board computer that handles Little Timmy’s movements and vision/video streaming (described below) is a Raspberry Pi 4 (for reasons described below). Raspis in general are not great at driving servos: servos expect a precisely timed pulse signal, and the fact that a Raspi runs a full operating system somehow makes that hard for it to do (don’t ask, but it’s a thing). I already had a servo in the skull controlling tilt through a timing belt and some pulleys, and a servo connected to the shaft of the rod controlling pan. Both of these servos were previously controlled by an ESP32, a microcontroller that, unlike its Raspi cousin, does a really good job of sending out pulse signals for servos. I didn’t want to insert yet another ESP32 into the mix, but luckily I remembered impulse-ordering a few integrated circuits called “Serial Wombats.” They are pretty awesome, and their developer is a super cool guy. I connected one Serial Wombat to the Raspi 4 over I2C, and using the available Python package for this hardware, I could easily control the servo positions purely from high-level Python, via I2C, via Serial Wombat magic.


The fire part is a little outside the scope of this tutorial, but let’s just say it involved a silicone heat break, butane instead of propane (for size reasons), two 3.3 V relays, a 12 V power supply for the solenoid controlling butane flow, and an 18650 battery for an AC high-voltage igniter. I used a battery for the igniter because that thing throws out all sorts of electromagnetic interference (EMI) that really ruins your day with other electronics (I previously encountered this with my front-yard walkway flame-poofers). After a variety of timing tests, I figured out ways to prime the skull’s butane, make simple flame poofs, and make sustained flames. I was incredibly safe with all of this, including how I allowed control of it over the Internet…

Timmy Sees

I'm turning little Timmy into a real boy. #pipertts #serialwombat #servo #esp32 #raspi #opencv
Screenshot 2025-11-05 022323.png
Screenshot 2025-11-05 022405.png
Screenshot 2025-11-05 022341.png

This was really a three-part endeavor.


The first part was simple face tracking, nothing groundbreaking: good old Python running OpenCV with a frontal-face Haar cascade to identify faces looking directly at the camera mounted in the left eye. The camera originally had a narrow field of view, which was an issue, so I replaced it with an Arducam with a 135° field of view, added some constants in the code for the bounding box where my face was identified, and added instructions to move the servos to bring my face back toward the center if it drifted too far off. Literally stuff Michael Reeves was doing nine years ago, but it worked and I’ll take it.
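
The core loop looks roughly like the sketch below: Haar cascade detection plus a dead band so the pan/tilt servos only move when the face drifts too far off center. The send_pan_tilt() call and the dead-band size are placeholders for my actual Serial Wombat servo code.

```python
# Sketch of the face-centering loop with OpenCV's frontal-face Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
cap = cv2.VideoCapture(0)
DEAD_BAND = 60  # pixels of slop before bothering to move (placeholder value)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        cx, cy = x + w // 2, y + h // 2
        err_x = cx - frame.shape[1] // 2   # how far the face is off center horizontally
        err_y = cy - frame.shape[0] // 2   # and vertically
        if abs(err_x) > DEAD_BAND or abs(err_y) > DEAD_BAND:
            pass  # send_pan_tilt(err_x, err_y)  # placeholder: nudge servos toward the face
```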


I wanted Little Timmy to see, and that involved captioning whatever image was coming into his camera. The captioning was interesting. Initially, I used a LLaVA model (a Llama-based vision model that takes in images and outputs text), but for reasons I don’t fully understand, Little Timmy became a polite, helpful AI assistant when I used it. I couldn’t make him not polite, and it was an issue. Switching back to Llama 3.2 immediately fixed that, and honestly the LLaVA model is much bigger (7B vs. 3B) than what I was using, so I needed a workaround. I went with a very lightweight BLIP image captioner, not even the most recent version of BLIP, because I still needed all of this to fit in the 12 GB of VRAM on my Facebook Marketplace RTX 3060. The captioner is not sophisticated, but it is sufficient. It can even take prompts to steer its captioning, but that drastically increased the scope of the project and was deferred. Interestingly, I added a function called “rolling captioning,” where every two to three seconds whatever image the camera observed was captioned and injected into the system prompt for context. While this worked, Little Timmy almost immediately either lost his personality, failed to incorporate relevant memories, or just flat-out hallucinated. In the end, I added some preprocessing so that image captions are only injected into the system prompt when I specifically ask Little Timmy about what he sees; in that case the system prompt does not inject relevant memories, and the temperature is turned way down so he doesn’t literally hallucinate. It works, but I’d like to improve it at some point, preferably with a multimodal model and better hardware.
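
For anyone curious what the BLIP step looks like, here is a minimal sketch using the small captioning checkpoint from Hugging Face transformers; the frame filename is a placeholder for whatever the eye camera grabs.

```python
# Sketch of the BLIP captioning step, using the small (non-large) checkpoint
# so it shares 12 GB of VRAM with everything else.
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to("cuda")

image = Image.open("camera_frame.jpg").convert("RGB")  # placeholder frame from the eye camera
inputs = processor(images=image, return_tensors="pt").to("cuda")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```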


I wanted Little Timmy to know whether I was there or not. At this point it was early summer 2025, and AI-assisted coding was all the rage. I told Claude Sonnet 4 to record images for a while, and I walked around my workshop and looked at the camera. I then told Claude to make a lightweight image classifier and to use my face as training data, identifying my face as “Dan.” I had to install the transformers Python package, and I’m pretty sure it used ViT/CLIP to perform the actual image recognition. I’m not entirely sure how the magic works, but it did work.
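
Since I don't have the exact script Claude generated in front of me, here is a hedged sketch of one way such a lightweight classifier could work: embed frames with CLIP and compare them to the average embedding of the training photos of my face. The file paths and threshold are placeholders, and this is a guess at the approach, not the actual implementation.

```python
# Hedged sketch of a nearest-centroid "is that Dan?" check using CLIP embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path: str) -> torch.Tensor:
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)   # unit-normalize for cosine similarity

# Placeholder training photos of my face
dan_centroid = torch.stack([embed(p) for p in ["dan1.jpg", "dan2.jpg"]]).mean(0)

def is_dan(frame_path: str, threshold: float = 0.8) -> bool:
    return float(embed(frame_path) @ dan_centroid.T) > threshold
```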


Little Timmy knew if I was there or not.

Telepresence

Screenshot 2025-11-05 022939.png
Screenshot 2025-11-05 022959.png
Screenshot 2025-11-05 023020.png
Screenshot 2025-11-05 023059.png
Screenshot 2025-11-05 024526.png

While I was struggling with large language models, I wanted to figure out a way for somebody to telepresence into Little Timmy, both for troubleshooting and testing and because the theme of this project is companionship when I am by myself in the shop. This was a whole thing, and explaining how I got it to work is something that a) I cannot do and b) is outside the scope of this tutorial.


Long story short, I got WebRTC to run on a Raspberry Pi 4. It involved a self-hosted security certificate and a Twilio account (basically an Internet phone service, which I already had for home automation purposes) to get TURN servers and STUN servers working so ICE candidates could connect. Basically, my understanding is that most low-latency videoconferencing works on this kind of platform, which identifies two parties that want to talk to each other, negotiates a way for them to reach each other, and connects them privately without either needing to know specific details about the other, for security reasons. It was a whole thing.
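
The only part of that plumbing that is short enough to show is plugging STUN/TURN servers into a peer connection. A hedged sketch using aiortc is below; the Twilio URLs and credentials are placeholders, and the real server also has to handle the camera track, the data channel, and the offer/answer exchange.

```python
# Hedged sketch of WebRTC setup on the Pi 4 with aiortc: configure STUN/TURN
# servers (mine came from Twilio) so ICE candidates can be gathered.
from aiortc import RTCPeerConnection, RTCConfiguration, RTCIceServer

config = RTCConfiguration(iceServers=[
    RTCIceServer(urls="stun:global.stun.twilio.com:3478"),
    RTCIceServer(
        urls="turn:global.turn.twilio.com:3478?transport=udp",
        username="TWILIO_PROVIDED_USERNAME",      # placeholder credentials
        credential="TWILIO_PROVIDED_CREDENTIAL",
    ),
])
pc = RTCPeerConnection(configuration=config)
# ...add the camera video track and data channel here, then do the offer/answer dance.
```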


Ultimately, Little Timmy, independent of the AI, spins up a secure web server that my friends can connect to over the Internet. It provides an almost real-time video feed, controls to make the skull look around, a text box where they can type text and immediately have Little Timmy speak it via TTS, and even control of the butane, which is an exceedingly ill-advised idea (but it was pretty awesome). I demoed it in a video with my friend Tomasz, who is a far more successful YouTuber than I am.

Future Documentation

Screenshot 2025-10-28 014821.png

Dear reader, thank you for making it this far. I am presently trying to divine how to upload the many different components of this project to GitHub without compromising my network security by inadvertently including passwords or other sensitive information. If you have questions, please ask them, because I publish these projects for the community (I would be doing them anyway, but I like to share them). Rock on.