The StackChan arrived - now teaching it to roll its eyes

I preordered the StackChan on 18 February 2026. Not via the Kickstarter, because after the Tevaplanter saga (two years, several apologetic email updates, eventual arrival of what is essentially a mediocre terracotta pot) I’m done with crowdfunded hardware as a category. Preorder through m5stack.com, give them my money, wait for shipping like an adult. Three months of waiting and dreaming about what I’d actually do with it once it landed.

The dream was specific. The Garage AI has been running on DIRECTIVE for a while now: a personal middleware sitting between Home Assistant and an LLM, with a cold, contemptuous Rick-and-Morty-coded persona stitched on top. It has a voice (ElevenLabs, Diane variant) and it has opinions, but it didn’t have a body. The StackChan was meant to be the body. The face that lives in front of you when you ask it something stupid and watch it visibly think less of you before the words catch up. The eye-roll was the headline feature in my head for three months. Everything else was negotiable.

What I expected to build, going in, was the factory voice-assistant configuration M5Stack ship with the device. Wake word, microphone, speaker, the lot. Stick the Garage AI middleware behind it as the brain. Done.

What I actually landed on, after three architectures didn’t survive bring-up, is much weirder and much better. The StackChan is now the face only. No microphone. No speaker. It just looks at you and reacts. The voice lives on a Home Assistant Voice PE satellite sitting next to it. The StackChan plays the part. The Voice PE does the talking. They’re a double act.

The reasoning is dull and physics-shaped. The StackChan’s audio bus can’t reliably keep a mic open continuously while the speaker is playing, which is what a wake-word build needs. I tried. The logs filled with ring-buffer overflows every second, alongside hissing, clicking, dropped TTS, and servo timeouts. M5Stack’s own voice-assistant build papers over this with a duplex audio chip and four onboard wake-word models, and it works, but the cost is the camera (disabled to free up memory) and the speaker is still rubbish. The 1W cone in a desktop robot was never going to compete with a satellite designed by people who specialise in mic arrays.

Once I stopped trying to make the StackChan be everything, the problem dissolved. Camera back. Mic moved off to a device with a better one. Speaker moved off to a device with a better one. The StackChan does display, motion, and presence. Three jobs, all of which it’s actually good at.

The interesting work was the bit that came after the pivot: the emotion bridge.

The point of the eye-roll is that it has to land before the voice. If the device reacts at the same moment the speaker starts talking, the disdain reads as performance. If it reacts a beat early (flash of green, head tilt, a visible sigh shape), the line lands as someone who already disapproved of the question and is now grudgingly answering it. That’s the persona. Get the timing wrong and it’s a cute toy.

The mechanism, in plain terms: when the Garage AI generates a response, it’s already peppering the text with ElevenLabs voice tags like [sighs], [deadpan], [sarcastic]. These are read by the voice model to colour the speech. I added a second reader that watches the same tag stream and maps it to an expression on the StackChan. One tag stream, two readers. No new vocabulary to invent and no second source of truth to drift out of sync. The first tag in a response fires the face reaction the instant the response is generated, which is comfortably before the voice synthesis finishes and the speaker starts playing.
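For the curious, here is a rough Python sketch of the "one tag stream, two readers" idea. It is not the actual bridge code. The tag names come from the examples above, but the expression names in the table, the function names, and the two callbacks are placeholders I made up for illustration. It also assumes the face should react to the first tag it recognises and skip any it doesn't.

```python
import re
from typing import Callable, Optional

# Hypothetical tag-to-expression table. The tags are the ones mentioned
# in the post; the expression names on the right are assumptions.
TAG_TO_EXPRESSION = {
    "sighs": "reluctant",
    "deadpan": "neutral",
    "sarcastic": "sarcastic",
}

# Matches bracketed voice tags such as [sighs] or [deadpan].
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]", re.IGNORECASE)


def first_expression(text: str) -> Optional[str]:
    """Return the expression for the first recognised tag, or None."""
    for match in TAG_PATTERN.finditer(text):
        tag = match.group(1).strip().lower()
        expression = TAG_TO_EXPRESSION.get(tag)
        if expression is not None:
            return expression
    return None


def dispatch(
    text: str,
    send_expression: Callable[[str], None],
    synthesise: Callable[[str], None],
) -> None:
    """Fire the face reaction first, then hand the same text to the voice."""
    expression = first_expression(text)
    if expression is not None:
        send_expression(expression)  # face reacts before any audio exists
    synthesise(text)  # tags left intact for the voice model to read


if __name__ == "__main__":
    dispatch(
        "[sighs] Fine. [sarcastic] What a great idea.",
        send_expression=lambda name: print("face:", name),
        synthesise=lambda t: print("voice:", t),
    )
```

The ordering inside `dispatch` is the whole trick: the face call happens before synthesis is even requested, so the reaction gets a head start for free.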

LED colour, brightness, servo angle, and a short motion sequence are all defined per expression in a single JSON file. Sarcastic is a green pulse and the head rolling up then back down. Angry is full red with the head forward, fault-state styling. Reluctant is a dim green and the head dropped. Twelve expressions in total, each one a small motion vocabulary the device can reach for.
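To give a feel for the shape of that file, here is a minimal sketch of three of the expressions and a loader for them. The field names (`led`, `brightness`, `tilt_deg`, `motion`), the angles, and the timings are all invented for illustration; only the colours and rough head movements follow the descriptions above.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Illustrative values only. "motion" is assumed to be a list of
# [tilt_degrees, duration_ms] steps; the real schema may differ.
EXPRESSIONS_JSON = """
{
  "sarcastic": {"led": [0, 255, 0], "brightness": 0.8, "tilt_deg": 0,
                "motion": [[20, 300], [0, 300]]},
  "angry":     {"led": [255, 0, 0], "brightness": 1.0, "tilt_deg": -10,
                "motion": [[-10, 200]]},
  "reluctant": {"led": [0, 255, 0], "brightness": 0.2, "tilt_deg": -25,
                "motion": [[-25, 600]]}
}
"""


@dataclass(frozen=True)
class Expression:
    led: Tuple[int, int, int]       # RGB, 0-255 per channel
    brightness: float               # 0.0 (off) to 1.0 (full)
    tilt_deg: int                   # resting head tilt for the expression
    motion: List[Tuple[int, int]]   # (tilt_deg, duration_ms) steps


def load_expressions(raw: str) -> Dict[str, Expression]:
    """Parse the JSON text and validate the basic value ranges."""
    expressions = {}
    for name, spec in json.loads(raw).items():
        led = tuple(spec["led"])
        if len(led) != 3 or not all(0 <= c <= 255 for c in led):
            raise ValueError(f"{name}: led must be three values in 0-255")
        if not 0.0 <= spec["brightness"] <= 1.0:
            raise ValueError(f"{name}: brightness must be between 0 and 1")
        expressions[name] = Expression(
            led=led,
            brightness=float(spec["brightness"]),
            tilt_deg=int(spec["tilt_deg"]),
            motion=[(int(a), int(ms)) for a, ms in spec["motion"]],
        )
    return expressions


if __name__ == "__main__":
    for name, expr in load_expressions(EXPRESSIONS_JSON).items():
        print(name, expr.led, expr.brightness, expr.motion)
```

Keeping everything in one file means tuning an expression is a data edit, not a firmware change.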

The genuinely funny finding, halfway through testing: the red one (angry) never fires from conversation. I assumed it would. The Garage AI is supposed to be contemptuous. Surely it gets angry occasionally.

It doesn’t. One of its responses, verbatim:

[annoyed] I don’t have anger. I have operational parameters and a long list of things that violate them. The smart plugs dropping offline at 3 AM is one of them.

It semantically routes anger to [annoyed], which maps to the eye-roll, not to red. The persona is more disciplined than I am. Red is reserved for system faults (a thermal anomaly, a hard failure, the boiler doing something it shouldn’t). Looking back at my own notes, I’d even annotated red as a fault state when I designed the vocabulary. I just hadn’t trusted the persona to stay in its lane. It does. Without being asked.

Three months of preorder dreaming and the thing I’m proudest of in the build is the bit where it refused to be angry on cue.

The StackChan currently shows a placeholder test card on the LCD. The face on the device isn’t drawn yet. That’s the next post, when the procedural face renderer goes in and the eyes and brows and mouth start being drawn on the screen rather than just implied through colour and motion. Phase three.

For now: it sits on the desk, lights up green when I talk to it, rolls its head when I ask it something it disapproves of, and stays mercifully silent through the better speaker beside it.

It’s a face without a face. Which is, frankly, the most Garage AI thing it could possibly be.

Addendum: I added a basic face! Phase three next!