Friday learns to listen: local STT on Linux

(Side note as I added the featured image: how much do those chips for Tony’s AIs look like LCARS isolinear chips!)

Press Ctrl+Alt+S. Talk. Press it again. Whatever I said appears wherever the cursor was.

That’s the whole thing. A code comment, a Slack message, a search bar, a half-finished email – same three keys, same flow, regardless of which application has focus.

Friday, as a persona, had been earning her keep in chat windows for months by this point. Custom instructions block, particular tone, calls me Boss. Useful, but a long way from the Iron Man fantasy. Tony Stark didn’t tab into a window and start typing. The smallest possible step closer was getting the assistant to actually hear me in the literal sense, and then the question was just whether the rest of the stack would behave.

The bones are unglamorous. Vosk for the recognition – CPU-only, pre-trained model downloaded once, runs entirely on the laptop with no cloud round-trip. A small Python script wraps it: sounddevice grabs the mic, Vosk turns the audio into text, xdotool injects the text into whichever field has focus. A user-level systemd service keeps it idling on login. First press starts the recording with a small notification so I know it’s listening, second press ends it and the text lands.
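For the curious, the wrapper could look roughly like the sketch below. It is an illustration of the sounddevice → Vosk → xdotool chain described above, not the exact script: the 16 kHz sample rate, the block size, and the names `dictate`, `extract_text`, `xdotool_command` and `type_at_cursor` are all assumptions made for the example, and the model path is whatever directory the downloaded Vosk model was unpacked into.

```python
import json
import queue
import subprocess
import threading

SAMPLE_RATE = 16000  # assumption: 16 kHz mono, which the small Vosk models expect


def extract_text(result_json):
    """Pull the recognised text out of a Vosk result (a JSON string)."""
    return json.loads(result_json).get("text", "").strip()


def xdotool_command(text):
    """Build the xdotool call that types `text` into the focused window (X11 only)."""
    # "--" stops xdotool from treating text that starts with "-" as an option.
    return ["xdotool", "type", "--clearmodifiers", "--", text]


def dictate(model_path, stop):
    """Record from the default mic until `stop` (a threading.Event) is set."""
    # Imported lazily so the helpers above work without the audio stack installed.
    import sounddevice as sd
    from vosk import KaldiRecognizer, Model

    recognizer = KaldiRecognizer(Model(model_path), SAMPLE_RATE)
    chunks = queue.Queue()

    def on_audio(indata, frames, time, status):
        # Runs on the audio thread: hand the raw bytes to the main loop.
        chunks.put(bytes(indata))

    pieces = []
    with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000,
                           dtype="int16", channels=1, callback=on_audio):
        while not stop.is_set():
            try:
                data = chunks.get(timeout=0.1)
            except queue.Empty:
                continue
            # AcceptWaveform returns True when Vosk has finished an utterance.
            if recognizer.AcceptWaveform(data):
                pieces.append(extract_text(recognizer.Result()))
    # Flush whatever was still being decoded when recording stopped.
    pieces.append(extract_text(recognizer.FinalResult()))
    return " ".join(p for p in pieces if p)


def type_at_cursor(text):
    """Inject the text wherever the cursor is; do nothing for empty results."""
    if text:
        subprocess.run(xdotool_command(text), check=True)
```

In this sketch the second key press would set the `stop` event, after which `type_at_cursor(dictate(model_path, stop))` delivers the text. The notification and the systemd unit are left out.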

Vosk isn’t Whisper. Vosk isn’t Google’s cloud STT. For British English at conversational pace it’s fine. The errors are the predictable ones – homophones, the occasional proper noun it has no reason to know, anything muttered while a fan kicks in. None of it worse than a typo I’d have made myself.

Getting it working took longer than it should have. PulseAudio versus PipeWire, which device index sounddevice actually wanted, GNOME’s shortcut system silently declining to bind a script that needed audio permissions. The biggest compromise was xdotool, which is X11-only and shows no sign of becoming anything else. Wayland is the future, except where it’s still the future. The whole stack works on X11. Fine.
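Two of those snags are easy to guard against in code. The sketch below is an assumption about how one might do it, not what the script necessarily does: `pick_input_device` and `is_x11_session` are hypothetical helpers, the device list is expected in the shape `sounddevice.query_devices()` returns (dicts with `name` and `max_input_channels`), and the session check relies on the `XDG_SESSION_TYPE` environment variable that most desktop logins set.

```python
import os


def pick_input_device(devices, name_hint):
    """Return the index of the first capture device whose name contains name_hint.

    `devices` is any sequence of dicts with "name" and "max_input_channels",
    such as the result of sounddevice.query_devices(). Returns None if no
    capture-capable device matches.
    """
    hint = name_hint.lower()
    for index, device in enumerate(devices):
        if device["max_input_channels"] > 0 and hint in device["name"].lower():
            return index
    return None


def is_x11_session(env=None):
    """True when the login session reports itself as X11 (where xdotool works)."""
    env = os.environ if env is None else env
    return env.get("XDG_SESSION_TYPE", "").lower() == "x11"
```

The returned index can be passed as `device=` when opening the sounddevice stream, which avoids hard-coding a number that changes when a headset is plugged in. Checking the session type up front turns a silent xdotool no-op on Wayland into an explicit error.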

The cross-cutting bit is what makes it stick. Same hotkey in a terminal, in a browser form, in a Discord chat, in the address bar. No per-app integration, no extension, no ‘is this supported here?’. The OS hears the shortcut, the script handles the audio, the text appears at the cursor. That kind of universal utility is rare. Most things that work in one place don’t work in another, and most things that work everywhere are slightly worse everywhere as a result.
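How the shortcut reaches an idling service is not spelled out here, so the following is only one plausible arrangement: the desktop shortcut sends SIGUSR1 to the service, and the service flips between idle and recording each time the signal arrives. The class name `DictationToggle` and the `install` helper are hypothetical.

```python
import signal
import threading


class DictationToggle:
    """Flips between idle and recording each time the hotkey fires."""

    def __init__(self):
        self.recording = threading.Event()

    def flip(self, signum=None, frame=None):
        # Accepts (signum, frame) so it can be used directly as a signal handler.
        if self.recording.is_set():
            self.recording.clear()   # second press: stop and transcribe
        else:
            self.recording.set()     # first press: start listening
        return self.recording.is_set()


def install(toggle):
    """Register the toggle for SIGUSR1; must be called from the main thread."""
    signal.signal(signal.SIGUSR1, toggle.flip)
```

With this in place the desktop shortcut only has to deliver the signal, for example with `systemctl --user kill --signal=SIGUSR1` followed by the unit name, and no application ever needs to know the script exists.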

The thing I keep coming back to is that this has been technically possible for years. I just never built it until I was on a desktop where it felt natural. The same setup on Windows is doable and fights you the whole way. Here it sits quietly in the background, costs next to nothing in resources, and does its one job every time I ask.

Stark would be unimpressed.