After two years of bouncing between Claude Desktop, ChatGPT voice mode, Gemini, and a half-dozen Ollama frontends, I got tired of the wake-word thrash. Every assistant assumes you’ve picked their team forever.
So I built BRAGI — a voice layer that runs locally, listens locally, and routes to whichever AI I tell it to. Including the one running on the same machine.
This post is the architecture, not a sales pitch. If you’ve been thinking about building something similar, here’s what I learned shipping v0.2.
The pipeline
Mic input
↓
openwakeword (local) — “Hey Jarvis”
↓
faster-whisper medium (local, GPU optional)
↓
Provider router (settings UI picks destination)
↓
[Cloud: Claude / OpenAI / Gemini / Grok / Groq / Together / HuggingFace]
[Local: Ollama / LM Studio / FREYA / Echo]
↓
TTS (eSpeak free, OpenAI Nova BYOK)
↓
Speaker output
Audio never leaves the machine. Only transcribed text goes to whichever cloud you picked, if any.
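The loop above, stripped to structure. Every stage here is a stub standing in for the real component, and the names are illustrative, not BRAGI's actual API:

```python
# Structure-only sketch of the pipeline; the injected callables stand in for
# openwakeword, faster-whisper, the provider router, and the TTS engine.
def run_once(frame, audio, *, detect, transcribe, route, speak) -> bool:
    """One pass of the loop: returns True if the wake word fired."""
    if not detect(frame):        # openwakeword, on-device
        return False
    text = transcribe(audio)     # faster-whisper, on-device
    reply = route(text)          # cloud or local provider: text only leaves here
    speak(reply)                 # eSpeak / Nova
    return True
```

Injecting the stages as callables is what makes the provider swap trivial: only `route` changes when the user picks a different backend.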
Wake word
openwakeword is the right call for a sovereign product. Picovoice is better quality but locks you into a paid commercial license. openwakeword is Apache 2.0 and runs on CPU.
The catch: training your own custom model requires matching the feature dimensions to whichever preprocessor version you’re targeting. I burned half a day on a model that had 96×103 features when openwakeword expected 32×147. v0.2 ships with the stock “Hey Jarvis” model and includes the custom “Hey BRAGI” model for users with compatible hardware.
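A load-time guard would have saved me that half day. A sketch, where the expected dims are the ones from my build and will differ by preprocessor version:

```python
# Hypothetical guard: reject a custom wake-word model whose feature shape
# doesn't match the preprocessor. (32, 147) is what my openwakeword build
# expected; my mistrained model emitted (96, 103).
EXPECTED_FEATURE_SHAPE = (32, 147)

def validate_model_features(shape: tuple[int, int]) -> None:
    """Fail fast at load time instead of silently never detecting."""
    if shape != EXPECTED_FEATURE_SHAPE:
        raise ValueError(
            f"wake-word model features {shape} != preprocessor "
            f"{EXPECTED_FEATURE_SHAPE}; retrain against the matching "
            "openwakeword preprocessor version"
        )
```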
STT
faster-whisper medium on CUDA is the sweet spot. Tiny is too inaccurate for real conversation, large is overkill for short voice commands. Medium gets ~1 second latency on a midrange GPU and handles bilingual input out of the box.
Critical detail: instantiate Whisper once at startup, never per-request. First inference call takes 5-10 seconds to warm CUDA. Users won’t tolerate that on every wake.
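A sketch of the once-at-startup pattern. Model size and device come from the setup above; the lazy import and CPU fallback are my conventions, not anything faster-whisper requires:

```python
# Load faster-whisper once, reuse on every wake. The lru_cache makes the
# 5-10 s CUDA warm-up a one-time cost instead of a per-request one.
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    from faster_whisper import WhisperModel  # lazy: keeps startup order flexible
    try:
        return WhisperModel("medium", device="cuda", compute_type="float16")
    except Exception:
        # Assumption: fall back to CPU int8 when no usable GPU is present.
        return WhisperModel("medium", device="cpu", compute_type="int8")

def transcribe(wav_path: str) -> str:
    segments, _info = get_model().transcribe(wav_path, beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)
```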
The router
This was the hardest part. Each provider has a different SDK, different streaming format, different auth pattern. The router abstracts that into one interface:
from typing import AsyncIterator, Protocol

class Provider(Protocol):
    def name(self) -> str: ...
    def is_ready(self) -> bool: ...
    async def respond(self, prompt: str, history: list[Message]) -> AsyncIterator[str]: ...
Each provider implementation handles its own SDK quirks. The router just picks one based on user settings or voice command (“BRAGI, switch to Claude”) and calls respond().
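In miniature, using a toy echo provider for illustration (the real ones wrap SDKs, but the router logic is this simple):

```python
# Router sketch: one active provider, switchable by name. EchoProvider is a
# toy stand-in that streams the prompt back word by word.
import asyncio
from typing import AsyncIterator

class EchoProvider:
    def name(self) -> str:
        return "echo"
    def is_ready(self) -> bool:
        return True
    async def respond(self, prompt: str, history: list) -> AsyncIterator[str]:
        for word in prompt.split():
            yield word + " "

class Router:
    def __init__(self, providers):
        self._providers = {p.name(): p for p in providers}
        self.active = next(iter(self._providers))

    def switch(self, name: str) -> None:
        # e.g. triggered by the voice command "BRAGI, switch to claude"
        if name in self._providers and self._providers[name].is_ready():
            self.active = name

    async def ask(self, prompt: str, history=()) -> str:
        provider = self._providers[self.active]
        return "".join([c async for c in provider.respond(prompt, list(history))])

router = Router([EchoProvider()])
```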
For local models I support both Ollama (HTTP API) and LM Studio (OpenAI-compatible HTTP API). Both run on the user’s machine. Both look identical to the router.
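Since LM Studio speaks the OpenAI chat-completions dialect and Ollama exposes an OpenAI-compatible endpoint too, one client shape covers both, with only the base URL differing. The ports below are each tool's defaults, and `stream_local` is a sketch, not BRAGI's actual code:

```python
# Both local backends look like OpenAI servers; only the base_url differs.
# Ports are the stock defaults (Ollama 11434, LM Studio 1234) -- adjust per install.
LOCAL_BACKENDS = {
    "ollama":   "http://127.0.0.1:11434/v1",
    "lmstudio": "http://127.0.0.1:1234/v1",
}

async def stream_local(backend: str, model: str, prompt: str):
    from openai import AsyncOpenAI  # lazy: only needed when a local backend is active
    client = AsyncOpenAI(base_url=LOCAL_BACKENDS[backend], api_key="not-needed")
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```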
TTS
eSpeak ships with the installer because it’s free, offline, and covers 100+ languages. It sounds robotic. That’s fine. People who want premium voice can paste an OpenAI API key and use Nova.
I tried Kokoro for higher-quality offline TTS. Worked great in dev. Production builds kept hitting a 404 fetching the default voice file from HuggingFace. Shipped with eSpeak as the default and Kokoro as best-effort.
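"Best-effort" in practice is just a fallback chain: try the engines in preference order, land on eSpeak when the fancier ones fail. A sketch with hypothetical engine callables:

```python
# TTS fallback chain sketch: engines is an ordered list of (name, speak_fn)
# pairs, e.g. [("kokoro", ...), ("espeak", ...)]. Returns the engine that
# actually spoke, so the UI can report which voice is live.
def synthesize(text: str, engines: list) -> str:
    for name, speak_fn in engines:
        try:
            speak_fn(text)
            return name
        except Exception:
            continue  # e.g. Kokoro's 404 on the voice file: fall through
    raise RuntimeError("no TTS engine available")
```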
The settings UI
Local web UI on http://127.0.0.1:7777. Configure providers, paste API keys, pick voices, manage license. Page lives on the user’s machine. No account, no login, no cloud dashboard.
API keys live in a local vault. They never leave the machine. The product is sovereignty — that has to be true at every layer.
Stack
- Python 3.11
- openwakeword for wake detection
- faster-whisper for STT
- eSpeak / OpenAI Nova for TTS
- FastAPI for the local settings server
- pythonw.exe in tray mode for daily use
- PyInstaller for bundling
- NSIS for the Windows installer
- ~169MB installer, Win10/11
What I’d do differently
- Custom wake word training is harder than the docs admit. openwakeword’s preprocessor is versioned, and the feature dimensions have to match exactly. Document this for users who want to train their own.
- PyInstaller + 4GB CUDA torch builds blow past NSIS’s 2GB single-file limit. I had to move torch and Kokoro to a first-run download instead of bundling them.
- Don’t trust the embedded Python’s python311._pth defaults. User-site contamination from %APPDATA%\Python (Roaming) will silently break your install. Always launch with the -s -E flags.
What’s next
v0.3 will likely add: better Kokoro fallback, custom wake word training UI, multi-room concurrency. The architecture supports it — I just need to ship v0.2 first and see what users actually ask for.
If you want to see it: clintwave84.gumroad.com/l/leetkd
If you’ve built something similar and want to compare notes — drop a comment. Especially curious how others have handled the provider abstraction across cloud + local.
— Built by one guy in Idaho. Snake River AI.
