WhisperTranscribe
Linux push-to-talk transcription and voice command routing for the focused desktop.
WhisperTranscribe records while a hardware key is held, transcribes with whisper.cpp, then either types into the focused window or routes speech through an LLM-backed command parser.
What it solves
Desktop speech tools often break context by using clipboards, separate text boxes, or opaque voice-command layers.
WhisperTranscribe keeps the focused application as the integration point: one hardware button types transcription directly, while the other turns spoken instructions into strict GUI, terminal, or key actions.
Tray settings
Advanced settings
Who this is for
- Linux desktop users who want fast voice-to-text insertion
- People using physical hotkeys for hands-free workflows
- Developers comparing local and cloud LLM command backends
- Builders interested in evdev, PipeWire, ydotool, and tray apps
What it does
WhisperTranscribe is a Python daemon driven by the two-button MicrophoneController USB HID device. Both buttons share the same hold-to-record flow: audio is captured while the key is held, transcribed by whisper.cpp on release, and routed based on the trigger.
Scroll Lock injects transcription directly into the focused window via ydotool. Pause routes the text through a configurable LLM backend that must return one executable action: launch a GUI app or URL, run a terminal command, or inject a key sequence. A GTK3 tray exposes service state and settings without editing config files.
Workflow covered
- Capture and transcribe - watch selected evdev devices, record with PipeWire while a key is held, and run whisper.cpp on release.
- Type or command - inject direct text for transcription mode or route text into a strict command parser for LLM mode.
- Configure and supervise - use the tray to change hotkeys, backends, models, binary paths, device filters, GPU mode, and systemd service state.
Technical highlights / stack
Why it matters
The project makes AI and speech useful by narrowing the command surface. The model does not get to improvise arbitrary output; it must produce one of a few parseable actions that the daemon can inspect and execute.
Technical notes
GUI:,
TERMINAL:, or KEYS:, which keeps
execution predictable.
Hard parts
- Normalizing local and cloud LLM providers into one strict command interface.
- Handling evdev hotplug, duplicate physical devices, permissions, and filters.
- Supporting GPU and CPU whisper.cpp paths without making setup confusing.
- Building a tray app across fragmented Linux tray standards.
Engineering takeaways
- OpenAI-compatible APIs reduce code, but each provider still needs clear defaults.
- Desktop automation works best when it preserves the focused application context.
- Hotplug-aware keyboard watching needs physical-device deduplication.
- Tray UI can remove config friction, but Linux tray standards add hidden complexity.
Current scope
Works now
- Hold-to-record flow
- Direct transcription typing
- LLM command mode
- Systemd user service
- GTK tray settings
Command formats
GUI: <cmd>TERMINAL: <cmd>KEYS: <sequence>- Backend-specific model config
Branch focus
other-modelsbackend support- Input-device targeting
- GPU toggle
- OpenAI-compatible provider controls
Related work
What to do next
Review the source if you want to inspect the evdev watcher, whisper.cpp invocation, ydotool insertion, strict command parser, LLM backends, or GTK tray settings.