WhisperTranscribe

Linux push-to-talk transcription and voice command routing for the focused desktop.

WhisperTranscribe records while a hardware key is held, transcribes with whisper.cpp, then either types into the focused window or routes speech through an LLM-backed command parser.

What it solves

Desktop speech tools often break context by using clipboards, separate text boxes, or opaque voice-command layers.

WhisperTranscribe keeps the focused application as the integration point: one hardware button types transcription directly, while the other turns spoken instructions into strict GUI, terminal, or key actions.

WhisperTranscribe tray settings Tray settings
WhisperTranscribe advanced tray settings Advanced settings

Who this is for

  • Linux desktop users who want fast voice-to-text insertion
  • People using physical hotkeys for hands-free workflows
  • Developers comparing local and cloud LLM command backends
  • Builders interested in evdev, PipeWire, ydotool, and tray apps

What it does

WhisperTranscribe is a Python daemon driven by the two-button MicrophoneController USB HID device. Both buttons share the same hold-to-record flow: audio is captured while the key is held, transcribed by whisper.cpp on release, and routed based on the trigger.

Scroll Lock injects transcription directly into the focused window via ydotool. Pause routes the text through a configurable LLM backend that must return one executable action: launch a GUI app or URL, run a terminal command, or inject a key sequence. A GTK3 tray exposes service state and settings without editing config files.

Workflow covered

  1. Capture and transcribe - watch selected evdev devices, record with PipeWire while a key is held, and run whisper.cpp on release.
  2. Type or command - inject direct text for transcription mode or route text into a strict command parser for LLM mode.
  3. Configure and supervise - use the tray to change hotkeys, backends, models, binary paths, device filters, GPU mode, and systemd service state.

Technical highlights / stack

Speech
Python 3.9+ whisper.cpp Vulkan PipeWire
Input/output
python-evdev ydotool USB HID systemd
LLM backends
Ollama Claude CLI OpenAI Groq OpenRouter LM Studio
UI
GTK3 D-Bus StatusNotifierItem pytest

Why it matters

The project makes AI and speech useful by narrowing the command surface. The model does not get to improvise arbitrary output; it must produce one of a few parseable actions that the daemon can inspect and execute.

Technical notes

Device watching The daemon handles hotplug, duplicate event nodes, permission errors, and configured input-device filters.
Direct insertion ydotool injects text and key sequences into the active window without touching the clipboard.
Command grammar LLM output must be GUI:, TERMINAL:, or KEYS:, which keeps execution predictable.
Tray control GTK3 StatusNotifierItem and DBusMenu provide service toggle, settings, model picker, hotkey capture, and backend config.

Hard parts

  • Normalizing local and cloud LLM providers into one strict command interface.
  • Handling evdev hotplug, duplicate physical devices, permissions, and filters.
  • Supporting GPU and CPU whisper.cpp paths without making setup confusing.
  • Building a tray app across fragmented Linux tray standards.

Engineering takeaways

  • OpenAI-compatible APIs reduce code, but each provider still needs clear defaults.
  • Desktop automation works best when it preserves the focused application context.
  • Hotplug-aware keyboard watching needs physical-device deduplication.
  • Tray UI can remove config friction, but Linux tray standards add hidden complexity.

Current scope

Works now

  • Hold-to-record flow
  • Direct transcription typing
  • LLM command mode
  • Systemd user service
  • GTK tray settings

Command formats

  • GUI: <cmd>
  • TERMINAL: <cmd>
  • KEYS: <sequence>
  • Backend-specific model config

Branch focus

  • other-models backend support
  • Input-device targeting
  • GPU toggle
  • OpenAI-compatible provider controls

What to do next

Review the source if you want to inspect the evdev watcher, whisper.cpp invocation, ydotool insertion, strict command parser, LLM backends, or GTK tray settings.