Voice Pipeline

CoreLayer includes a voice pipeline that lets you interact with Jarvis through speech.

Voice Pipeline Overview

Wake Word Detection
  → Speech Recognition (ASR)
  → Intent Processing (LLM)
  → Text-to-Speech (TTS)
  → Audio Output

  ↕ Barge-in interruption at any point
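As a rough illustration, the flow above can be modelled as a small state machine. This is a hypothetical TypeScript sketch, not CoreLayer's implementation. The state names reuse the overlay states (listening, processing, speaking) plus an assumed `idle` state, and the `VoiceEvent` names are invented for illustration.

```typescript
// Hypothetical model of the voice pipeline; state and event names are illustrative.
type VoiceState = "idle" | "listening" | "processing" | "speaking";

type VoiceEvent =
  | { type: "wakeWord" }                 // wake word detected
  | { type: "transcript"; text: string } // ASR produced a final transcript
  | { type: "replyStarted" }             // first TTS audio started playing
  | { type: "replyFinished" }            // TTS playback completed
  | { type: "userSpeech" };              // user spoke (possible barge-in)

// Returns the next state; events that don't apply to the current state are ignored.
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (event.type) {
    case "wakeWord":
      return state === "idle" ? "listening" : state;
    case "transcript":
      return state === "listening" ? "processing" : state;
    case "replyStarted":
      return state === "processing" ? "speaking" : state;
    case "replyFinished":
      return state === "speaking" ? "idle" : state;
    case "userSpeech":
      // Barge-in: speaking while Jarvis is thinking or talking returns to listening.
      return state === "processing" || state === "speaking" ? "listening" : state;
  }
}
```

For example, `idle` moves to `listening` on `wakeWord`, to `processing` on `transcript`, and to `speaking` on `replyStarted`. A `userSpeech` event at that point returns the pipeline to `listening`.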

Configuration

```json
{
  "voice": {
    "enabled": true,
    "wakeWord": {
      "enabled": true,
      "keyword": "jarvis"
    },
    "asr": {
      "provider": "web-speech",
      "language": "en-US"
    },
    "tts": {
      "provider": "mimo",
      "streaming": true,
      "sentenceLevel": true
    }
  }
}
```
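If you build tooling around this file, a typed view of the `voice` section can help catch mistakes. The sketch below assumes the shape shown in the example. `VoiceConfig`, `VoiceConfigInput`, `DEFAULT_VOICE_CONFIG`, and `resolveVoiceConfig` are hypothetical names, and apart from the documented `"jarvis"` keyword the fallback values are simply copied from the example, not confirmed CoreLayer defaults.

```typescript
// Hypothetical typed view of the "voice" section of the configuration.
interface VoiceConfig {
  enabled: boolean;
  wakeWord: { enabled: boolean; keyword: string };
  asr: { provider: string; language: string };
  tts: { provider: string; streaming: boolean; sentenceLevel: boolean };
}

// Every field is optional in user input; missing values fall back to defaults.
interface VoiceConfigInput {
  enabled?: boolean;
  wakeWord?: Partial<VoiceConfig["wakeWord"]>;
  asr?: Partial<VoiceConfig["asr"]>;
  tts?: Partial<VoiceConfig["tts"]>;
}

// Values copied from the example above (only the "jarvis" keyword is a documented default).
const DEFAULT_VOICE_CONFIG: VoiceConfig = {
  enabled: true,
  wakeWord: { enabled: true, keyword: "jarvis" },
  asr: { provider: "web-speech", language: "en-US" },
  tts: { provider: "mimo", streaming: true, sentenceLevel: true },
};

// Shallow-merges each sub-section so a partial override keeps the other fields.
function resolveVoiceConfig(input: VoiceConfigInput = {}): VoiceConfig {
  return {
    enabled: input.enabled ?? DEFAULT_VOICE_CONFIG.enabled,
    wakeWord: { ...DEFAULT_VOICE_CONFIG.wakeWord, ...input.wakeWord },
    asr: { ...DEFAULT_VOICE_CONFIG.asr, ...input.asr },
    tts: { ...DEFAULT_VOICE_CONFIG.tts, ...input.tts },
  };
}
```

For instance, `resolveVoiceConfig({ asr: { language: "de-DE" } })` changes only the `language` field and keeps `provider: "web-speech"`.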

Wake Word

Uses Picovoice Porcupine for wake word detection:

Runs locally in the browser/webview
Low CPU usage when idle
Configurable keyword (default: "jarvis")

Speech Recognition (ASR)

Two providers are available:

| Provider | Type | Notes |
|---|---|---|
| Web Speech API (`web-speech`) | Browser-native | Good accuracy, no API key needed |
| Groq Whisper | Cloud | High accuracy, requires an API key |
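For the browser-native provider, recognition typically goes through the standard Web Speech API (`SpeechRecognition`, exposed as `webkitSpeechRecognition` in some browsers). The following is a minimal sketch, assuming the `asr.language` value is passed as the recognizer's `lang`. `startRecognition`, `getSpeechRecognitionCtor`, and the `*Like` structural types are hypothetical and exist only so the example compiles without DOM speech typings; this is not CoreLayer's code.

```typescript
// Structural types covering only the Web Speech API members used below.
interface RecognitionAlternativeLike { transcript: string }
interface RecognitionResultLike { isFinal: boolean; 0: RecognitionAlternativeLike }
interface RecognitionEventLike { resultIndex: number; results: ArrayLike<RecognitionResultLike> }
interface RecognitionLike {
  lang: string;
  interimResults: boolean;
  continuous: boolean;
  onresult: ((event: RecognitionEventLike) => void) | null;
  start(): void;
  stop(): void;
}
type RecognitionCtor = new () => RecognitionLike;

// Returns the browser's recognizer constructor, or undefined where it is not available.
function getSpeechRecognitionCtor(): RecognitionCtor | undefined {
  const g = globalThis as unknown as {
    SpeechRecognition?: RecognitionCtor;
    webkitSpeechRecognition?: RecognitionCtor;
  };
  return g.SpeechRecognition ?? g.webkitSpeechRecognition;
}

// Hypothetical helper: starts one recognition session and reports final transcripts.
function startRecognition(
  language: string,
  onFinal: (text: string) => void,
  Ctor: RecognitionCtor | undefined = getSpeechRecognitionCtor(),
): RecognitionLike | null {
  if (!Ctor) return null; // Web Speech API unavailable
  const recognition = new Ctor();
  recognition.lang = language;       // e.g. "en-US" from asr.language
  recognition.interimResults = true; // partial results could feed live transcription; only finals are forwarded here
  recognition.continuous = false;    // end the session after one utterance
  recognition.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (result.isFinal) onFinal(result[0].transcript);
    }
  };
  recognition.start();
  return recognition;
}
```

The constructor parameter makes it easy to swap in a fake recognizer for testing, and returning `null` lets a caller fall back to another provider when the browser lacks the API.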

Text-to-Speech (TTS)

| Provider | Type | Features |
|---|---|---|
| MiMo TTS (`mimo`) | Cloud | Sentence-level streaming, natural voice |

MiMo TTS is a cloud-based text-to-speech service that supports sentence-level streaming: instead of waiting for the full response, it starts speaking each sentence as soon as that sentence has been generated. This significantly reduces perceived latency.
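Sentence-level streaming amounts to buffering the LLM's token stream and handing each completed sentence to TTS. A minimal sketch, assuming a sentence ends at `.`, `!`, or `?` followed by whitespace; `SentenceChunker` is a hypothetical helper, not part of CoreLayer or the MiMo API.

```typescript
// Hypothetical helper: buffers streamed text and releases complete sentences.
class SentenceChunker {
  private buffer = "";

  // Appends a streamed token and returns any sentences that are now complete.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    const boundary = /[.!?]+\s+/g; // end punctuation followed by whitespace
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence) sentences.push(sentence);
      start = end;
    }
    this.buffer = this.buffer.slice(start); // keep the unfinished tail
    return sentences;
  }

  // Returns whatever text remains once the stream has ended.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}
```

In a pipeline, each string returned by `push()` would be sent to TTS immediately, so the first sentence can play while the rest of the reply is still being generated. Requiring whitespace after the punctuation keeps numbers such as `3.14` intact and holds back the final sentence until `flush()` is called; abbreviations such as "Dr." would still be split, so a production splitter would need more rules.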

Barge-in Interruption

When enabled, you can interrupt Jarvis mid-response by speaking. The pipeline will:

Stop the current TTS playback
Process your new input
Respond to the interruption
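One common way to implement these steps is to give each response its own `AbortController` and abort it when the user starts talking. The sketch below is hypothetical: `BargeInController` and `speakAll` are illustrative names, and `playSentence` stands in for whatever TTS playback call is actually used. It stops between sentences and also forwards the signal so that playback can be cut off mid-sentence.

```typescript
// Hypothetical barge-in controller: each response gets its own AbortController.
class BargeInController {
  private current: AbortController | null = null;

  // Starts a new response, cancelling any response that is still in progress.
  begin(): AbortSignal {
    this.interrupt();
    this.current = new AbortController();
    return this.current.signal;
  }

  // Called when a response finishes normally, so a later interrupt has nothing to cancel.
  finish(): void {
    this.current = null;
  }

  // Called when the user starts speaking; returns true if an active response was cancelled.
  interrupt(): boolean {
    if (this.current === null) return false;
    this.current.abort();
    this.current = null;
    return true;
  }
}

// Speaks sentences in order and stops once the signal is aborted.
async function speakAll(
  sentences: string[],
  signal: AbortSignal,
  playSentence: (sentence: string, signal: AbortSignal) => Promise<void>,
): Promise<string[]> {
  const spoken: string[] = [];
  for (const sentence of sentences) {
    if (signal.aborted) break; // barge-in: skip the rest of the reply
    await playSentence(sentence, signal);
    spoken.push(sentence);
  }
  return spoken;
}
```

After `interrupt()`, the new utterance would go back through ASR and the LLM, and its reply would start with a fresh `begin()` call.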

Voice Overlay

When voice is active, a floating overlay appears showing:

Current state (listening, processing, speaking)
Waveform visualization
Transcription text

Next Steps

Enable Voice Workflow — step-by-step setup
Core Concepts: Jarvis — how voice integrates with the assistant
