A live caption app for macOS.
Privacy-first, lightweight, and friendly for macOS users. What happens on your device stays on your device.
- Privacy First – No cloud, no analytics, no ads, no internet required, and no screen capture access.
- Lightweight & Fast – Runs efficiently, with up to 1.7× faster word-level performance and 10% lower latency than the built-in Live Captions.
- Minimalist Design – One-click on/off, no distractions. Less is more.
- Open Source – Free and transparent.
🎉 v1.0 Now Available on the App Store!
Download Livcap from the Mac App Store
Livcap outperforms macOS's native Live Caption with significant improvements:
✅ 1.7x faster word-level lead rate
✅ 10% lower latency
✅ More efficient processing with better resource utilization
See detailed comparison benchmarks in livcapComparision.md
Our performance gains come from three key optimizations:
🎯 Single-pass inference - Uses one SFSpeechRecognizer call instead of multiple inferences observed in native Live Caption
⚡ Smart downsampling - Converts audio from 48kHz to 16kHz before processing, maintaining quality while reducing computational overhead
🔇 VAD-based silence skipping - Voice Activity Detection prevents unnecessary processing during silent periods, saving resources and improving responsiveness
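The downsampling step can be sketched with AVFoundation's converter API. This is an illustrative helper, not the app's actual code, assuming Float32 mono buffers:

```swift
import AVFoundation

// Hypothetical helper: downsample a 48 kHz buffer to 16 kHz mono
// before handing it to the speech recognizer.
func downsampleTo16k(_ input: AVAudioPCMBuffer) -> AVAudioPCMBuffer? {
    guard let outFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                        sampleRate: 16_000,
                                        channels: 1,
                                        interleaved: false),
          let converter = AVAudioConverter(from: input.format, to: outFormat) else {
        return nil
    }
    let ratio = outFormat.sampleRate / input.format.sampleRate
    let capacity = AVAudioFrameCount(Double(input.frameLength) * ratio)
    guard let output = AVAudioPCMBuffer(pcmFormat: outFormat, frameCapacity: capacity) else {
        return nil
    }
    var consumed = false
    converter.convert(to: output, error: nil) { _, status in
        if consumed {
            status.pointee = .noDataNow   // one buffer per call in this sketch
            return nil
        }
        consumed = true
        status.pointee = .haveData
        return input
    }
    return output
}
```

Dropping from 48 kHz to 16 kHz cuts the sample count by two thirds while staying above the bandwidth speech models are trained on.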
Complete local processing with zero external dependencies:
🔒 No cloud services - Built entirely on Apple's native SFSpeechRecognizer framework, ensuring all speech processing happens locally on your device
🎵 Direct audio access - Uses CoreAudio Tap to capture system audio directly from the buffer, eliminating the need for ScreenCaptureKit or screen recording permissions
🛡️ Zero data transmission - Your conversations never leave your Mac - no servers, no analytics, no tracking
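The on-device guarantee maps to a single flag on the recognition request. A minimal sketch (function name and locale are illustrative):

```swift
import Speech

// Sketch: force SFSpeechRecognizer to run entirely on-device.
func makeOnDeviceRequest() -> SFSpeechAudioBufferRecognitionRequest? {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.supportsOnDeviceRecognition else {
        return nil // on-device model not available for this locale
    }
    let request = SFSpeechAudioBufferRecognitionRequest()
    request.requiresOnDeviceRecognition = true  // never contact Apple's servers
    request.shouldReportPartialResults = true   // stream words as they arrive
    return request
}
```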
Development History
- Compared whisper.cpp against the built-in SFSpeechRecognizer.
- Explored three audio-architecture approaches:
- VAD-Based Silence Detection
- 5-Second Fixed Sliding Windows
- 30-Second WhisperLive-Inspired Buffer
To reset permission grants during development: `tccutil reset All com.xxx.xx`
Based on SFSpeechRecognizer from Apple's built-in Speech framework.
Approach 1: VAD-Based Silence Detection ✅ **Most Reliable**
Files: BufferManager.swift, VADProcessor.swift, EnhancedVAD.swift
How it works:
- Accumulates speech until 3 consecutive silence frames
- Triggers inference on speech end or 15s maximum
- RMS threshold (0.01) with asymmetric hysteresis
Characteristics: Event-driven, variable buffer, speech-only segments
Status: ✅ Best balance of quality and usability
Limitations: Variable latency, potential word cutoff, VAD tuning needed
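The core of Approach 1 can be sketched as a small state machine. Only the documented numbers (0.01 RMS threshold, 3 silence frames, 15 s cap) come from the text above; the type name and the exact hysteresis factor are assumptions:

```swift
import Foundation

// Minimal sketch of VAD-based silence detection with asymmetric hysteresis.
struct SilenceGate {
    var rmsThreshold: Float = 0.01     // documented entry threshold
    var hysteresisRatio: Float = 0.5   // assumed: lower exit threshold keeps speech "sticky"
    var maxSilenceFrames = 3
    var maxBufferSeconds = 15.0

    private var inSpeech = false
    private var silenceRun = 0
    private var bufferedSeconds = 0.0

    /// Returns true when the accumulated segment should be sent to inference.
    mutating func process(frame: [Float], frameDuration: Double) -> Bool {
        let rms = sqrt(frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count))
        // Asymmetric hysteresis: harder to enter speech than to stay in it.
        let threshold = inSpeech ? rmsThreshold * hysteresisRatio : rmsThreshold

        if rms >= threshold {
            inSpeech = true
            silenceRun = 0
            bufferedSeconds += frameDuration
        } else if inSpeech {
            silenceRun += 1
            bufferedSeconds += frameDuration
        }

        let speechEnded = inSpeech && silenceRun >= maxSilenceFrames
        let hitCap = bufferedSeconds >= maxBufferSeconds
        if speechEnded || hitCap {
            inSpeech = false
            silenceRun = 0
            bufferedSeconds = 0
            return true
        }
        return false
    }
}
```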
Approach 2: 5-Second Sliding Windows ❌ **Word-Level Chaos**
Files: ContinuousStreamManager.swift, TranscriptionStabilizationManager.swift
How it works:
- 5s sliding window with 1s stride (4s overlap)
- LocalAgreement algorithm for word-level stabilization
- Temporal overlap analysis for conflicts
Characteristics: Fixed 1s intervals, 5s buffer, word-level matching
Status: ❌ Overlap analysis creates transcription instability
Limitations: Complex word matching, frequent text changes, poor readability
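The LocalAgreement idea is: only commit words that two consecutive hypotheses agree on, so displayed text never flickers backwards. A simplified sketch (names are illustrative):

```swift
// Sketch of a LocalAgreement-style stabilizer: commit the longest
// common word prefix of the last two hypotheses.
struct LocalAgreementStabilizer {
    private var previous: [String] = []
    private var committed: [String] = []

    mutating func update(hypothesis: [String]) -> [String] {
        var agreed: [String] = []
        for (a, b) in zip(previous, hypothesis) {
            guard a == b else { break }  // stop at the first disagreement
            agreed.append(a)
        }
        if agreed.count > committed.count {
            committed = agreed           // commitments only ever grow
        }
        previous = hypothesis
        return committed
    }
}
```

The instability described above comes from the overlap analysis feeding this stage: when the 4 s overlap keeps producing different word boundaries, the agreed prefix stays short and visible text keeps churning.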
Approach 3: 30-Second WhisperLive ❌ **High Latency**
Files: WhisperLiveContinuousManager.swift, WhisperLiveAudioBuffer.swift
How it works:
- Continuous 30s audio buffer
- 1s inference intervals with smart trimming
- Pre-inference VAD for speech extraction
Characteristics: Fixed 1s intervals, 30s context, maximum Whisper context
Status: ❌ >2s latency unsuitable for real-time
Limitations: Excessive latency, high overhead, memory intensive
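The 30 s rolling buffer reduces to keeping the newest N samples; the inference call is out of scope here and the names are assumptions (the real "smart trimming" aligns cuts with VAD boundaries rather than dropping blindly):

```swift
// Sketch of a 30 s rolling audio buffer at 16 kHz.
struct RollingAudioBuffer {
    let sampleRate: Double = 16_000
    let maxSeconds: Double = 30
    private(set) var samples: [Float] = []

    mutating func append(_ chunk: [Float]) {
        samples.append(contentsOf: chunk)
        let maxSamples = Int(sampleRate * maxSeconds)
        if samples.count > maxSamples {
            // Simplified trimming: drop the oldest audio.
            samples.removeFirst(samples.count - maxSamples)
        }
    }
}
```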
After extensive testing of all three approaches:
- Approach 1 (VAD-Based) is currently the most practical solution, providing the best balance of quality and usability despite variable latency.
- Approach 2 (5s Sliding) suffers from word-level chaos due to complex overlap analysis, making transcriptions unstable and hard to read.
- Approach 3 (30s WhisperLive) provides excellent context but has unacceptable latency (>2s) for real-time applications.
Comparison Chart
| Aspect | Approach 1: VAD-Based | Approach 2: 5s Sliding | Approach 3: 30s WhisperLive |
|---|---|---|---|
| Trigger | Silence detection | Fixed 1s intervals | Fixed 1s intervals |
| Buffer Size | Variable (up to 15s) | Fixed 5s sliding | Variable (0-30s) |
| Overlap | None | 4s temporal overlap | Continuous context |
| Latency | Variable (silence-dependent) | Predictable 1s | Predictable 1s |
| Context | Speech segments only | 5s windows | Maximum 30s context |
| Stabilization | None | LocalAgreement | Pre-inference VAD |
We welcome contributions! Please read our Contributing Guidelines before submitting PRs.
Key Requirements:
- Privacy first (no data collection/network features)
- Lightweight performance (maintain efficiency)
- Simple UI design (minimal interface)
- Follow the PR template with motivation, code summary, AI-assistance notes, and a demo (optional)
Known issue: `invalid display identifier 37D8832A-2D66-02CA-B9F7-8F30A301B230` can appear when the monitor configuration changes.
- Evaluate the new SpeechAnalyzer API once macOS 26 is released (non-beta). Nov 2025.
- Implement MLX Whisper and compare performance. Oct 2025.
- Add KV cache support.
- Tokenizer support.
- Quantization support for speedup
- Explore hybrid approaches combining the best aspects of each method
- Investigate adaptive buffer sizing based on speech patterns
- Optimize VAD parameters for different acoustic environments
MLX-Swift only supports safetensors files. Use Utilities/convert.py to convert .pt files to .safetensors format.
Required Files:
Livcap/CoreWhisperCpp/ggml-base.en.bin
Livcap/CoreWhisperCpp/ggml-tiny.en.bin
Livcap/CoreWhisperCpp/ggml-base.en-encoder.mlmodelc
Livcap/CoreWhisperCpp/ggml-tiny.en-encoder.mlmodelc
Livcap/CoreWhisperCpp/whisper.xcframework