CLAUDE.md
+33 -16 (33 additions & 16 deletions)
@@ -37,20 +37,27 @@ uv run ruff check
 
 # Fix linting issues automatically
 uv run ruff check --fix
+
+# Setup pre-commit hooks
+uv run pre-commit install
+
+# Run pre-commit hooks manually
+uv run pre-commit run --all-files
 ```
 
 ## Architecture Overview
 
-This is a Windows push-to-talk speech-to-text application with dual interfaces (GUI and console) that uses OpenAI's API for transcription and text refinement.
+This is a cross-platform push-to-talk speech-to-text application with dual interfaces (GUI and console) that uses OpenAI's API for transcription and text refinement. **Now supports Windows, MacOS, and Linux.**
 
 ### Core Components
 - **PushToTalkApp** (`src/push_to_talk.py`): Main orchestrator with configuration management and dynamic updates
 - **ConfigurationGUI** (`src/config_gui.py`): Persistent GUI interface with real-time status management
 - **AudioRecorder** (`src/audio_recorder.py`): PyAudio-based recording with configurable audio settings
 - **Transcriber** (`src/transcription.py`): OpenAI Whisper integration for speech-to-text
 - **TextRefiner** (`src/text_refiner.py`): GPT-based text improvement and correction
-- **TextInserter** (`src/text_inserter.py`): Windows text insertion via clipboard or sendkeys
-- **HotkeyService** (`src/hotkey_service.py`): Global hotkey detection requiring admin privileges
+- **TextInserter** (`src/text_inserter.py`): Cross-platform text insertion via pyautogui and pyperclip
+- **HotkeyService** (`src/hotkey_service.py`): Global hotkey detection (Windows admin privileges required)
+- **Utils** (`src/utils.py`): Cross-platform audio feedback using pygame and numpy
 
 ### Entry Points
 - **main_gui.py**: GUI application with persistent configuration interface
@@ -62,7 +69,7 @@ This is a Windows push-to-talk speech-to-text application with dual interfaces (
 2. User releases hotkey → Recording stops, audio saved to temp file
 3. Audio sent to OpenAI Whisper for transcription
 4. Raw text optionally refined using GPT models
-5. Refined text inserted into active window via Windows API
+5. Refined text inserted into active window via cross-platform APIs
 
 ### Configuration System
 - **File-based**: `push_to_talk_config.json` for persistent settings
@@ -72,19 +79,20 @@ This is a Windows push-to-talk speech-to-text application with dual interfaces (
 
 ## Key Technical Details
 
-### Windows-Specific Requirements
-- **Administrator privileges**: Required for global hotkey detection
-- **pywin32**: Used for Windows text insertion and audio feedback
-- **Audio permissions**: Microphone access required for recording
+### Cross-Platform Support
+- **Text insertion**: Uses pyautogui and pyperclip for cross-platform compatibility
+- **Audio feedback**: Uses pygame mixer with numpy tone generation
+- **Hotkey detection**: Uses keyboard library (Windows admin privileges still required)
+- **GUI**: Tkinter for cross-platform interface
 
 ### Audio Processing
 - **Sample rates**: 8kHz-44.1kHz supported, 16kHz recommended for Whisper
 - **Formats**: WAV files for temporary audio storage
-- **Feedback**: Optional audio cues using Windows winsound module
+- **Feedback**: Cross-platform audio cues using pygame with pure tone generation (880Hz start, 660Hz stop)
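
Note on the feedback tones: a minimal sketch of how the 880 Hz start and 660 Hz stop cues could be generated with numpy. The function name `make_tone`, the 44.1 kHz output rate, the 150 ms duration, the volume, and the 5 ms fade are all assumptions for illustration and are not taken from `src/utils.py`.

```python
import numpy as np

SAMPLE_RATE = 44_100  # assumed playback rate, not taken from the project


def make_tone(freq_hz: float, duration_s: float = 0.15, volume: float = 0.3) -> np.ndarray:
    """Return a mono 16-bit sine tone with a short fade in/out to avoid clicks."""
    n = round(SAMPLE_RATE * duration_s)
    t = np.arange(n) / SAMPLE_RATE
    wave = np.sin(2 * np.pi * freq_hz * t)

    # 5 ms linear ramps at both ends (assumed value) soften the onset and release.
    fade = min(n // 2, round(SAMPLE_RATE * 0.005))
    envelope = np.ones(n)
    if fade > 0:
        ramp = np.linspace(0.0, 1.0, fade)
        envelope[:fade] = ramp
        envelope[-fade:] = ramp[::-1]

    return (wave * envelope * volume * 32767).astype(np.int16)


start_tone = make_tone(880.0)  # higher beep for "recording started"
stop_tone = make_tone(660.0)   # lower beep for "recording stopped"
```

With a mono mixer (for example `pygame.mixer.init(frequency=SAMPLE_RATE, size=-16, channels=1)`), an array like this can be wrapped with `pygame.sndarray.make_sound(start_tone)` and played; whether the project does it this way is not shown in this diff.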
 
 ### Text Insertion Methods
-- **sendkeys**: Character-by-character simulation, better for special characters
-- **clipboard**: Faster method using Ctrl+V, may not work in all applications
+- **sendkeys**: Character-by-character simulation using pyautogui, better for special characters
+- **clipboard**: Faster method using pyperclip + pyautogui Ctrl+V, may not work in all applications
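
Note on the two insertion methods: a rough sketch of how `sendkeys` and `clipboard` could map onto `pyautogui.write`, `pyperclip.copy`, and `pyautogui.hotkey`. The function `insert_text` and its injectable callables are hypothetical and do not reproduce `src/text_inserter.py`. Using Command+V on macOS is also an assumption; the diff only mentions Ctrl+V.

```python
import sys
from typing import Callable, Optional


def insert_text(
    text: str,
    method: str = "sendkeys",
    *,
    type_text: Optional[Callable[[str], None]] = None,
    copy: Optional[Callable[[str], None]] = None,
    press_hotkey: Optional[Callable[..., None]] = None,
) -> None:
    """Insert text into the focused window via 'sendkeys' or 'clipboard'."""
    if method == "sendkeys":
        if type_text is None:
            import pyautogui  # imported lazily: it needs a desktop session
            type_text = pyautogui.write
        type_text(text)  # simulates one keystroke per character
    elif method == "clipboard":
        if copy is None:
            import pyperclip
            copy = pyperclip.copy
        if press_hotkey is None:
            import pyautogui
            press_hotkey = pyautogui.hotkey
        copy(text)
        # Assumption: paste is Command+V on macOS and Ctrl+V elsewhere.
        modifier = "command" if sys.platform == "darwin" else "ctrl"
        press_hotkey(modifier, "v")
    else:
        raise ValueError(f"unknown insertion method: {method!r}")
```

Passing the callables in keeps the sketch testable without a display; in normal use the defaults fall through to the real libraries.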
 
 ### Configuration Parameters
 Key settings in `PushToTalkConfig` class:
@@ -94,23 +102,32 @@ Key settings in `PushToTalkConfig` class:
README.md
+60 -22 (60 additions & 22 deletions)
@@ -1,32 +1,31 @@
 # PushToTalk - AI Refined Speech-to-Text Dictation
 
-A Python application that provides push-to-talk speech-to-text functionality with AI speech to text transcription, smart text refinement, and automatic text insertion into the active window on Windows. **Now features a persistent GUI configuration interface with real-time status management and easy application control.**
+A Python application that provides push-to-talk speech-to-text functionality with AI speech to text transcription, smart text refinement, and automatic text insertion into the active window on Windows, MacOS, and Linux. **Now features a persistent GUI configuration interface with real-time status management and easy application control.**
 
 ## Features
 
 - **🎯 GUI Interface**: Integrated configuration control and application status monitoring in one window
 - **🎤 Push-to-Talk Recording**: Hold a customizable hotkey to record audio
 - **🤖 Speech-to-Text**: Uses OpenAI Whisper for accurate transcription
+- **⚡ Smart Audio Processing**: Automatic silence removal and pitch-preserving speed adjustment for faster transcription
 - **✨ Text Refinement**: Improves transcription quality using Refinement Models
 - **📝 Auto Text Insertion**: Automatically inserts refined text into the active window
 - **🔊 Audio Feedback**: Optional audio cues for recording start/stop
-- **📋 Multiple Insertion Methods**: Support for clipboard and sendkeys insertion
+- **📋 Multiple Insertion Methods**: Support for `clipboard` and `sendkeys` insertion
 
 ## Roadmap
 
 - [x] GUI for configuration
+- [x] Full cross-platform support (Windows, MacOS, Linux)
 - [ ] Customizable glossary for transcription refinement
-- [ ] Streaming transcription with ongoing audio
 - [ ] Local Whisper model support
-- [ ] Cross-platform support (MacOS, Linux)
+- [ ] Streaming transcription with ongoing audio (Optional)
 - OpenAI API key (https://platform.openai.com/docs/api-reference/introduction)
 - Microphone access (for recording)
-- Administrator privileges (for global hotkey detection)
 
 ## Quick Start (GUI Application)
 
@@ -92,6 +91,7 @@ The application features a comprehensive, persistent configuration GUI with orga
 - **Sample Rate**: 8kHz to 44.1kHz options (16kHz recommended)
 - **Chunk Size**: Buffer size configuration
 - **Channels**: Mono/stereo recording options
+- **Audio Processing**: Smart silence removal and pitch-preserving speed adjustment
 - **Helpful Recommendations**: Built-in guidance for optimal settings
 
 ### ⌨️ Hotkey Configuration
@@ -164,7 +164,12 @@ The application creates a `push_to_talk_config.json` file. Example configuration
   "insertion_delay": 0.005,
   "enable_text_refinement": true,
   "enable_logging": true,
-  "enable_audio_feedback": true
+  "enable_audio_feedback": true,
+  "enable_audio_processing": true,
+  "debug_mode": false,
+  "silence_threshold": -16.0,
+  "min_silence_duration": 400.0,
+  "speed_factor": 1.5
 }
 ```
 
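
Note on the new configuration keys: a minimal sketch of reading them from `push_to_talk_config.json` with fallbacks to the defaults listed in the README table. The `AudioProcessingSettings` dataclass and `load_settings` helper are hypothetical names for illustration; the project's own `PushToTalkConfig` class is not shown in this diff and may differ.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class AudioProcessingSettings:
    """Hypothetical container for the five keys added in this change."""

    enable_audio_processing: bool = True
    debug_mode: bool = False
    silence_threshold: float = -16.0    # dBFS
    min_silence_duration: float = 400.0  # milliseconds
    speed_factor: float = 1.5


def load_settings(path: Path) -> AudioProcessingSettings:
    """Load known keys from a JSON file; missing keys keep their defaults."""
    raw = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    known = {f.name for f in fields(AudioProcessingSettings)}
    # Keys outside this sketch (e.g. "insertion_delay") are ignored here.
    return AudioProcessingSettings(**{k: v for k, v in raw.items() if k in known})
```

A config file written before this change has none of the new keys, so under this sketch it would load with audio processing enabled at the default threshold, duration, and speed factor.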
@@ -185,6 +190,11 @@ The application creates a `push_to_talk_config.json` file. Example configuration
 | `enable_text_refinement` | boolean | `true` | Whether to use GPT to refine transcribed text. Disable for faster processing without refinement. |
 | `enable_logging` | boolean | `true` | Whether to enable detailed logging to `push_to_talk.log` file and console. |
 | `enable_audio_feedback` | boolean | `true` | Whether to play sophisticated audio cues when starting/stopping recording. Provides immediate feedback for hotkey interactions. |
+| `enable_audio_processing` | boolean | `true` | Whether to enable smart audio processing (silence removal and speed adjustment) for faster transcription. |
+| `debug_mode` | boolean | `false` | Whether to enable debug mode. If enabled, processed audio files will be saved to the current directory. |
+| `silence_threshold` | float | `-16.0` | dBFS threshold for silence detection. Higher values (closer to 0) are more sensitive to quiet sounds. |
+| `min_silence_duration` | float | `400.0` | Minimum duration of silence in milliseconds required to split audio segments. |
+| `speed_factor` | float | `1.5` | Speed adjustment factor. 1.5 means 1.5x faster playback while preserving pitch quality. |
 
 #### Audio Quality Settings
 
@@ -202,6 +212,23 @@ The application creates a `push_to_talk_config.json` file. Example configuration
   - `1` - Mono recording (recommended for speech)
   - `2` - Stereo recording (unnecessary for speech-to-text)
 
+#### Audio Processing Settings
+
+- **silence_threshold**:
+  - `-16.0` (dBFS) - Recommended balance between noise removal and speech preservation
+  - `-10.0` - More aggressive silence removal (may cut quiet speech)
+  - `-30.0` - Less aggressive (keeps more background noise)
+
+- **min_silence_duration**:
+  - `400.0` ms - Recommended for natural speech patterns
+  - `200.0` ms - More aggressive silence removal (faster processing)
+  - `800.0` ms - Conservative (preserves natural pauses)
+
+- **speed_factor**:
+  - `1.5` - Recommended 1.5x speedup with pitch preservation
+  - `1.0` - No speed adjustment (original timing)
+  - `2.0` - 2x speedup (more aggressive, may affect quality)
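
Note on `silence_threshold`: a small numpy sketch of what a dBFS threshold means in practice. It assumes the threshold is compared against the RMS level of a chunk, so anything quieter than the threshold counts as silence. The helpers `dbfs` and `is_silent` are hypothetical and stand in for whatever pydub-based logic `src/audio_processor.py` actually uses.

```python
import numpy as np


def dbfs(samples: np.ndarray) -> float:
    """RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    rms = float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))
    return float("-inf") if rms == 0.0 else float(20.0 * np.log10(rms))


def is_silent(samples: np.ndarray, silence_threshold: float = -16.0) -> bool:
    """Treat a chunk as silence when its level falls below the threshold."""
    return dbfs(samples) < silence_threshold


# One second of a 440 Hz sine at 16 kHz, at two amplitudes.
t = np.arange(16_000) / 16_000
normal_speech = 0.5 * np.sin(2 * np.pi * 440 * t)  # about -9 dBFS
quiet_speech = 0.3 * np.sin(2 * np.pi * 440 * t)   # about -13.5 dBFS
```

Under these assumptions `quiet_speech` survives the default `-16.0` threshold but is classified as silence at `-10.0`, which matches the "may cut quiet speech" warning above; at `-30.0` both chunks are kept.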
+
 ### Hotkey Options
 
 You can configure different hotkey combinations for both modes:
@@ -222,8 +249,8 @@ Both hotkeys support any combination from the `keyboard` library.
 
 ### Text Insertion Methods
 
-- **sendkeys** (default): Simulates individual keystrokes, better for special characters
-- **clipboard**: Faster and more reliable, uses Ctrl+V
+- **sendkeys** (default): Simulates individual keystrokes using pyautogui, better for special characters
+- **clipboard**: Faster and more reliable, uses pyperclip and pyautogui for Ctrl+V
 
 ### Audio Feedback
 
@@ -233,7 +260,7 @@ The application includes clean and simple audio feedback:
 - **Recording Stop**: A lower confirmation beep (660 Hz) that confirms recording completion
 - **Non-Blocking**: Audio playback runs in separate threads to avoid interfering with recording or transcription
 - **Configurable**: Can be toggled on/off via GUI or configuration JSON file
@@ -258,9 +286,10 @@ The application consists of several modular components:
 - **ConfigurationGUI** (`src/config_gui.py`): User-friendly GUI for settings management
 - **MainGUI** (`main_gui.py`): Entry point with welcome flow and startup management
 - **AudioRecorder** (`src/audio_recorder.py`): Handles audio recording using PyAudio
+- **AudioProcessor** (`src/audio_processor.py`): Smart audio processing with silence removal and pitch-preserving speed adjustment using pydub and psola
 - **Transcriber** (`src/transcription.py`): Converts speech to text using OpenAI Whisper
 - **TextRefiner** (`src/text_refiner.py`): Improves transcription using Refinement Models
-- **TextInserter** (`src/text_inserter.py`): Inserts text into active windows using pywin32
+- **TextInserter** (`src/text_inserter.py`): Inserts text into active windows using pyautogui and pyperclip
 - **HotkeyService** (`src/hotkey_service.py`): Manages global hotkey detection
 - **PushToTalkApp** (`src/push_to_talk.py`): Main application orchestrator with dynamic configuration updates
 
@@ -278,17 +307,24 @@ The application consists of several modular components:
 
 1. User presses hotkey → Audio recording starts
 2. User releases hotkey → Recording stops
-3. Audio file is sent to OpenAI Whisper for transcription
-4. Raw transcription is refined using Refinement Models (if enabled)
-5. Refined text is inserted into the active window
+3. Audio file is processed (silence removal and speed adjustment for faster transcription)
+4. Processed audio is sent to OpenAI Whisper for transcription
+5. Raw transcription is refined using Refinement Models (if enabled)
+6. Refined text is inserted into the active window
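
Note on the updated flow: steps 3 to 6 can be sketched as a small function that chains the stages. The name `run_pipeline` and the callable parameters are hypothetical and stand in for the real `AudioProcessor`, `Transcriber`, `TextRefiner`, and `TextInserter` components. Skipping insertion when the transcription is empty is an assumption, not something stated in this diff.

```python
from typing import Callable, Optional


def run_pipeline(
    audio_path: str,
    *,
    process_audio: Optional[Callable[[str], str]],
    transcribe: Callable[[str], str],
    refine: Optional[Callable[[str], str]],
    insert: Callable[[str], None],
) -> str:
    """Run steps 3-6 on a finished recording and return the inserted text."""
    # Step 3: optional silence removal and speed adjustment.
    if process_audio is not None:
        audio_path = process_audio(audio_path)

    # Step 4: speech-to-text.
    text = transcribe(audio_path)
    if not text.strip():
        return ""  # assumption: nothing is inserted for an empty transcription

    # Step 5: optional refinement.
    if refine is not None:
        text = refine(text)

    # Step 6: insert into the active window.
    insert(text)
    return text
```

Passing `None` for `process_audio` or `refine` mirrors the `enable_audio_processing` and `enable_text_refinement` switches being turned off.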
 
 ## Dependencies
 
 - **tkinter**: GUI interface (built into Python)
 - **keyboard**: Global hotkey detection
+- **numpy**: Audio tone generation for feedback sounds
 - **pyaudio**: Audio recording
+- **pydub**: Smart silence detection and audio manipulation