Commit 45dd0b7

Merge pull request #5 from yixin0829/feat/silence-cropping-speed-up
Add smart audio processing with silence cropping and pitch-preserving speed adjustment
2 parents c09d469 + a9d4ef2 commit 45dd0b7

12 files changed, +1628 -118 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -15,3 +15,6 @@ wheels/

 push_to_talk.log
 push_to_talk_config.json
+
+.claude
+.ruff_cache

.python-version

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-3.13
+3.12

CLAUDE.md

Lines changed: 33 additions & 16 deletions
@@ -37,20 +37,27 @@ uv run ruff check

 # Fix linting issues automatically
 uv run ruff check --fix
+
+# Setup pre-commit hooks
+uv run pre-commit install
+
+# Run pre-commit hooks manually
+uv run pre-commit run --all-files
 ```

 ## Architecture Overview

-This is a Windows push-to-talk speech-to-text application with dual interfaces (GUI and console) that uses OpenAI's API for transcription and text refinement.
+This is a cross-platform push-to-talk speech-to-text application with dual interfaces (GUI and console) that uses OpenAI's API for transcription and text refinement. **Now supports Windows, MacOS, and Linux.**

 ### Core Components
 - **PushToTalkApp** (`src/push_to_talk.py`): Main orchestrator with configuration management and dynamic updates
 - **ConfigurationGUI** (`src/config_gui.py`): Persistent GUI interface with real-time status management
 - **AudioRecorder** (`src/audio_recorder.py`): PyAudio-based recording with configurable audio settings
 - **Transcriber** (`src/transcription.py`): OpenAI Whisper integration for speech-to-text
 - **TextRefiner** (`src/text_refiner.py`): GPT-based text improvement and correction
-- **TextInserter** (`src/text_inserter.py`): Windows text insertion via clipboard or sendkeys
-- **HotkeyService** (`src/hotkey_service.py`): Global hotkey detection requiring admin privileges
+- **TextInserter** (`src/text_inserter.py`): Cross-platform text insertion via pyautogui and pyperclip
+- **HotkeyService** (`src/hotkey_service.py`): Global hotkey detection (Windows admin privileges required)
+- **Utils** (`src/utils.py`): Cross-platform audio feedback using pygame and numpy

 ### Entry Points
 - **main_gui.py**: GUI application with persistent configuration interface
@@ -62,7 +69,7 @@ This is a Windows push-to-talk speech-to-text application with dual interfaces (
 2. User releases hotkey → Recording stops, audio saved to temp file
 3. Audio sent to OpenAI Whisper for transcription
 4. Raw text optionally refined using GPT models
-5. Refined text inserted into active window via Windows API
+5. Refined text inserted into active window via cross-platform APIs

 ### Configuration System
 - **File-based**: `push_to_talk_config.json` for persistent settings
@@ -72,19 +79,20 @@ This is a Windows push-to-talk speech-to-text application with dual interfaces (

 ## Key Technical Details

-### Windows-Specific Requirements
-- **Administrator privileges**: Required for global hotkey detection
-- **pywin32**: Used for Windows text insertion and audio feedback
-- **Audio permissions**: Microphone access required for recording
+### Cross-Platform Support
+- **Text insertion**: Uses pyautogui and pyperclip for cross-platform compatibility
+- **Audio feedback**: Uses pygame mixer with numpy tone generation
+- **Hotkey detection**: Uses keyboard library (Windows admin privileges still required)
+- **GUI**: Tkinter for cross-platform interface

 ### Audio Processing
 - **Sample rates**: 8kHz-44.1kHz supported, 16kHz recommended for Whisper
 - **Formats**: WAV files for temporary audio storage
-- **Feedback**: Optional audio cues using Windows winsound module
+- **Feedback**: Cross-platform audio cues using pygame with pure tone generation (880Hz start, 660Hz stop)

 ### Text Insertion Methods
-- **sendkeys**: Character-by-character simulation, better for special characters
-- **clipboard**: Faster method using Ctrl+V, may not work in all applications
+- **sendkeys**: Character-by-character simulation using pyautogui, better for special characters
+- **clipboard**: Faster method using pyperclip + pyautogui Ctrl+V, may not work in all applications

 ### Configuration Parameters
 Key settings in `PushToTalkConfig` class:
@@ -94,23 +102,32 @@ Key settings in `PushToTalkConfig` class:
 - `hotkey`/`toggle_hotkey`: Customizable key combinations
 - `insertion_method`: "sendkeys" or "clipboard"
 - `enable_text_refinement`: Toggle GPT text improvement
+- `enable_audio_feedback`: Toggle audio feedback sounds

 ## Development Workflow

 ### Making Changes
 1. Test changes with both GUI and console applications
-2. Ensure admin privileges are handled correctly for hotkey functionality
+2. Ensure proper cross-platform compatibility for new features
 3. Validate OpenAI API integration with proper error handling
-4. Test text insertion in various Windows applications
+4. Test text insertion in various applications across platforms
+5. Pre-commit hooks automatically format and lint code

 ### Building for Distribution
-1. Use `build.bat` for standard GUI executable
+1. Use `build.bat` for Windows GUI executable
 2. Modify `push_to_talk.spec` for console builds or customization
-3. Test executable on clean Windows system without Python installed
+3. Test executable on clean systems without Python installed
 4. Consider antivirus false positives with PyInstaller executables
+5. Update hiddenimports in `.spec` file when adding new dependencies

 ### Configuration Testing
 - Use GUI "Test Configuration" button for API validation
 - Test hotkey combinations don't conflict with system shortcuts
-- Verify text insertion works in target applications (text editors, browsers, etc.)
+- Verify text insertion works in target applications across platforms
 - Check audio settings produce clear recordings for transcription accuracy
+- Test audio feedback works across different audio systems
+
+### Dependencies Management
+- Core dependencies: keyboard, numpy, openai, pyaudio, pyautogui, pyperclip, pygame
+- Dev dependencies: pre-commit, pyinstaller, python-dotenv, ruff
+- PyInstaller spec may need updates for cross-platform builds (currently Windows-focused)

README.md

Lines changed: 60 additions & 22 deletions
@@ -1,32 +1,31 @@
 # PushToTalk - AI Refined Speech-to-Text Dictation

-A Python application that provides push-to-talk speech-to-text functionality with AI speech to text transcription, smart text refinement, and automatic text insertion into the active window on Windows. **Now features a persistent GUI configuration interface with real-time status management and easy application control.**
+A Python application that provides push-to-talk speech-to-text functionality with AI speech to text transcription, smart text refinement, and automatic text insertion into the active window on Windows, MacOS, and Linux. **Now features a persistent GUI configuration interface with real-time status management and easy application control.**

 ## Features

 - **🎯 GUI Interface**: Integrated configuration control and application status monitoring in one window
 - **🎤 Push-to-Talk Recording**: Hold a customizable hotkey to record audio
 - **🤖 Speech-to-Text**: Uses OpenAI Whisper for accurate transcription
+- **⚡ Smart Audio Processing**: Automatic silence removal and pitch-preserving speed adjustment for faster transcription
 - **✨ Text Refinement**: Improves transcription quality using Refinement Models
 - **📝 Auto Text Insertion**: Automatically inserts refined text into the active window
 - **🔊 Audio Feedback**: Optional audio cues for recording start/stop
-- **📋 Multiple Insertion Methods**: Support for clipboard and sendkeys insertion
+- **📋 Multiple Insertion Methods**: Support for `clipboard` and `sendkeys` insertion

 ## Roadmap

 - [x] GUI for configuration
+- [x] Full cross-platform support (Windows, MacOS, Linux)
 - [ ] Customizable glossary for transcription refinement
-- [ ] Streaming transcription with ongoing audio
 - [ ] Local Whisper model support
-- [ ] Cross-platform support (MacOS, Linux)
+- [ ] Streaming transcription with ongoing audio (Optional)

 ## Requirements

-- Windows OS (10/11)
 - [uv](https://docs.astral.sh/uv/) (Python package manager)
 - OpenAI API key (https://platform.openai.com/docs/api-reference/introduction)
 - Microphone access (for recording)
-- Administrator privileges (for global hotkey detection)

 ## Quick Start (GUI Application)

@@ -92,6 +91,7 @@ The application features a comprehensive, persistent configuration GUI with orga
 - **Sample Rate**: 8kHz to 44.1kHz options (16kHz recommended)
 - **Chunk Size**: Buffer size configuration
 - **Channels**: Mono/stereo recording options
+- **Audio Processing**: Smart silence removal and pitch-preserving speed adjustment
 - **Helpful Recommendations**: Built-in guidance for optimal settings

 ### ⌨️ Hotkey Configuration
@@ -164,7 +164,12 @@ The application creates a `push_to_talk_config.json` file. Example configuration
 "insertion_delay": 0.005,
 "enable_text_refinement": true,
 "enable_logging": true,
-"enable_audio_feedback": true
+"enable_audio_feedback": true,
+"enable_audio_processing": true,
+"debug_mode": false,
+"silence_threshold": -16.0,
+"min_silence_duration": 400.0,
+"speed_factor": 1.5
 }
 ```

@@ -185,6 +190,11 @@ The application creates a `push_to_talk_config.json` file. Example configuration
 | `enable_text_refinement` | boolean | `true` | Whether to use GPT to refine transcribed text. Disable for faster processing without refinement. |
 | `enable_logging` | boolean | `true` | Whether to enable detailed logging to `push_to_talk.log` file and console. |
 | `enable_audio_feedback` | boolean | `true` | Whether to play sophisticated audio cues when starting/stopping recording. Provides immediate feedback for hotkey interactions. |
+| `enable_audio_processing` | boolean | `true` | Whether to enable smart audio processing (silence removal and speed adjustment) for faster transcription. |
+| `debug_mode` | boolean | `false` | Whether to enable debug mode. If enabled, processed audio files will be saved to the current directory. |
+| `silence_threshold` | float | `-16.0` | dBFS threshold for silence detection. Higher values (closer to 0) are more sensitive to quiet sounds. |
+| `min_silence_duration` | float | `400.0` | Minimum duration of silence in milliseconds required to split audio segments. |
+| `speed_factor` | float | `1.5` | Speed adjustment factor. 1.5 means 1.5x faster playback while preserving pitch quality. |

 #### Audio Quality Settings
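
The five keys added to the configuration table in the hunk above could be read and sanity-checked roughly as follows. This is a minimal sketch, not the project's `PushToTalkConfig`: the class name `AudioProcessingSettings`, the helper `load_settings`, and the validation rules are all hypothetical, and only the key names, types, and defaults are taken from the table.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class AudioProcessingSettings:
    """Hypothetical container for the five keys added in this commit."""

    enable_audio_processing: bool = True
    debug_mode: bool = False
    silence_threshold: float = -16.0     # dBFS
    min_silence_duration: float = 400.0  # milliseconds
    speed_factor: float = 1.5

    def __post_init__(self) -> None:
        # Assumed sanity checks; the project may validate differently or not at all.
        if self.silence_threshold > 0:
            raise ValueError("silence_threshold is in dBFS and should be <= 0")
        if self.min_silence_duration < 0:
            raise ValueError("min_silence_duration must be non-negative")
        if self.speed_factor <= 0:
            raise ValueError("speed_factor must be positive")


def load_settings(raw_json: str) -> AudioProcessingSettings:
    """Pick the known keys out of a config document; missing keys keep their defaults."""
    data = json.loads(raw_json)
    known = {f.name for f in fields(AudioProcessingSettings)}
    return AudioProcessingSettings(**{k: v for k, v in data.items() if k in known})


settings = load_settings('{"enable_logging": true, "speed_factor": 2.0}')
print(settings.speed_factor, settings.silence_threshold)
```

Ignoring unknown keys keeps an older or newer `push_to_talk_config.json` loadable; whether the application behaves this way is not stated in the diff.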

@@ -202,6 +212,23 @@ The application creates a `push_to_talk_config.json` file. Example configuration
 - `1` - Mono recording (recommended for speech)
 - `2` - Stereo recording (unnecessary for speech-to-text)

+#### Audio Processing Settings
+
+- **silence_threshold**:
+- `-16.0` (dBFS) - Recommended balance between noise removal and speech preservation
+- `-10.0` - More aggressive silence removal (may cut quiet speech)
+- `-30.0` - Less aggressive (keeps more background noise)
+
+- **min_silence_duration**:
+- `400.0` ms - Recommended for natural speech patterns
+- `200.0` ms - More aggressive silence removal (faster processing)
+- `800.0` ms - Conservative (preserves natural pauses)
+
+- **speed_factor**:
+- `1.5` - Recommended 1.5x speedup with pitch preservation
+- `1.0` - No speed adjustment (original timing)
+- `2.0` - 2x speedup (more aggressive, may affect quality)
+
 ### Hotkey Options

 You can configure different hotkey combinations for both modes:
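
To make the `silence_threshold` and `min_silence_duration` settings from the hunk above concrete, here is a small standard-library sketch of threshold-based silence detection. It is an illustration under stated assumptions, not the code in `src/audio_processor.py` (which the README says uses pydub): the helper names, the 10 ms analysis window, and the 16-bit PCM input are all hypothetical. A window counts as silent when its RMS level in dBFS is below the threshold, and a silent run is reported only if it lasts at least `min_silence_duration` milliseconds.

```python
import math
from typing import List, Sequence, Tuple

FULL_SCALE = 32768.0  # magnitude of the largest 16-bit PCM sample


def window_dbfs(samples: Sequence[int]) -> float:
    """RMS level of a window relative to 16-bit full scale, in dBFS."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / FULL_SCALE) if rms > 0 else float("-inf")


def find_silences(
    samples: Sequence[int],
    sample_rate: int,
    silence_threshold: float = -16.0,     # dBFS, default from the README table
    min_silence_duration: float = 400.0,  # milliseconds, default from the README table
    window_ms: int = 10,                  # assumed analysis window
) -> List[Tuple[float, float]]:
    """Return (start_ms, end_ms) spans that stay below the threshold long enough."""
    window = max(1, sample_rate * window_ms // 1000)
    n_windows = math.ceil(len(samples) / window)
    spans: List[Tuple[float, float]] = []
    run_start = None
    # The extra iteration (i == n_windows) acts as a loud sentinel that closes a trailing run.
    for i in range(n_windows + 1):
        quiet = i < n_windows and (
            window_dbfs(samples[i * window:(i + 1) * window]) < silence_threshold
        )
        t = float(i * window_ms)
        if quiet and run_start is None:
            run_start = t
        elif not quiet and run_start is not None:
            if t - run_start >= min_silence_duration:
                spans.append((run_start, t))
            run_start = None
    return spans


def tone(n: int, rate: int = 16000, freq: float = 500.0, amplitude: int = 16384) -> List[int]:
    """Test signal: a sine at half of full scale, roughly -9 dBFS RMS."""
    return [int(amplitude * math.sin(2 * math.pi * freq * i / rate)) for i in range(n)]


# 0.5 s of tone, 0.6 s of digital silence, 0.5 s of tone at 16 kHz.
clip = tone(8000) + [0] * 9600 + tone(8000)
print(find_silences(clip, 16000))  # [(500.0, 1100.0)]
```

Raising the threshold toward 0 makes more windows count as silent, which matches the README's description of `-10.0` as the more aggressive setting; raising `min_silence_duration` to `800.0` would leave this 600 ms pause untouched.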
@@ -222,8 +249,8 @@ Both hotkeys support any combination from the `keyboard` library.

 ### Text Insertion Methods

-- **sendkeys** (default): Simulates individual keystrokes, better for special characters
-- **clipboard**: Faster and more reliable, uses Ctrl+V
+- **sendkeys** (default): Simulates individual keystrokes using pyautogui, better for special characters
+- **clipboard**: Faster and more reliable, uses pyperclip and pyautogui for Ctrl+V

 ### Audio Feedback
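
The two insertion methods changed in the hunk above (`sendkeys` and `clipboard`) can be sketched as a small dispatcher. This is a hypothetical outline rather than the contents of `src/text_inserter.py`: the function names are invented, the backends are passed in as callables so the sketch runs without a display, the use of `insertion_delay` as the per-key interval is an assumption, and so is the switch to Command+V on macOS (the README only mentions Ctrl+V).

```python
import sys
from typing import Callable, List, Tuple


def paste_modifier(platform: str = sys.platform) -> str:
    """Assumption: macOS pastes with Command+V, other platforms with Ctrl+V."""
    return "command" if platform == "darwin" else "ctrl"


def insert_text(
    text: str,
    method: str,
    *,
    copy: Callable[[str], None],
    hotkey: Callable[..., None],
    write: Callable[..., None],
    insertion_delay: float = 0.005,
    platform: str = sys.platform,
) -> None:
    """Dispatch to the clipboard or keystroke backend."""
    if method == "clipboard":
        copy(text)                             # put the text on the clipboard
        hotkey(paste_modifier(platform), "v")  # then trigger a paste
    elif method == "sendkeys":
        write(text, interval=insertion_delay)  # type character by character
    else:
        raise ValueError(f"unknown insertion method: {method!r}")


# Dry run with recording stand-ins instead of real keyboard/clipboard access.
calls: List[Tuple] = []
insert_text(
    "hello",
    "clipboard",
    copy=lambda t: calls.append(("copy", t)),
    hotkey=lambda *keys: calls.append(("hotkey",) + keys),
    write=lambda t, interval: calls.append(("write", t, interval)),
    platform="linux",
)
print(calls)  # [('copy', 'hello'), ('hotkey', 'ctrl', 'v')]
```

In a real session the stand-ins would presumably be `pyperclip.copy`, `pyautogui.hotkey`, and `pyautogui.write`, which accept these argument shapes.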

@@ -233,7 +260,7 @@ The application includes clean and simple audio feedback:
 - **Recording Stop**: A lower confirmation beep (660 Hz) that confirms recording completion
 - **Non-Blocking**: Audio playback runs in separate threads to avoid interfering with recording or transcription
 - **Configurable**: Can be toggled on/off via GUI or configuration JSON file
-- **Minimal Dependencies**: Uses Windows' built-in `winsound` module - no additional packages required
+- **Cross-Platform**: Uses `pygame` and `numpy` for tone generation - works on Windows, MacOS, and Linux

 ## Architecture
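
The audio feedback hunk above replaces `winsound` with generated tones (880 Hz for start, 660 Hz for stop). The sketch below shows one way such a tone could be synthesized as 16-bit PCM. It deliberately swaps numpy for the standard library so it runs anywhere, and it omits playback; the 150 ms duration, 0.3 volume, 10 ms fade, and the name `make_tone` are assumptions, with only the two frequencies taken from the diff.

```python
import math
from array import array

SAMPLE_RATE = 44100  # assumed output rate


def make_tone(frequency_hz: float, duration_ms: int = 150,
              volume: float = 0.3, fade_ms: int = 10) -> array:
    """Build a mono 16-bit sine tone with a short linear fade in and out."""
    n = SAMPLE_RATE * duration_ms // 1000
    fade_n = max(1, SAMPLE_RATE * fade_ms // 1000)
    samples = array("h")  # signed 16-bit integers
    for i in range(n):
        # The envelope ramps 0 -> 1 at the start and 1 -> 0 at the end to avoid clicks.
        envelope = min(1.0, i / fade_n, (n - 1 - i) / fade_n)
        value = volume * envelope * math.sin(2 * math.pi * frequency_hz * i / SAMPLE_RATE)
        samples.append(int(value * 32767))
    return samples


start_beep = make_tone(880.0)  # recording start cue
stop_beep = make_tone(660.0)   # recording stop cue
print(len(start_beep), len(stop_beep))  # 6615 6615
```

The resulting buffer (`start_beep.tobytes()`) is the kind of data a mixer such as pygame's could play from a background thread, which is how the README describes the non-blocking behaviour; that wiring is not shown here.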

@@ -246,7 +273,8 @@ flowchart TB
 %% Main Flow
 PushToTalkApp -->|"Initialize"| HotkeyService
 HotkeyService -->|"Start/Stop Recording"| AudioRecorder
-AudioRecorder -->|"Audio File"| Transcriber
+AudioRecorder -->|"Audio File"| AudioProcessor
+AudioProcessor -->|"Processed Audio"| Transcriber
 Transcriber -->|"AI Transcription"| TextRefiner
 TextRefiner -->|"AI Refinement"| TextInserter
 ```
@@ -258,9 +286,10 @@ The application consists of several modular components:
 - **ConfigurationGUI** (`src/config_gui.py`): User-friendly GUI for settings management
 - **MainGUI** (`main_gui.py`): Entry point with welcome flow and startup management
 - **AudioRecorder** (`src/audio_recorder.py`): Handles audio recording using PyAudio
+- **AudioProcessor** (`src/audio_processor.py`): Smart audio processing with silence removal and pitch-preserving speed adjustment using pydub and psola
 - **Transcriber** (`src/transcription.py`): Converts speech to text using OpenAI Whisper
 - **TextRefiner** (`src/text_refiner.py`): Improves transcription using Refinement Models
-- **TextInserter** (`src/text_inserter.py`): Inserts text into active windows using pywin32
+- **TextInserter** (`src/text_inserter.py`): Inserts text into active windows using pyautogui and pyperclip
 - **HotkeyService** (`src/hotkey_service.py`): Manages global hotkey detection
 - **PushToTalkApp** (`src/push_to_talk.py`): Main application orchestrator with dynamic configuration updates

@@ -278,17 +307,24 @@ The application consists of several modular components:

 1. User presses hotkey → Audio recording starts
 2. User releases hotkey → Recording stops
-3. Audio file is sent to OpenAI Whisper for transcription
-4. Raw transcription is refined using Refinement Models (if enabled)
-5. Refined text is inserted into the active window
+3. Audio file is processed (silence removal and speed adjustment for faster transcription)
+4. Processed audio is sent to OpenAI Whisper for transcription
+5. Raw transcription is refined using Refinement Models (if enabled)
+6. Refined text is inserted into the active window

 ## Dependencies

 - **tkinter**: GUI interface (built into Python)
 - **keyboard**: Global hotkey detection
+- **numpy**: Audio tone generation for feedback sounds
 - **pyaudio**: Audio recording
+- **pydub**: Smart silence detection and audio manipulation
+- **soundfile**: High-quality audio I/O
+- **psola**: Pitch-preserving time-scale modification
 - **openai**: Speech-to-text and text refinement
-- **pywin32**: Windows-specific text insertion and audio feedback (winsound)
+- **pyautogui**: Cross-platform text insertion and window management
+- **pyperclip**: Cross-platform clipboard operations
+- **pygame**: Cross-platform audio feedback
 - **python-dotenv**: Environment variable management

 ## Troubleshooting
@@ -317,9 +353,9 @@ The application consists of several modular components:

 ### Common Issues

-1. **"No module named 'pywin32'"** (Development):
+1. **"No module named 'pyautogui' or 'pyperclip'"** (Development):
 ```bash
-uv add pywin32
+uv add pyautogui pyperclip
 ```

 2. **"Could not find PyAudio"** (Development):
@@ -426,10 +462,12 @@ if result == "close":
 ## Performance Tips

 1. **Optimize audio settings**: Lower sample rates (8000-16000 Hz) for faster processing
-2. **Disable text refinement**: For faster transcription without GPT processing
-3. **Use clipboard method**: Generally faster than sendkeys for text insertion
-4. **Short recordings**: Keep recordings under 30 seconds for optimal performance
-5. **Monitor via GUI**: Use the status indicators to verify application is running efficiently
+2. **Enable audio processing**: Smart silence removal and speed adjustment can significantly reduce transcription time
+3. **Adjust silence threshold**: Fine-tune -16 dBFS for your environment (higher for noisy environments)
+4. **Disable text refinement**: For faster transcription without GPT processing
+5. **Use clipboard method**: Generally faster than sendkeys for text insertion
+6. **Short recordings**: Keep recordings under 30 seconds for optimal performance
+7. **Monitor via GUI**: Use the status indicators to verify application is running efficiently

 ## Security Considerations
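
As a rough illustration of the "Enable audio processing" tip in the hunk above, the length of the audio that reaches transcription can be estimated from the recording length, the amount of silence removed, and `speed_factor`. The function name and the example figures are hypothetical, and the estimate says nothing about actual API latency, which the diff does not quantify.

```python
def estimated_processed_seconds(recording_s: float, silence_removed_s: float,
                                speed_factor: float = 1.5) -> float:
    """Duration left after cropping silence and speeding up the remainder."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    if not 0 <= silence_removed_s <= recording_s:
        raise ValueError("silence_removed_s must lie between 0 and recording_s")
    return (recording_s - silence_removed_s) / speed_factor


# A 30 s recording with 6 s of pauses, at the default 1.5x speed-up.
print(estimated_processed_seconds(30.0, 6.0))  # 16.0
```

With `speed_factor` set to `1.0` the same recording would still shrink to 24 s from silence cropping alone.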

icon.ico

216 KB
Binary file not shown.

pyproject.toml

Lines changed: 6 additions & 1 deletion
@@ -3,13 +3,18 @@ name = "push-to-talk"
 version = "0.1.0"
 description = "A push-to-talk speech-to-text application using OpenAI API for real-time transcription and support active window text insertion."
 readme = "README.md"
-requires-python = ">=3.13"
+requires-python = ">=3.9"
 dependencies = [
 "keyboard>=0.13.5",
+"numpy>=1.24.0",
 "openai>=1.97.1",
 "pyaudio>=0.2.14",
 "pyautogui>=0.9.54",
 "pyperclip>=1.9.0",
+"pygame>=2.5.0",
+"soundfile>=0.13.1",
+"psola>=0.0.1",
+"pydub[scipy]>=0.25.1",
 ]

 [dependency-groups]
