yixin0829
diff --git a/.gitignore b/.gitignore
Lines changed: 3 additions & 0 deletions
diff --git a/.python-version b/.python-version
Lines changed: 1 addition & 1 deletion
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 33 additions & 16 deletions
diff --git a/README.md b/README.md
Lines changed: 60 additions & 22 deletions
diff --git a/icon.ico b/icon.ico
216 KB
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 6 additions & 1 deletion
@@ -15,3 +15,6 @@ wheels/
 
 push_to_talk.log
 push_to_talk_config.json
+
+.claude
+.ruff_cache
@@ -1 +1 @@
-3.13
+3.12
@@ -37,20 +37,27 @@ uv run ruff check
 
 # Fix linting issues automatically
 uv run ruff check --fix
+
+# Setup pre-commit hooks
+uv run pre-commit install
+
+# Run pre-commit hooks manually
+uv run pre-commit run --all-files
 ```
 
 ## Architecture Overview
 
-This is a Windows push-to-talk speech-to-text application with dual interfaces (GUI and console) that uses OpenAI's API for transcription and text refinement.
+This is a cross-platform push-to-talk speech-to-text application with dual interfaces (GUI and console) that uses OpenAI's API for transcription and text refinement. **Now supports Windows, macOS, and Linux.**
 
 ### Core Components
 - **PushToTalkApp** (`src/push_to_talk.py`): Main orchestrator with configuration management and dynamic updates
 - **ConfigurationGUI** (`src/config_gui.py`): Persistent GUI interface with real-time status management
 - **AudioRecorder** (`src/audio_recorder.py`): PyAudio-based recording with configurable audio settings
 - **Transcriber** (`src/transcription.py`): OpenAI Whisper integration for speech-to-text
 - **TextRefiner** (`src/text_refiner.py`): GPT-based text improvement and correction
-- **TextInserter** (`src/text_inserter.py`): Windows text insertion via clipboard or sendkeys
-- **HotkeyService** (`src/hotkey_service.py`): Global hotkey detection requiring admin privileges
+- **TextInserter** (`src/text_inserter.py`): Cross-platform text insertion via pyautogui and pyperclip
+- **HotkeyService** (`src/hotkey_service.py`): Global hotkey detection (Windows admin privileges required)
+- **Utils** (`src/utils.py`): Cross-platform audio feedback using pygame and numpy
 
 ### Entry Points
 - **main_gui.py**: GUI application with persistent configuration interface
@@ -62,7 +69,7 @@ This is a Windows push-to-talk speech-to-text application with dual interfaces (
 2. User releases hotkey → Recording stops, audio saved to temp file
 3. Audio sent to OpenAI Whisper for transcription
 4. Raw text optionally refined using GPT models
-5. Refined text inserted into active window via Windows API
+5. Refined text inserted into active window via cross-platform APIs
 
 ### Configuration System
 - **File-based**: `push_to_talk_config.json` for persistent settings
@@ -72,19 +79,20 @@ This is a Windows push-to-talk speech-to-text application with dual interfaces (
 
 ## Key Technical Details
 
-### Windows-Specific Requirements
-- **Administrator privileges**: Required for global hotkey detection
-- **pywin32**: Used for Windows text insertion and audio feedback
-- **Audio permissions**: Microphone access required for recording
+### Cross-Platform Support
+- **Text insertion**: Uses pyautogui and pyperclip for cross-platform compatibility
+- **Audio feedback**: Uses pygame mixer with numpy tone generation
+- **Hotkey detection**: Uses keyboard library (Windows admin privileges still required)
+- **GUI**: Tkinter for cross-platform interface
 
 ### Audio Processing
 - **Sample rates**: 8kHz-44.1kHz supported, 16kHz recommended for Whisper
 - **Formats**: WAV files for temporary audio storage
-- **Feedback**: Optional audio cues using Windows winsound module
+- **Feedback**: Cross-platform audio cues using pygame with pure tone generation (880Hz start, 660Hz stop)
 
 ### Text Insertion Methods
-- **sendkeys**: Character-by-character simulation, better for special characters
-- **clipboard**: Faster method using Ctrl+V, may not work in all applications
+- **sendkeys**: Character-by-character simulation using pyautogui, better for special characters
+- **clipboard**: Faster method using pyperclip + pyautogui Ctrl+V, may not work in all applications
 
 ### Configuration Parameters
 Key settings in `PushToTalkConfig` class:
@@ -94,23 +102,32 @@ Key settings in `PushToTalkConfig` class:
 - `hotkey`/`toggle_hotkey`: Customizable key combinations
 - `insertion_method`: "sendkeys" or "clipboard"
 - `enable_text_refinement`: Toggle GPT text improvement
+- `enable_audio_feedback`: Toggle audio feedback sounds
 
 ## Development Workflow
 
 ### Making Changes
 1. Test changes with both GUI and console applications
-2. Ensure admin privileges are handled correctly for hotkey functionality
+2. Ensure proper cross-platform compatibility for new features
 3. Validate OpenAI API integration with proper error handling
-4. Test text insertion in various Windows applications
+4. Test text insertion in various applications across platforms
+5. Pre-commit hooks automatically format and lint code
 
 ### Building for Distribution
-1. Use `build.bat` for standard GUI executable
+1. Use `build.bat` for Windows GUI executable
 2. Modify `push_to_talk.spec` for console builds or customization
-3. Test executable on clean Windows system without Python installed
+3. Test executable on clean systems without Python installed
 4. Consider antivirus false positives with PyInstaller executables
+5. Update hiddenimports in `.spec` file when adding new dependencies
 
 ### Configuration Testing
 - Use GUI "Test Configuration" button for API validation
 - Test hotkey combinations don't conflict with system shortcuts
-- Verify text insertion works in target applications (text editors, browsers, etc.)
+- Verify text insertion works in target applications across platforms
 - Check audio settings produce clear recordings for transcription accuracy
+- Test audio feedback works across different audio systems
+
+### Dependencies Management
+- Core dependencies: keyboard, numpy, openai, pyaudio, pyautogui, pyperclip, pygame
+- Dev dependencies: pre-commit, pyinstaller, python-dotenv, ruff
+- PyInstaller spec may need updates for cross-platform builds (currently Windows-focused)
@@ -1,32 +1,31 @@
 # PushToTalk - AI Refined Speech-to-Text Dictation
 
-A Python application that provides push-to-talk speech-to-text functionality with AI speech to text transcription, smart text refinement, and automatic text insertion into the active window on Windows. **Now features a persistent GUI configuration interface with real-time status management and easy application control.**
+A Python application that provides push-to-talk speech-to-text functionality with AI speech to text transcription, smart text refinement, and automatic text insertion into the active window on Windows, macOS, and Linux. **Now features a persistent GUI configuration interface with real-time status management and easy application control.**
 
 ## Features
 
 - **🎯 GUI Interface**: Integrated configuration control and application status monitoring in one window
 - **🎤 Push-to-Talk Recording**: Hold a customizable hotkey to record audio
 - **🤖 Speech-to-Text**: Uses OpenAI Whisper for accurate transcription
+- **⚡ Smart Audio Processing**: Automatic silence removal and pitch-preserving speed adjustment for faster transcription
 - **✨ Text Refinement**: Improves transcription quality using Refinement Models
 - **📝 Auto Text Insertion**: Automatically inserts refined text into the active window
 - **🔊 Audio Feedback**: Optional audio cues for recording start/stop
-- **📋 Multiple Insertion Methods**: Support for clipboard and sendkeys insertion
+- **📋 Multiple Insertion Methods**: Support for `clipboard` and `sendkeys` insertion
 
 ## Roadmap
 
 - [x] GUI for configuration
+- [x] Full cross-platform support (Windows, macOS, Linux)
 - [ ] Customizable glossary for transcription refinement
-- [ ] Streaming transcription with ongoing audio
 - [ ] Local Whisper model support
-- [ ] Cross-platform support (MacOS, Linux)
+- [ ] Streaming transcription with ongoing audio (Optional)
 
 ## Requirements
 
-- Windows OS (10/11)
 - [uv](https://docs.astral.sh/uv/) (Python package manager)
 - OpenAI API key (https://platform.openai.com/docs/api-reference/introduction)
 - Microphone access (for recording)
-- Administrator privileges (for global hotkey detection)
 
 ## Quick Start (GUI Application)
 
@@ -92,6 +91,7 @@ The application features a comprehensive, persistent configuration GUI with orga
 - **Sample Rate**: 8kHz to 44.1kHz options (16kHz recommended)
 - **Chunk Size**: Buffer size configuration
 - **Channels**: Mono/stereo recording options
+- **Audio Processing**: Smart silence removal and pitch-preserving speed adjustment
 - **Helpful Recommendations**: Built-in guidance for optimal settings
 
 ### ⌨️ Hotkey Configuration
@@ -164,7 +164,12 @@ The application creates a `push_to_talk_config.json` file. Example configuration
   "insertion_delay": 0.005,
   "enable_text_refinement": true,
   "enable_logging": true,
-  "enable_audio_feedback": true
+  "enable_audio_feedback": true,
+  "enable_audio_processing": true,
+  "debug_mode": false,
+  "silence_threshold": -16.0,
+  "min_silence_duration": 400.0,
+  "speed_factor": 1.5
 }
 ```
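Because this change adds five keys to `push_to_talk_config.json`, config files written by earlier versions will not contain them. A minimal sketch of one way to tolerate that, assuming the config is read as plain JSON: `load_config` and `NEW_KEY_DEFAULTS` are hypothetical names for illustration, not taken from the PR, and the default values are copied from the example above.

```python
import json
from pathlib import Path

# Defaults for the keys added in this change; values mirror the example config.
NEW_KEY_DEFAULTS = {
    "enable_audio_processing": True,
    "debug_mode": False,
    "silence_threshold": -16.0,
    "min_silence_duration": 400.0,
    "speed_factor": 1.5,
}


def load_config(path):
    """Read a JSON config and fill in any missing new keys with defaults."""
    path = Path(path)
    stored = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Values found in the file win; defaults only fill the gaps.
    return {**NEW_KEY_DEFAULTS, **stored}
```

With this shape, an older file that only sets `enable_audio_feedback` still yields a complete settings dictionary, and a user override such as `"speed_factor": 2.0` is preserved.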
 
@@ -185,6 +190,11 @@ The application creates a `push_to_talk_config.json` file. Example configuration
 | `enable_text_refinement` | boolean | `true` | Whether to use GPT to refine transcribed text. Disable for faster processing without refinement. |
 | `enable_logging` | boolean | `true` | Whether to enable detailed logging to `push_to_talk.log` file and console. |
 | `enable_audio_feedback` | boolean | `true` | Whether to play sophisticated audio cues when starting/stopping recording. Provides immediate feedback for hotkey interactions. |
+| `enable_audio_processing` | boolean | `true` | Whether to enable smart audio processing (silence removal and speed adjustment) for faster transcription. |
+| `debug_mode` | boolean | `false` | Whether to enable debug mode. If enabled, processed audio files will be saved to the current directory. |
+| `silence_threshold` | float | `-16.0` | dBFS threshold for silence detection. Audio quieter than this level is treated as silence, so higher values (closer to 0) remove more audio, including quiet speech. |
+| `min_silence_duration` | float | `400.0` | Minimum duration of silence in milliseconds required to split audio segments. |
+| `speed_factor` | float | `1.5` | Speed adjustment factor. `1.5` plays the audio 1.5x faster while preserving pitch. |
 
 #### Audio Quality Settings
 
@@ -202,6 +212,23 @@ The application creates a `push_to_talk_config.json` file. Example configuration
   - `1` - Mono recording (recommended for speech)
   - `2` - Stereo recording (unnecessary for speech-to-text)
 
+#### Audio Processing Settings
+
+- **silence_threshold**:
+  - `-16.0` (dBFS) - Recommended balance between noise removal and speech preservation
+  - `-10.0` - More aggressive silence removal (may cut quiet speech)
+  - `-30.0` - Less aggressive (keeps more background noise)
+
+- **min_silence_duration**:
+  - `400.0` ms - Recommended for natural speech patterns
+  - `200.0` ms - More aggressive silence removal (faster processing)
+  - `800.0` ms - Conservative (preserves natural pauses)
+
+- **speed_factor**:
+  - `1.5` - Recommended 1.5x speedup with pitch preservation
+  - `1.0` - No speed adjustment (original timing)
+  - `2.0` - 2x speedup (more aggressive, may affect quality)
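To make the interaction between `silence_threshold` and `min_silence_duration` concrete, here is a standard-library sketch of threshold-based silence removal on raw 16-bit samples. It is not the `AudioProcessor` implementation (the PR uses pydub, whose behavior may differ in details such as padding kept around speech); `chunk_dbfs`, `drop_long_silences`, and the 10 ms chunk size are assumptions made for illustration.

```python
import math


def chunk_dbfs(samples, full_scale=32768):
    """RMS level of a chunk of 16-bit samples, in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / full_scale) if rms > 0 else float("-inf")


def drop_long_silences(samples, sample_rate, silence_threshold=-16.0,
                       min_silence_duration=400.0, chunk_ms=10):
    """Remove runs of quiet chunks lasting at least min_silence_duration ms."""
    chunk_len = max(1, sample_rate * chunk_ms // 1000)
    chunks = [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]
    min_chunks = math.ceil(min_silence_duration / chunk_ms)

    kept, quiet_run = [], []
    for chunk in chunks:
        if chunk_dbfs(chunk) < silence_threshold:
            quiet_run.append(chunk)      # candidate silence, decide later
            continue
        if len(quiet_run) < min_chunks:
            kept.extend(quiet_run)       # pause too short to count: keep it
        quiet_run = []
        kept.append(chunk)
    if len(quiet_run) < min_chunks:
        kept.extend(quiet_run)           # same rule for a trailing quiet run
    return [s for chunk in kept for s in chunk]
```

In this sketch a signal at roughly -21 dBFS is removed with the default `-16.0` threshold but kept with `-30.0`, which matches the "more aggressive" and "less aggressive" descriptions in the list above.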
+
 ### Hotkey Options
 
 You can configure different hotkey combinations for both modes:
@@ -222,8 +249,8 @@ Both hotkeys support any combination from the `keyboard` library.
 
 ### Text Insertion Methods
 
-- **sendkeys** (default): Simulates individual keystrokes, better for special characters
-- **clipboard**: Faster and more reliable, uses Ctrl+V
+- **sendkeys** (default): Simulates individual keystrokes using pyautogui, better for special characters
+- **clipboard**: Faster and more reliable, uses pyperclip and pyautogui for Ctrl+V
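The two methods can be sketched as follows. This is a hedged illustration rather than the code in `src/text_inserter.py`: `paste_keys` and `insert_text` are hypothetical names, and the macOS branch is an assumption based on that platform's paste shortcut being Cmd+V rather than Ctrl+V. The calls used (`pyperclip.copy`, `pyautogui.hotkey`, `pyautogui.write`) are existing functions of those libraries.

```python
import sys


def paste_keys(platform=None):
    """Return the key names for the paste shortcut on the given platform."""
    platform = sys.platform if platform is None else platform
    return ("command", "v") if platform == "darwin" else ("ctrl", "v")


def insert_text(text, method="sendkeys", delay=0.005):
    """Insert text into the focused window using one of the two methods."""
    # Imported lazily so paste_keys() stays usable without a display.
    import pyautogui
    import pyperclip

    if method == "clipboard":
        pyperclip.copy(text)              # put the text on the clipboard
        pyautogui.hotkey(*paste_keys())   # then trigger the paste shortcut
    elif method == "sendkeys":
        pyautogui.write(text, interval=delay)  # type one character at a time
    else:
        raise ValueError(f"unknown insertion method: {method!r}")
```

The `delay` default mirrors the `insertion_delay` value in the example configuration.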
 
 ### Audio Feedback
 
@@ -233,7 +260,7 @@ The application includes clean and simple audio feedback:
 - **Recording Stop**: A lower confirmation beep (660 Hz) that confirms recording completion
 - **Non-Blocking**: Audio playback runs in separate threads to avoid interfering with recording or transcription
 - **Configurable**: Can be toggled on/off via GUI or configuration JSON file
-- **Minimal Dependencies**: Uses Windows' built-in `winsound` module - no additional packages required
+- **Cross-Platform**: Uses `pygame` and `numpy` for tone generation - works on Windows, macOS, and Linux
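The tone generation itself is small. The sketch below produces the two beeps as signed 16-bit mono PCM using only the standard library, as a stand-in for the numpy-based version the PR describes; `sine_tone`, the 150 ms duration, and the 0.3 volume are assumptions for illustration, while the 880 Hz and 660 Hz frequencies come from the text above.

```python
import math
from array import array


def sine_tone(frequency_hz, duration_ms=150, sample_rate=44100, volume=0.3):
    """Return signed 16-bit mono PCM samples for a pure sine tone."""
    n_samples = sample_rate * duration_ms // 1000
    peak = int(32767 * volume)  # scale below full range to avoid a harsh beep
    return array("h", (
        int(peak * math.sin(2 * math.pi * frequency_hz * i / sample_rate))
        for i in range(n_samples)
    ))


start_tone = sine_tone(880)  # higher beep: recording started
stop_tone = sine_tone(660)   # lower beep: recording stopped
```

With pygame, samples in this format could presumably be played through `pygame.mixer.Sound(buffer=start_tone.tobytes())` once the mixer is initialized for 16-bit mono at the same sample rate; that wiring is an assumption and is not shown in this diff.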
 
 ## Architecture
 
@@ -246,7 +273,8 @@ flowchart TB
     %% Main Flow
     PushToTalkApp -->|"Initialize"| HotkeyService
     HotkeyService -->|"Start/Stop Recording"| AudioRecorder
-    AudioRecorder -->|"Audio File"| Transcriber
+    AudioRecorder -->|"Audio File"| AudioProcessor
+    AudioProcessor -->|"Processed Audio"| Transcriber
     Transcriber -->|"AI Transcription"| TextRefiner
     TextRefiner -->|"AI Refinement"| TextInserter
 ```
@@ -258,9 +286,10 @@ The application consists of several modular components:
 - **ConfigurationGUI** (`src/config_gui.py`): User-friendly GUI for settings management
 - **MainGUI** (`main_gui.py`): Entry point with welcome flow and startup management
 - **AudioRecorder** (`src/audio_recorder.py`): Handles audio recording using PyAudio
+- **AudioProcessor** (`src/audio_processor.py`): Smart audio processing with silence removal and pitch-preserving speed adjustment using pydub and psola
 - **Transcriber** (`src/transcription.py`): Converts speech to text using OpenAI Whisper
 - **TextRefiner** (`src/text_refiner.py`): Improves transcription using Refinement Models
-- **TextInserter** (`src/text_inserter.py`): Inserts text into active windows using pywin32
+- **TextInserter** (`src/text_inserter.py`): Inserts text into active windows using pyautogui and pyperclip
 - **HotkeyService** (`src/hotkey_service.py`): Manages global hotkey detection
 - **PushToTalkApp** (`src/push_to_talk.py`): Main application orchestrator with dynamic configuration updates
 
@@ -278,17 +307,24 @@ The application consists of several modular components:
 
 1. User presses hotkey → Audio recording starts
 2. User releases hotkey → Recording stops
-3. Audio file is sent to OpenAI Whisper for transcription
-4. Raw transcription is refined using Refinement Models (if enabled)
-5. Refined text is inserted into the active window
+3. Audio file is processed (silence removal and speed adjustment for faster transcription)
+4. Processed audio is sent to OpenAI Whisper for transcription
+5. Raw transcription is refined using Refinement Models (if enabled)
+6. Refined text is inserted into the active window
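Steps 3 to 6 form a linear pipeline in which processing and refinement are optional. A minimal sketch of that control flow, assuming each stage is a plain callable: `run_pipeline` and its parameter names are hypothetical, not the `PushToTalkApp` API.

```python
def run_pipeline(audio_path, transcribe, insert, process=None, refine=None):
    """Run one dictation cycle; `process` and `refine` are optional stages."""
    if process is not None:
        audio_path = process(audio_path)   # step 3: silence removal, speed-up
    text = transcribe(audio_path)          # step 4: speech-to-text
    if refine is not None and text:
        text = refine(text)                # step 5: optional refinement
    if text:
        insert(text)                       # step 6: type or paste the result
    return text
```

Passing `process=None` corresponds to `enable_audio_processing: false`, and `refine=None` to `enable_text_refinement: false`.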
 
 ## Dependencies
 
 - **tkinter**: GUI interface (built into Python)
 - **keyboard**: Global hotkey detection
+- **numpy**: Audio tone generation for feedback sounds
 - **pyaudio**: Audio recording
+- **pydub**: Smart silence detection and audio manipulation
+- **soundfile**: High-quality audio I/O
+- **psola**: Pitch-preserving time-scale modification
 - **openai**: Speech-to-text and text refinement
-- **pywin32**: Windows-specific text insertion and audio feedback (winsound)
+- **pyautogui**: Cross-platform text insertion and window management
+- **pyperclip**: Cross-platform clipboard operations
+- **pygame**: Cross-platform audio feedback
 - **python-dotenv**: Environment variable management
 
 ## Troubleshooting
@@ -317,9 +353,9 @@ The application consists of several modular components:
 
 ### Common Issues
 
-1. **"No module named 'pywin32'"** (Development):
+1. **"No module named 'pyautogui' or 'pyperclip'"** (Development):
    ```bash
-   uv add pywin32
+   uv add pyautogui pyperclip
    ```
 
 2. **"Could not find PyAudio"** (Development):
@@ -426,10 +462,12 @@ if result == "close":
 ## Performance Tips
 
 1. **Optimize audio settings**: Lower sample rates (8000-16000 Hz) for faster processing
-2. **Disable text refinement**: For faster transcription without GPT processing
-3. **Use clipboard method**: Generally faster than sendkeys for text insertion
-4. **Short recordings**: Keep recordings under 30 seconds for optimal performance
-5. **Monitor via GUI**: Use the status indicators to verify application is running efficiently
+2. **Enable audio processing**: Smart silence removal and speed adjustment can significantly reduce transcription time
+3. **Adjust silence threshold**: Tune `silence_threshold` from its -16 dBFS default for your environment (raise it toward 0 in noisy environments)
+4. **Disable text refinement**: For faster transcription without GPT processing
+5. **Use clipboard method**: Generally faster than sendkeys for text insertion
+6. **Short recordings**: Keep recordings under 30 seconds for optimal performance
+7. **Monitor via GUI**: Use the status indicators to verify application is running efficiently
 
 ## Security Considerations
 
 
@@ -3,13 +3,18 @@ name = "push-to-talk"
 version = "0.1.0"
 description = "A push-to-talk speech-to-text application using OpenAI API for real-time transcription and support active window text insertion."
 readme = "README.md"
-requires-python = ">=3.13"
+requires-python = ">=3.9"
 dependencies = [
     "keyboard>=0.13.5",
+    "numpy>=1.24.0",
     "openai>=1.97.1",
     "pyaudio>=0.2.14",
     "pyautogui>=0.9.54",
     "pyperclip>=1.9.0",
+    "pygame>=2.5.0",
+    "soundfile>=0.13.1",
+    "psola>=0.0.1",
+    "pydub[scipy]>=0.25.1",
 ]
 
 [dependency-groups]