Releases: mdangschat/ctc-asr
Version 0.1.0
- Excluded corpus creation from this project and moved it into its own repository (speech-corpus-dl)
- Restructured the bucket boundary calculations
- Updated documentation
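One common way to derive bucket boundaries is from quantiles of the sequence-length distribution, so each bucket holds roughly the same number of samples. A minimal sketch of that idea; the function name and quantile scheme are illustrative assumptions, not the project's actual calculation:

```python
def bucket_boundaries(lengths, num_buckets):
    """Derive bucket boundaries from quantiles of the length distribution.

    Illustrative only: the project's actual boundary calculation may differ.
    """
    ordered = sorted(lengths)
    boundaries = []
    for i in range(1, num_buckets):
        # Pick the length at the i-th quantile as a boundary, so each
        # bucket covers roughly the same number of samples.
        idx = (i * len(ordered)) // num_buckets
        boundary = ordered[idx]
        if boundary not in boundaries:
            boundaries.append(boundary)
    return boundaries

# Example: 8 sequence lengths split into 4 buckets.
print(bucket_boundaries([3, 5, 7, 9, 11, 13, 15, 17], 4))  # [7, 11, 15]
```

Quantile-based boundaries keep buckets balanced even when the length distribution is skewed, which fixed-width boundaries would not.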
Version 0.0.7
Update to TensorFlow version 1.12.0+, using the `tf.data` API.
- The model now uses TensorFlow's current `tf.data` API.
- The model now uses an Estimator.
- Corpus
  - The corpus metadata is now stored in CSV files.
  - The LibriSpeech loader can now use multiple threads.
  - Improved corpus generation performance.
- Fixed the GPU logging hook.
- Major code refactoring.
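CSV-based corpus metadata can be read with Python's standard `csv` module. A minimal sketch, assuming hypothetical `path` and `label` columns; the project's actual column layout may differ:

```python
import csv
import io

def load_corpus_metadata(csv_file):
    """Read corpus entries (audio path, transcription) from a CSV file.

    The `path`/`label` column names are assumptions for illustration.
    """
    reader = csv.DictReader(csv_file)
    return [(row['path'], row['label']) for row in reader]

# Example with an in-memory CSV instead of a real file on disk.
data = io.StringIO("path,label\ntrain/001.wav,hello world\ntrain/002.wav,good morning\n")
for path, label in load_corpus_metadata(data):
    print(path, label)
```

Plain CSV files make the corpus metadata easy to inspect, diff, and regenerate without custom tooling.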
Version 0.0.6
- Added a downloader script that downloads and assembles the training corpus used.
- Unified `python/model.py` for the different DS1 and DS2 architectures.
Version 0.0.5
First Batch of Extended Goals
- Removed overlong samples
- Extended plotting (pyplot without display)
- Added new dataset: Tatoeba
- Rewrote large parts of the dataset-loading functionality
- Added a random seed
- Unified relative paths based on `params.BASE_PATH`
- Bug fixes
Version 0.0.4
Performance Optimization and Evaluation
- After every epoch, a validation run is performed on the `dev` dataset. #68
- Fixed a SortaGrad bug when switching away from the first epoch. #108, #121
- Increased the number of buckets to reduce the amount of padding used. #106
- Storing more checkpoints; also spaced out the steps at which checkpoints are written. #122
- Combined the [DS1] style model into the modified [DS2] style one. Switching between them can be done via the `params.py` options. #119, #110
- Added threading to most `/loader/` functions. #112
- Optimized loader methods and datasets further. #112, #99, #113
- See Milestone v0.0.4 for further information.
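Bucketing groups sequences of similar length so that each batch only pads up to its bucket's longest member; more buckets means tighter grouping and less wasted padding. A rough sketch of the bucket-assignment step (not the project's implementation):

```python
import bisect

def assign_bucket(length, boundaries):
    """Return the index of the bucket a sequence of this length falls into.

    Boundaries are sorted upper bounds; a length greater than the last
    boundary goes into the final overflow bucket.
    """
    return bisect.bisect_left(boundaries, length)

boundaries = [100, 200, 400]
# Similar-length sequences land in the same bucket, so the padding
# needed within any one batch stays small.
print([assign_bucket(n, boundaries) for n in [50, 150, 199, 350, 900]])  # [0, 1, 1, 2, 3]
```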
Version 0.0.3
Deep Speech 2 Inspired Features
- Added SortaGrad ordering of training data for the first epoch. [Deep Speech 2]
- Replaced the first 3 dense layers with 2D convolutions over time and frequency. [DS2]
- Added support for cuDNN RNN cell usage.
- Datasets
  - Added Mozilla Common Voice data.
  - Removed audio files whose labels were only 1 character long.
  - Removed files from TEDLIUM that had labels shorter than 5 words.
  - Removed audio files with feature vectors longer than 1750.
- Added documentation about tensor shapes back in.
- Changed the input features to an 80 element logarithmic filter bank, with an option to switch to 80 element MFCC features.
- See Milestone v0.0.3 for further information.
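SortaGrad, as described in the Deep Speech 2 paper, presents training samples sorted shortest-first in the first epoch (so early batches are short and easy) and shuffles them in later epochs. A schematic sketch of that ordering logic; the function and seed handling are illustrative, not the project's code:

```python
import random

def epoch_order(samples, epoch, seed=42):
    """Return the sample order for a given epoch, SortaGrad style.

    Epoch 0: sorted shortest-first (curriculum warm-up).
    Later epochs: deterministic random shuffle per epoch.
    """
    if epoch == 0:
        return sorted(samples, key=len)
    rng = random.Random(seed + epoch)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return shuffled

samples = ["a long utterance here", "hi", "medium one"]
print(epoch_order(samples, epoch=0))  # shortest first
```

The bug fixed in v0.0.4 (#108, #121) concerned exactly this transition from the sorted first epoch to the shuffled later epochs.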
Version 0.0.2
Improvements & Preprocessing
- Added Dropout functionality.
- Evaluation can now be run on the `validation` or `test` dataset.
- Preprocessing and input optimization:
  - Using `python_speech_features` to load audio, instead of the old `librosa`, which had problems running in parallel.
  - Added three new feature normalization methods.
  - Skipped every 2nd input feature frame (on the time axis).
  - Audio files can now be normalized by power level.
- Added support for Baidu's WarpCTC, which should be faster.
  - WarpCTC usage is bugged in evaluation runs.
- Updated the project `README.md`.
- Updated `generate_txt.py` for the new datasets (TEDLIUMv2, LibriSpeech, TIMIT).
- For more details, please refer to Milestone v0.0.2.
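Skipping every 2nd feature frame halves the time dimension the network has to process, and amounts to a simple striding operation along the time axis. A minimal sketch, using a plain list of frames as a stand-in for the actual feature matrices:

```python
def skip_frames(features, stride=2):
    """Keep every `stride`-th frame along the time axis.

    For stride=2 this halves the sequence length fed to the RNN,
    trading some temporal resolution for speed.
    """
    return features[::stride]

frames = [[0.1], [0.2], [0.3], [0.4], [0.5]]
print(skip_frames(frames))  # keeps frames 0, 2 and 4
```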
Version 0.0.1
First Prototype
DeepSpeech (v1)
- Trains on the LibriSpeech ASR Corpus.
- Uses GRUCells in the bidirectional layer.
- Layout: 3 dense layers, 1 bidirectional RNN layer, 1 dense layer, 1 dense/logits layer.
- Evaluation isn't polished.
- No single input inference implemented.
- See Milestone v0.0.1 for further information.