This is the official repository for our paper *Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams*, presented at the Speech Synthesis Workshop 2025 (SSW13) in Leeuwarden, the Netherlands.
Create a conda environment via:

```shell
conda env create -f environment.yaml
```
- Prepare the data in Kaldi's `wav.scp` format.
- Use a pre-trained Kaldi HMM-DNN model to extract PPGs from the speech. The Kaldi documentation is helpful for this step.
- Extract speaker embeddings using the Wespeaker CLI. Specifically, run
  ```shell
  wespeaker --task embedding_kaldi --wav_scp YOUR_WAV.scp --output_file /path/to/embedding
  ```
  (see here).
- Extract pitch and periodicity using `ppg_tts/feature_extract/penn_log_f0_extract.py`.
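For reference, a Kaldi `wav.scp` file maps each utterance ID to its audio path, one entry per line; the IDs and paths below are placeholders:

```
utt0001 /data/finnish/wavs/utt0001.wav
utt0002 /data/finnish/wavs/utt0002.wav
```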
A pretrained checkpoint is available here.
A pretrained HiFi-GAN generator checkpoint is available here. Please put the HiFi-GAN checkpoint under `vocoder/hifigan/ckpt`.
```shell
python -m ppg_tts.main fit -c config/fit_ppgmatcha.yaml -c config/data_template.yaml
```
You can override the arguments via the CLI; see the PyTorch Lightning docs.
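With PyTorch Lightning's `LightningCLI`, config keys can be overridden either with dotted command-line flags (e.g. `--trainer.max_epochs 200`) or via an extra YAML file passed with another `-c`. The fragment below is a hypothetical sketch; the actual keys depend on the configs in this repo:

```yaml
# Hypothetical override file, assuming the standard
# PyTorch Lightning trainer config layout.
trainer:
  max_epochs: 200
  devices: 1
```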
- Copy synthesis: see `ppg_tts/evaluation/evaluate_copy_synthesis.sh`
- Speaker switching: see `ppg_tts/evaluation/evaluate_switch_speaker.sh`
- Editing: see `ppg_tts/evaluation/evaluate_editing/evaluate_editing.sh`
We currently do not provide a dedicated script for running inference with a pre-trained model with minimal effort, but inference can be done by executing specific stages of the evaluation scripts. A dedicated inference script will be added in the future.
For inference, prepare the data the same way as the training data (see here).
Follow the comments in `ppg_tts/evaluation/evaluate_switch_speaker.sh` and set `start=0` and `end=1` to run TTS inference.
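The `start`/`end` variables gate which stages of the script run, following the usual Kaldi-style stage pattern. The sketch below (with placeholder stage bodies, not the actual script contents) shows the idiom: a stage executes only when `start <= stage <= end`.

```shell
#!/usr/bin/env bash
# Kaldi-style stage gating: a stage runs only if start <= stage <= end.
start=0
end=1

stage=0
if [ "$start" -le "$stage" ] && [ "$end" -ge "$stage" ]; then
  echo "Stage 0: prepare features"    # placeholder body
fi

stage=1
if [ "$start" -le "$stage" ] && [ "$end" -ge "$stage" ]; then
  echo "Stage 1: run TTS inference"   # placeholder body
fi

stage=2
if [ "$start" -le "$stage" ] && [ "$end" -ge "$stage" ]; then
  echo "Stage 2: compute metrics"     # skipped when end=1
fi
```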
Follow the comments in `ppg_tts/evaluation/evaluate_editing/evaluate_editing.sh` and set `start=0` and `end=1` to run editing inference.
Coming soon.
Our work is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.