
Commit 53718cd

added speechsplit pyworld
1 parent b26cb6d commit 53718cd

File tree

13 files changed (+1406, -16 lines)


README-pypi.rst

Lines changed: 5 additions & 1 deletion
@@ -36,7 +36,8 @@ Features
 - **Speaker overlap**, detect overlap speakers using Finetuned Speaker Vector.
 - **Speaker Vector**, calculate similarity between speakers using Pretrained Speaker Vector.
 - **Speech Enhancement**, enhance voice activities using Waveform UNET.
-- **Speech-to-Text**, End-to-End Speech to Text for Malay and Mixed (Malay and Singlish) using RNN-Transducer.
+- **SpeechSplit Conversion**, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PyWorld and PySPTK.
+- **Speech-to-Text**, End-to-End Speech to Text for Malay and Mixed (Malay and Singlish) using RNN-Transducer and Wav2Vec2 CTC.
 - **Super Resolution**, Super Resolution 4x for Waveform.
 - **Text-to-Speech**, Text to Speech for Malay and Singlish using Tacotron2 and FastSpeech2.
 - **Vocoder**, convert Mel to Waveform using MelGAN, Multiband MelGAN and Universal MelGAN Vocoder.
@@ -71,6 +72,9 @@ Malaya-Speech also released pretrained models, simply check at `malaya-speech/pr
 - **FastVC**, Faster and Accurate Voice Conversion using Transformer, no paper produced.
 - **FastSep**, Faster and Accurate Speech Separation using Transformer, no paper produced.
 - **wav2vec 2.0**, A Framework for Self-Supervised Learning of Speech Representations, https://arxiv.org/abs/2006.11477
+- **FastSpeechSplit**, Unsupervised Speech Decomposition Via Triple Information Bottleneck using Transformer, no paper produced.
+- **Sepformer**, Attention is All You Need in Speech Separation, https://arxiv.org/abs/2010.13154
+- **FastSpeechSplit**, Faster and Accurate Speech Split Conversion using Transformer, no paper produced.

 References
 -----------

README.rst

Lines changed: 2 additions & 1 deletion
@@ -55,7 +55,7 @@ Features
 - **Speaker overlap**, detect overlap speakers using Finetuned Speaker Vector.
 - **Speaker Vector**, calculate similarity between speakers using Pretrained Speaker Vector.
 - **Speech Enhancement**, enhance voice activities using Waveform UNET.
-- **SpeechSplit Conversion**, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch.
+- **SpeechSplit Conversion**, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PyWorld and PySPTK.
 - **Speech-to-Text**, End-to-End Speech to Text for Malay and Mixed (Malay and Singlish) using RNN-Transducer and Wav2Vec2 CTC.
 - **Super Resolution**, Super Resolution 4x for Waveform.
 - **Text-to-Speech**, Text to Speech for Malay and Singlish using Tacotron2 and FastSpeech2.
@@ -93,6 +93,7 @@ Malaya-Speech also released pretrained models, simply check at `malaya-speech/pr
 - **wav2vec 2.0**, A Framework for Self-Supervised Learning of Speech Representations, https://arxiv.org/abs/2006.11477
 - **FastSpeechSplit**, Unsupervised Speech Decomposition Via Triple Information Bottleneck using Transformer, no paper produced.
 - **Sepformer**, Attention is All You Need in Speech Separation, https://arxiv.org/abs/2010.13154
+- **FastSpeechSplit**, Faster and Accurate Speech Split Conversion using Transformer, no paper produced.

 References
 -----------

docs/Api.rst

Lines changed: 18 additions & 0 deletions
@@ -112,6 +112,18 @@ malaya_speech.model.tf.Split_Mel
 .. autoclass:: malaya_speech.model.tf.Split_Mel()
     :members:

+malaya_speech.model.tf.Wav2Vec2_CTC
+-------------------------------------
+
+.. autoclass:: malaya_speech.model.tf.Wav2Vec2_CTC()
+    :members:
+
+malaya_speech.model.tf.FastSpeechSplit
+---------------------------------------
+
+.. autoclass:: malaya_speech.model.tf.FastSpeechSplit()
+    :members:
+
 malaya_speech.model.webrtc.WebRTC
 ----------------------------------

@@ -304,6 +316,12 @@ malaya_speech.speech_enhancement
 .. automodule:: malaya_speech.speech_enhancement
     :members:

+malaya_speech.speechsplit_conversion
+--------------------------------------
+
+.. automodule:: malaya_speech.speechsplit_conversion
+    :members:
+
 malaya_speech.stack
 -----------------------------------

docs/README.rst

Lines changed: 2 additions & 1 deletion
@@ -55,7 +55,7 @@ Features
 - **Speaker overlap**, detect overlap speakers using Finetuned Speaker Vector.
 - **Speaker Vector**, calculate similarity between speakers using Pretrained Speaker Vector.
 - **Speech Enhancement**, enhance voice activities using Waveform UNET.
-- **SpeechSplit Conversion**, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch.
+- **SpeechSplit Conversion**, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PyWorld and PySPTK.
 - **Speech-to-Text**, End-to-End Speech to Text for Malay and Mixed (Malay and Singlish) using RNN-Transducer and Wav2Vec2 CTC.
 - **Super Resolution**, Super Resolution 4x for Waveform.
 - **Text-to-Speech**, Text to Speech for Malay and Singlish using Tacotron2 and FastSpeech2.
@@ -93,6 +93,7 @@ Malaya-Speech also released pretrained models, simply check at `malaya-speech/pr
 - **wav2vec 2.0**, A Framework for Self-Supervised Learning of Speech Representations, https://arxiv.org/abs/2006.11477
 - **FastSpeechSplit**, Unsupervised Speech Decomposition Via Triple Information Bottleneck using Transformer, no paper produced.
 - **Sepformer**, Attention is All You Need in Speech Separation, https://arxiv.org/abs/2010.13154
+- **FastSpeechSplit**, Faster and Accurate Speech Split Conversion using Transformer, no paper produced.

 References
 -----------

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -60,6 +60,7 @@ Contents:
    :caption: Conversion Module

    load-voice-conversion
+   speechsplit-conversion-pyworld

 .. toctree::
    :maxdepth: 2

docs/load-voice-conversion.ipynb

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### List available Voice Conversion"
+    "### List available Voice Conversion models"
    ]
   },
   {

docs/speechsplit-conversion-pyworld.ipynb

Lines changed: 674 additions & 0 deletions
Large diffs are not rendered by default.

example/speechsplit-conversion-pyworld/speechsplit-conversion-pyworld.ipynb

Lines changed: 674 additions & 0 deletions
Large diffs are not rendered by default.

example/voice-conversion/load-voice-conversion.ipynb

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### List available Voice Conversion"
+    "### List available Voice Conversion models"
    ]
   },
   {

malaya_speech/model/tf.py

Lines changed: 15 additions & 7 deletions
@@ -1415,7 +1415,6 @@ def __init__(
         output_nodes,
         speaker_vector,
         gender_model,
-        magnitude,
         sess,
         model,
         name,
@@ -1424,14 +1423,13 @@ def __init__(
         self._output_nodes = output_nodes
         self._speaker_vector = speaker_vector
         self._gender_model = gender_model
-        self._magnitude = magnitude
         self._sess = sess
         self.__model__ = model
         self.__name__ = name
         self._modes = {'R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'}
         self._freqs = {'female': [100, 600], 'male': [50, 250]}

-    def _get_data(x, sr = 22050, target_sr = 16000):
+    def _get_data(self, x, sr = 22050, target_sr = 16000):
         x_16k = resample(x, sr, target_sr)
         if self._gender_model is not None:
             gender = self._gender_model(x_16k)
@@ -1458,6 +1456,16 @@ def predict(
         ----------
         original_audio: np.array or malaya_speech.model.frame.Frame
         target_audio: np.array or malaya_speech.model.frame.Frame
+        modes: List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
+            R denotes rhythm, F denotes pitch target, U denotes speaker target (vector).
+
+            * ``'R'`` - maintain `original_audio` F and U on `target_audio` R.
+            * ``'F'`` - maintain `original_audio` R and U on `target_audio` F.
+            * ``'U'`` - maintain `original_audio` R and F on `target_audio` U.
+            * ``'RF'`` - maintain `original_audio` U on `target_audio` R and F.
+            * ``'RU'`` - maintain `original_audio` F on `target_audio` R and U.
+            * ``'FU'`` - maintain `original_audio` R on `target_audio` F and U.
+            * ``'RFU'`` - no conversion happened, just do encoder-decoder on `target_audio`

         Returns
         -------
@@ -1475,8 +1483,8 @@ def predict(
         target_audio = (
             input.array if isinstance(target_audio, Frame) else target_audio
         )
-        wav, mel, f0, v = get_speech(original_audio)
-        wav_1, mel_1, f0_1, v_1 = get_speech(target_audio)
+        wav, mel, f0, v = self._get_data(original_audio)
+        wav_1, mel_1, f0_1, v_1 = self._get_data(target_audio)
         mels, mel_lens = padding_sequence_nd(
             [mel, mel_1], dim = 0, return_len = True
         )
@@ -1532,9 +1540,9 @@ def predict(
             x_ = mels[:1]

             r = self._execute(
-                inputs = [uttr_f0_, x_, v_, len(f0s[0])],
+                inputs = [uttr_f0_, x_, [v_], [len(f0s[0])]],
                 input_labels = ['uttr_f0', 'X', 'V', 'len_X'],
-                output_labels = ['f0_target'],
+                output_labels = ['mel_outputs'],
             )
             mel_outputs = r['mel_outputs'][0]
             if 'R' in condition:
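The mode semantics the new docstring enumerates can be restated in a few lines. The helper below is hypothetical (it is not part of the library) and only mirrors the documented `self._modes` behaviour: each letter in a mode names an attribute taken from `target_audio`, while the remaining attributes are kept from `original_audio`:

```python
# Hypothetical helper mirroring the documented FastSpeechSplit.predict modes:
# R = rhythm, F = pitch, U = speaker vector.
MODES = {'R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'}
ATTRS = {'R': 'rhythm', 'F': 'pitch', 'U': 'speaker vector'}

def mode_plan(mode):
    """Report which attributes come from the target vs the original audio."""
    if mode not in MODES:
        raise ValueError(f'mode must be one of {sorted(MODES)}')
    return {
        'from_target': [ATTRS[c] for c in 'RFU' if c in mode],
        'from_original': [ATTRS[c] for c in 'RFU' if c not in mode],
    }

print(mode_plan('RF'))
# {'from_target': ['rhythm', 'pitch'], 'from_original': ['speaker vector']}
```

Note that `'RFU'` leaves nothing from the original, matching the docstring's "no conversion happened, just do encoder-decoder on `target_audio`".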

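One of the tf.py fixes wraps `v_` and `len(f0s[0])` in lists before feeding the session. The reason, sketched below with plain numpy (the shapes are illustrative; the real placeholder dims come from the frozen graph), is that the graph's inputs carry a leading batch axis, so a single vector or scalar must be promoted to a batch of one:

```python
import numpy as np

v_ = np.zeros(512)           # one speaker vector, shape (512,)
feed = np.asarray([v_])      # wrapping in a list, as [v_] does, adds the batch axis
assert v_.shape == (512,)
assert feed.shape == (1, 512)  # rank 2, matching a [batch, dim] placeholder

length = 140                 # a bare scalar sequence length
assert np.asarray([length]).shape == (1,)  # likewise for a [batch] placeholder
```

Without the wrapping, the feed is rank 1 (or rank 0 for the length) and the session raises a shape mismatch.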