mesolitica/DistilCodec

DistilCodec

DistilCodec: A Single Codebook Audio Codec For Universal Audio

Paper | HuggingFace Model | Code

🔥 News

  • 2025.05.26: We released the DistilCodec-v1.0 checkpoint on Hugging Face.
  • 2025.05.26: The paper is available on arXiv.
  • 2025.05.23: We submitted the paper to arXiv.

Introduction of DistilCodec

The Joint Laboratory of International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., and Shenzhen Yijiayiban Information Technology Co., Ltd., has launched DistilCodec, a single-codebook Neural Audio Codec (NAC) with 32,768 codes trained on universal audio. We also trained a TTS system based on DistilCodec, called UniTTS. To better leverage DistilCodec's universal audio reconstruction capability, UniTTS incorporates a universal audio autoregressive task in ALM-Pretrain. For details, please refer to our paper.

The network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to the one proposed in SoundStream. The encoder uses a ConvNeXt-V2 structure, the vector quantization module implements the GRFVQ scheme, and the decoder uses a ConvTranspose1d-based architecture similar to HiFiGAN. Training also follows an approach similar to HiFiGAN, with three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSFTFD).

[Figure: The Architecture of DistilCodec]

The distribution of DistilCodec's training data is shown in the table below:

| Data Category | Data Size (hours) |
| --- | --- |
| Chinese Audiobook | 38000 |
| Chinese Common Audio | 20000 |
| English Audiobook | 10000 |
| English Speech | 30000 |
| Music | 2000 |
| Total | 100000 |
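
To make the quantization step of an Encoder-VQ-Decoder codec concrete, here is a minimal sketch of single-codebook vector quantization using plain nearest-neighbour lookup. It is a stand-in for illustration only: the GRFVQ scheme used by DistilCodec is not reproduced here, and the names `quantize`, `dequantize`, `toy_codebook`, and `latent_frames` are hypothetical.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def dequantize(index, codebook):
    """Map a token index back to its codebook vector."""
    return codebook[index]


# Toy 4-entry codebook with 2-dimensional codes (DistilCodec uses 32768 x 3584).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Pretend these are encoder outputs, one vector per audio frame.
latent_frames = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = [quantize(frame, toy_codebook) for frame in latent_frames]
reconstructed = [dequantize(t, toy_codebook) for t in tokens]
print(tokens)  # [0, 3, 2]
```

With a single codebook, each frame becomes exactly one integer token, which is what makes the token stream convenient to model with an autoregressive language model.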

Training Schema

We have developed a novel distillation approach termed DMS (Distilling Multi-Codebook NAC to Single-Codebook NAC), in which the Student NAC inherits its encoder and decoder parameters from the Teacher NAC. Using DMS, we trained DistilCodec on universal audio datasets, obtaining a single codebook of 32,768 codes while keeping codebook utilization close to 100%. DMS also allows the dimension of the distilled Student NAC codebook to be scaled beyond 2048. We therefore set the codebook dimension to 3584, matching the word embedding dimension of Qwen2.5-7B (3584), and used DistilCodec's codebook to initialize the audio embedding layer in UniTTS. Here is the pseudocode of DMS:

Algorithm DMS: Distilling Multi-Codebook NAC to Single-Codebook NAC via parameter inheritance

  1. Initialize the Teacher codec. [Formula: Step 1]

  2. Train the Teacher codec with LSGAN.

  3. Initialize the Student codec. [Formula: Step 3]

  4. Train the Student codec with DLF.

  5. Output: DistilCodec = Student_codec
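
The parameter-inheritance idea behind the Student initialization can be sketched in a few lines. This is a toy illustration under stated assumptions, not the DistilCodec implementation: the dict layout, the function name `init_student_from_teacher`, and the random initialization of the new codebook are all hypothetical. The only part taken from the description above is that the Student reuses the Teacher's encoder and decoder parameters.

```python
import copy
import random


def init_student_from_teacher(teacher, n_codes, dim, seed=0):
    """Build a student state that reuses the teacher's encoder/decoder weights.

    `teacher` is a plain dict standing in for a model state dict
    (hypothetical layout, not the DistilCodec checkpoint format).
    """
    rng = random.Random(seed)
    return {
        # Inherited: deep copies, so training the student leaves the teacher untouched.
        "encoder": copy.deepcopy(teacher["encoder"]),
        "decoder": copy.deepcopy(teacher["decoder"]),
        # Not inherited: one fresh codebook (random init is an assumption).
        "codebook": [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_codes)],
    }


# Toy teacher with tiny weights; real codecs hold tensors, not lists.
teacher = {
    "encoder": {"conv.weight": [0.1, 0.2, 0.3]},
    "decoder": {"conv.weight": [0.4, 0.5]},
    "codebooks": "multi-codebook quantizer state (not copied)",
}
student = init_student_from_teacher(teacher, n_codes=8, dim=4)
print(sorted(student))  # ['codebook', 'decoder', 'encoder']
```

Only the quantizer is replaced; the encoder and decoder start from already-trained weights, so the Student's training can concentrate on the new single codebook.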

The parameter settings for the codebooks of Teacher Codec and Student Codec are as follows, where N-Residual indicates the number of residual layers, N-Group denotes the number of groups, N-Codes/Codebook represents the number of codes per codebook, and Dimension specifies the dimension of the codebook.

| Codec | N-Residual | N-Group | N-Codes/Codebook | Dimension |
| --- | --- | --- | --- | --- |
| Teacher-Codec | 8 | 4 | 1024 | 512 |
| Student-Codec | 1 | 1 | 32768 | 3584 |
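
As a rough illustration of what these settings imply, the sketch below computes how many bits identify one quantized frame under each configuration. It assumes one codebook per residual layer per group, which is an assumption about how N-Residual and N-Group combine rather than something stated above; `bits_per_frame` is a hypothetical helper.

```python
import math


def bits_per_frame(n_residual, n_group, n_codes):
    """Bits needed to index one frame, assuming n_residual * n_group codebooks."""
    n_codebooks = n_residual * n_group
    return n_codebooks * math.log2(n_codes)


teacher_bits = bits_per_frame(n_residual=8, n_group=4, n_codes=1024)
student_bits = bits_per_frame(n_residual=1, n_group=1, n_codes=32768)
print(teacher_bits, student_bits)  # 320.0 15.0
```

Under that assumption the Teacher emits 32 indices per frame while the Student emits a single index drawn from a much larger codebook.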

Evaluation and Demos

The LibriSpeech-Clean-Test row of the table below reports DistilCodec's codebook utilization and perplexity (PPL) on that benchmark. Because DistilCodec can process universal audio, we also built an integrated test set of speech, audiobook, and music samples to evaluate codebook utilization and PPL in universal audio scenarios. As the table shows, DistilCodec achieves codebook utilization close to 100% on both datasets, together with high PPL values (the theoretical maximum PPL equals the codebook size, 32,768). These results support DistilCodec's audio reconstruction capability in universal audio applications.

| Dataset | Codebook Usage (%) ↑ | Codebook PPL ↑ |
| --- | --- | --- |
| LibriSpeech-Clean-Test | 98.2 | 21660.5 |
| Universal-Audio-Test | 99.9 | 26999.0 |
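
For readers unfamiliar with these two metrics, here is a minimal sketch of how they can be computed from a stream of token indices. It assumes PPL is the exponential of the entropy of the empirical code distribution, which is consistent with the stated maximum (PPL equals the codebook size when all codes are used equally often), but the exact evaluation script is not shown above. `codebook_stats` is a hypothetical helper.

```python
import math
from collections import Counter


def codebook_stats(tokens, codebook_size):
    """Return (usage in percent, perplexity) for a list of token indices."""
    counts = Counter(tokens)
    total = len(tokens)
    # Usage: share of codebook entries that appear at least once.
    usage = 100.0 * len(counts) / codebook_size
    # Perplexity: exp of the Shannon entropy (in nats) of the code frequencies.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return usage, math.exp(entropy)


# Every one of 8 codes used equally often: full usage, PPL close to 8.
uniform_usage, uniform_ppl = codebook_stats(list(range(8)) * 10, codebook_size=8)

# A collapsed codebook that only ever emits code 0: low usage, PPL of 1.
collapsed_usage, collapsed_ppl = codebook_stats([0] * 40, codebook_size=8)
```

High usage with low PPL would mean that most codes appear but a few dominate; the two numbers together describe how evenly the codebook is exercised.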

Additionally, we conducted a comprehensive comparative analysis of DistilCodec’s speech reconstruction capabilities using the LibriSpeech-Clean-Test benchmark.

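| Model | Codebook Size | Nq | Token Rate (TPS) | Bandwidth (bps) | STOI ↑ | PESQ ↑ | UTMOS ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Encodec | 1024 | 8 | 600 | 6000 | 0.94 | 2.75 | 3.07 |
| DAC | 1024 | 12 | 600 | 6000 | 0.95 | 4.01 | 4.00 |
| Encodec | 1024 | 2 | 150 | 1500 | 0.84 | 1.56 | 1.58 |
| Mimi | 2048 | 8 | 100 | 1100 | 0.91 | 2.25 | 3.56 |
| BigCodec | 8192 | 1 | 80 | 1040 | 0.94 | 2.68 | 4.11 |
| DAC | 1024 | 2 | 100 | 1000 | 0.73 | 1.14 | 1.29 |
| SpeechTokenizer | 1024 | 2 | 100 | 1000 | 0.77 | 1.25 | 2.28 |
| X-codec | 1024 | 2 | 100 | 1000 | 0.86 | 2.33 | 4.21 |
| WavTokenizer | 4096 | 1 | 75 | 900 | 0.89 | 2.14 | 3.94 |
| X-codec2 | 65536 | 1 | 50 | 800 | 0.92 | 2.43 | 4.13 |
| StableCodec | 15625 | 2 | 50 | 697 | 0.91 | 2.24 | 4.23 |
| Single-Codec | 8192 | 1 | 23.4 | 304 | 0.86 | 1.88 | 3.72 |
| BiCodec | 8192 | 1 | 50 | 650 | 0.92 | 2.51 | 4.18 |
| DistilCodec | 32768 | 1 | 93 | 1300 | 0.93 | 2.02 | 3.75 |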
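
The Bandwidth column can be related to the Token Rate and Codebook Size columns with a simple formula: tokens per second times bits per token, where bits per token is log2 of the codebook size. The sketch below applies it to a few rows; `bandwidth_bps` is a hypothetical helper, and the formula is an assumption that happens to match the rows checked here.

```python
import math


def bandwidth_bps(token_rate_tps, codebook_size):
    """Bits per second = tokens per second * bits needed to index one token."""
    return token_rate_tps * math.log2(codebook_size)


rows = {
    "Encodec (Nq=8)": (600, 1024),
    "WavTokenizer": (75, 4096),
    "StableCodec": (50, 15625),
    "DistilCodec": (93, 32768),
}
for name, (tps, size) in rows.items():
    print(name, round(bandwidth_bps(tps, size)))
```

For Encodec, WavTokenizer, and StableCodec this reproduces the listed 6000, 900, and 697 bps. For the DistilCodec row the same formula gives 1395 bps rather than the listed 1300; the text above does not say how the listed figure was derived.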

Since DistilCodec was trained on universal audio, we first used UTMOS for automatic quality assessment. However, the universal audio test set received an unreliably low score (1.89), indicating that UTMOS is inadequate for universal audio evaluation. We therefore conducted a Mean Opinion Score (MOS) evaluation; the results are shown below:

| Assessment Items | Reconstructed | Original |
| --- | --- | --- |
| Speech Clarity | 4.689 | 4.945 |
| Background Audio Clarity | 4.768 | 4.927 |
| Average Score | 4.728 | 4.936 |

Installation of DistilCodec

pip3 install git+https://github.com/mesolitica/DistilCodec

Inference of DistilCodec

Part1: Generate audio tokens using DistilCodec

from distilcodec import DistilCodec, demo_for_generate_audio_codes
from huggingface_hub import hf_hub_download

# Download the model config and checkpoint from the Hugging Face Hub.
codec_model_config_path = hf_hub_download(repo_id="IDEA-Emdoor/DistilCodec-v1.0", filename="model_config.json")
codec_ckpt_path = hf_hub_download(repo_id="IDEA-Emdoor/DistilCodec-v1.0", filename="g_00204000")

# Load the codec and put it in evaluation mode.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False).eval()

# Encode an audio file into discrete audio tokens.
audio_path = "test.mp3"
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    plus_llm_offset=False
)
print(audio_tokens)

Part2: Reconstruct audio with audio tokens generated from DistilCodec

# Reuses `codec` and `audio_tokens` from Part 1.
y_gen = codec.decode_from_codes(
    audio_tokens,
    minus_token_offset=False
)

Available DistilCodec models

| Model Version | Hugging Face | Corpus | Token/s | Domain |
| --- | --- | --- | --- | --- |
| DistilCodec-v1.0 | HuggingFace | Universal Audio | 93 | Universal Audio |
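
The Token/s figure gives a quick way to estimate how many tokens a clip will produce, which is useful when budgeting sequence length for a downstream language model. The helper below is a hypothetical back-of-the-envelope sketch that treats 93 tokens per second as exact; actual counts may differ slightly depending on padding and framing.

```python
def estimate_token_count(duration_seconds, tokens_per_second=93):
    """Approximate number of codec tokens for a clip of the given length."""
    return round(duration_seconds * tokens_per_second)


for seconds in (1, 10, 60):
    print(seconds, estimate_token_count(seconds))
```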

References

The overall training pipeline of DistilCodec draws inspiration from AcademiCodec, while its encoder and decoder design is adapted from fish-speech. The Vector Quantization (VQ) component implements GRFVQ using the vector-quantize-pytorch framework. These three exceptional works have provided invaluable assistance in our implementation of DistilCodec. Below are links to these reference projects:

[1] vector-quantize-pytorch

[2] AcademiCodec

[3] fish-speech

Citation

If you find our work useful in your research, please cite our work:

@misc{wang2025unittsendtoendttsdecoupling,
      title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information}, 
      author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
      year={2025},
      eprint={2505.17426},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2505.17426}, 
}

Disclaimer

DistilCodec provides universal audio discretization capability for academic research purposes only. We encourage the community to uphold safety and ethical principles in AI research and applications.

Important Notes:

  • Compliance with the model's open-source license is mandatory.

  • Unauthorized voice replication applications are strictly prohibited.

  • Developers bear no responsibility for any misuse of this model.

License

UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information © 2025 by Rui Wang, Qianguo Sun, Tianrong Chen, Zhiyun Zeng, Junlong Wu, Jiaxing Zhang is licensed under CC BY-NC-ND 4.0
