README.md
39 additions & 3 deletions
@@ -2,15 +2,17 @@
2
2
3
3
C++17 implementation of the [Demucs v4 hybrid transformer](https://github.com/facebookresearch/demucs), a PyTorch neural network for music demixing. Similar project to [umx.cpp](https://github.com/sevagh/umx.cpp). This code powers my site <https://freemusicdemixer.com>.
4
4
5
-
It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `htdemucs`, `htdemucs_6s`, and `htdemucs_ft` (4-source, 6-source, fine-tuned) to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference.
5
+
It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `htdemucs`, `htdemucs_6s`, and `htdemucs_ft` (4-source, 6-source, fine-tuned) to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference. There are also programs for multi-threaded Demucs inference using C++11's `std::thread`.
6
6
7
7
**All Hybrid-Transformer weights** (4-source, 6-source, fine-tuned) are supported. See the [Convert weights](#convert-weights) section below. Demixing quality is nearly identical to PyTorch as shown in the [SDR scores doc](./.github/SDR_scores.md).
8
8
9
9
### Directory structure
10
10
11
-
`src` contains the library for Demucs inference, and `cli-apps` contains two driver programs, which compile to:
11
+
`src` contains the library for Demucs inference, and `cli-apps` contains four driver programs, which compile to:
12
12
1. `demucs.cpp.main`: run a single model (4s, 6s, or a single fine-tuned model)
13
-
2. `demucs_ft.cpp.main`: run all 4 fine-tuned models for `htdemucs_ft` inference, same as the BagOfModels idea of PyTorch Demucs
13
+
1. `demucs_ft.cpp.main`: run all four fine-tuned models for `htdemucs_ft` inference, same as the BagOfModels idea of PyTorch Demucs
14
+
1. `demucs_mt.cpp.main`: run a single model, multi-threaded
15
+
1. `demucs_ft_mt.cpp.main`: run all four fine-tuned models, multi-threaded
14
16
15
17
### Multi-core, OpenMP, BLAS, etc.
16
18
@@ -21,6 +23,40 @@ If you have OpenMP and OpenBLAS installed, OpenBLAS might automatically use all
21
23
22
24
See the [BLAS benchmarks doc](./.github/BLAS_benchmarks.md) for more details.
23
25
26
+
### Multi-threading
27
+
28
+
There are two new programs, `demucs_mt.cpp.main` and `demucs_ft_mt.cpp.main`, that use C++11 [`std::thread`](https://en.cppreference.com/w/cpp/thread/thread).
29
+
30
+
In the single-threaded programs:
31
+
32
+
* User supplies a waveform of length N seconds
33
+
* Waveform is split into 7.8-second segments for Demucs inference
34
+
* Segments are processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
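The sequential splitting described above can be sketched as follows. This is a minimal illustration, not the code from `src`: the function name `split_into_segments` is hypothetical, and it assumes back-to-back segments with no overlap between them. At an assumed 44.1 kHz sample rate, a 7.8-second segment would be 343980 samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns [begin, end) sample ranges that cover
// n_samples in order. The last segment may be shorter than segment_len.
std::vector<std::pair<std::size_t, std::size_t>>
split_into_segments(std::size_t n_samples, std::size_t segment_len) {
    assert(segment_len > 0);
    std::vector<std::pair<std::size_t, std::size_t>> segments;
    for (std::size_t begin = 0; begin < n_samples; begin += segment_len) {
        // Clamp the end of the final segment to the end of the waveform.
        segments.emplace_back(begin, std::min(n_samples, begin + segment_len));
    }
    return segments;
}
```

Under these assumptions, a 4-minute song (10584000 samples) would yield 31 segments, each handed to the inference routine one after another.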
35
+
36
+
In the multi-threaded programs:
37
+
* User supplies a waveform of length N seconds and a `num_threads` argument
38
+
* Waveform is split into `num_threads` sub-waveforms (each of length M < N) that overlap by 0.75 seconds and are processed in parallel
39
+
* We always need overlapping segments in audio applications to eliminate [boundary artifacts](https://freemusicdemixer.com/under-the-hood/2024/02/23/Demucs-segmentation#boundary-artifacts-and-the-overlap-add-method)
40
+
* `num_threads` threads are launched to perform Demucs inference on the sub-waveforms in parallel
41
+
* Within each thread, the sub-waveform is split into 7.8-second segments
42
+
* Segments within a thread are still processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
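The multi-threaded steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `cli-apps`: `Chunk`, `make_chunks`, `run_parallel`, and the `Inference` callback are hypothetical names, the callback stands in for per-sub-waveform Demucs inference and is assumed to return as many samples as it receives, and the overlap is simply discarded when stitching rather than blended with overlap-add. At an assumed 44.1 kHz sample rate, a 0.75-second overlap would be 33075 samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical description of one sub-waveform. [core_begin, core_end) is the
// region this thread owns; [pad_begin, pad_end) adds overlap on each side.
struct Chunk {
    std::size_t pad_begin, core_begin, core_end, pad_end;
};

std::vector<Chunk> make_chunks(std::size_t n_samples, std::size_t num_threads,
                               std::size_t overlap) {
    assert(num_threads > 0);
    std::vector<Chunk> chunks;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t core_begin = n_samples * t / num_threads;
        const std::size_t core_end = n_samples * (t + 1) / num_threads;
        // Extend by the overlap, clamped to the waveform boundaries.
        const std::size_t pad_begin = core_begin > overlap ? core_begin - overlap : 0;
        const std::size_t pad_end = std::min(n_samples, core_end + overlap);
        chunks.push_back({pad_begin, core_begin, core_end, pad_end});
    }
    return chunks;
}

// Placeholder for inference on one sub-waveform (assumption: same length out as in).
using Inference = std::function<std::vector<float>(const std::vector<float>&)>;

std::vector<float> run_parallel(const std::vector<float>& wave,
                                std::size_t num_threads, std::size_t overlap,
                                const Inference& infer) {
    const auto chunks = make_chunks(wave.size(), num_threads, overlap);
    std::vector<std::vector<float>> results(chunks.size());
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < chunks.size(); ++t) {
        // Each thread copies its padded sub-waveform and writes only results[t].
        workers.emplace_back([&, t] {
            const Chunk& c = chunks[t];
            std::vector<float> sub(wave.begin() + c.pad_begin, wave.begin() + c.pad_end);
            results[t] = infer(sub);
        });
    }
    for (auto& w : workers) w.join();

    // Stitch: keep only each chunk's core region and drop the padding.
    std::vector<float> out(wave.size());
    for (std::size_t t = 0; t < chunks.size(); ++t) {
        const Chunk& c = chunks[t];
        std::copy(results[t].begin() + (c.core_begin - c.pad_begin),
                  results[t].begin() + (c.core_end - c.pad_begin),
                  out.begin() + c.core_begin);
    }
    return out;
}
```

Because every thread writes to its own slot in `results`, no locking is needed in this sketch; the only synchronization point is the `join()` before stitching.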
43
+
44
+
For the single-threaded `demucs.cpp.main`, I suggest setting `OMP_NUM_THREADS=$num_physical_cores`. On my 16-core 5950X system, the execution time for a 4-minute song was:
45
+
```
46
+
real 10m23.201s
47
+
user 29m42.190s
48
+
sys 4m17.248s
49
+
```
50
+
51
+
For the multi-threaded `demucs_mt.cpp.main`, using 4 `std::thread`s with `OMP_NUM_THREADS=4` (4x4 = 16 physical cores):
52
+
```
53
+
real 4m9.331s
54
+
user 18m59.731s
55
+
sys 3m28.465s
56
+
```
57
+
58
+
That is roughly 2.5x faster in wall-clock time (4m9s vs. 10m23s) with 4 threads. This is inspired by the parallelism strategy used in <https://freemusicdemixer.com>.