Commit fedd352

committed
Android app + doc cleanup
1 parent 2483139 commit fedd352

5 files changed

+68
-58
lines changed

.github/BLAS_benchmarks.md

Lines changed: 0 additions & 9 deletions
This file was deleted.

.github/PERFORMANCE.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
### Multi-core, OpenMP, BLAS, etc.
2+
3+
:warning: `demucs.cpp` library code in `./src` **should not use any threading (e.g. pthread or OpenMP) except through the BLAS interface.** This is because demucs.cpp is compiled to a single-threaded WebAssembly module in <https://freemusicdemixer.com>.
4+
5+
If you have OpenMP and OpenBLAS installed, OpenBLAS might automatically use all of the threads on your machine, which doesn't always run the fastest. Use the `OMP_NUM_THREADS` environment variable to limit this. On my 16c/32t machine, I found `OMP_NUM_THREADS=16` to be the fastest. This matches the [Eigen recommendation](https://eigen.tuxfamily.org/dox/TopicMultiThreading.html) to use the same number of threads as physical cores:
6+
>On most OS it is very important to limit the number of threads to the number of physical cores, otherwise significant slowdowns are expected, especially for operations involving dense matrices.
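A minimal sketch of how a program could choose a thread count along these lines. `std::thread::hardware_concurrency()` typically reports logical (SMT) threads rather than physical cores, so the sketch divides by an assumed SMT factor of 2. The helper names `guess_physical_cores` and `pick_thread_count` are hypothetical and not part of demucs.cpp:

```cpp
#include <algorithm>
#include <cstdlib>

// Hypothetical helper: estimate physical cores from the logical thread count,
// assuming `smt_ways` hardware threads per core (2 on a 16c/32t machine).
unsigned guess_physical_cores(unsigned logical_threads, unsigned smt_ways = 2)
{
    if (logical_threads == 0 || smt_ways == 0) {
        return 1; // count unknown: fall back to a single thread
    }
    return std::max(1u, logical_threads / smt_ways);
}

// Hypothetical helper: honor OMP_NUM_THREADS when it holds a positive
// integer, otherwise fall back to the physical-core estimate.
unsigned pick_thread_count(const char *omp_env, unsigned logical_threads)
{
    if (omp_env != nullptr) {
        char *end = nullptr;
        const long parsed = std::strtol(omp_env, &end, 10);
        if (end != omp_env && parsed > 0) {
            return static_cast<unsigned>(parsed);
        }
    }
    return guess_physical_cores(logical_threads);
}
```

A caller might pass `std::getenv("OMP_NUM_THREADS")` and `std::thread::hardware_concurrency()` and hand the result to `omp_set_num_threads` or `Eigen::setNbThreads`. The SMT factor is a guess, so measuring on the target machine remains the better guide.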
7+
8+
### BLAS benchmarks
9+
10+
The benchmark plots below show the performance of different BLAS libraries (OpenBLAS, Intel MKL, AMD AOCL BLIS) with different numbers of threads on my AMD Ryzen 9 5950X (Zen 3, 16c/32t). In my case, 16 threads with OpenBLAS is a good blend of performance and memory usage.
11+
12+
<img alt="bench-wall-time" src="./wall_time_comparison.png" width="500"/>
13+
<img alt="bench-cpu-time" src="./cpu_time_comparison.png" width="500"/>
14+
<img alt="bench-memory" src="./memory_usage_comparison.png" width="500"/>
15+
16+
I didn't include any GPU BLAS libraries (NVBLAS, cuBLAS, etc.) because I'm limiting the scope of demucs.cpp to the CPU. The original PyTorch version of Demucs is suitable for GPU acceleration.
17+
18+
### GPUs, cuBLAS, NVBLAS
19+
20+
There is a [branch](https://github.com/sevagh/demucs.cpp/tree/nvblas) where I explored NVBLAS (a cuBLAS wrapper with automatic host-GPU memory transfers). It wasn't very useful, which is what I expected: demucs.cpp is heavy on for-loops and small matrix-vector or matrix-matrix multiplications. That design lets it run on Android phones (typically with small amounts of memory, 6-8 GB on flagships) and in WebAssembly (which has a 4 GB memory limit per module).
21+
22+
If I wrote it to use large matrix broadcasts, it would probably be faster and accelerate much better on GPUs, but it would consume more memory and break the intended use case.
23+
24+
### Multi-threading
25+
26+
There are two new programs, `demucs_mt.cpp.main` and `demucs_ft_mt.cpp.main`, that use C++11 [`std::thread`](https://en.cppreference.com/w/cpp/thread/thread).
27+
28+
In the single-threaded programs:
29+
30+
* User supplies a waveform of length N seconds
31+
* Waveform is split into 7.8-second segments for Demucs inference
32+
* Segments are processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
33+
34+
In the multi-threaded programs:
35+
* User supplies a waveform of length N seconds and a `num_threads` argument
36+
* Waveform is split into `num_threads` sub-waveforms (of length M < N) to process in parallel with a 0.75-second overlap
37+
* We always need overlapping segments in audio applications to eliminate [boundary artifacts](https://freemusicdemixer.com/under-the-hood/2024/02/23/Demucs-segmentation#boundary-artifacts-and-the-overlap-add-method)
38+
* `num_threads` threads are launched to perform Demucs inference on the sub-waveforms in parallel
39+
* Within each thread, the sub-waveform is split into 7.8-second segments
40+
* Segments within a thread are still processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
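The splitting and thread launch described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual demucs.cpp code: `Range`, `split_with_overlap`, and `run_parallel` are hypothetical names, the overlap is assumed to extend each sub-waveform on both sides, and `infer` stands in for Demucs inference on one sub-waveform.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Half-open sample range [begin, end) of one sub-waveform.
using Range = std::pair<std::size_t, std::size_t>;

// Split n_samples into up to num_threads contiguous chunks, then extend each
// chunk by `overlap` samples on both sides, clamped to the waveform bounds.
// Fewer ranges are returned when the chunks run out of samples.
std::vector<Range> split_with_overlap(std::size_t n_samples,
                                      std::size_t num_threads,
                                      std::size_t overlap)
{
    std::vector<Range> ranges;
    if (n_samples == 0 || num_threads == 0) {
        return ranges;
    }
    num_threads = std::min(num_threads, n_samples);
    // Ceiling division so the chunks cover every sample.
    const std::size_t chunk = (n_samples + num_threads - 1) / num_threads;
    for (std::size_t i = 0; i < num_threads; ++i) {
        const std::size_t core_begin = i * chunk;
        if (core_begin >= n_samples) {
            break;
        }
        const std::size_t core_end = std::min(n_samples, core_begin + chunk);
        const std::size_t begin = core_begin > overlap ? core_begin - overlap : 0;
        const std::size_t end = std::min(n_samples, core_end + overlap);
        ranges.emplace_back(begin, end);
    }
    return ranges;
}

// Run `infer(index, range)` for each range in its own std::thread and wait
// for all of them. `infer` is a placeholder for per-sub-waveform inference.
template <typename Fn>
void run_parallel(const std::vector<Range> &ranges, Fn infer)
{
    std::vector<std::thread> workers;
    workers.reserve(ranges.size());
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        workers.emplace_back(infer, i, ranges[i]);
    }
    for (auto &worker : workers) {
        worker.join();
    }
}
```

Assuming a 44.1 kHz sample rate, a 0.75-second overlap corresponds to 33075 samples. After the threads finish, the overlapping regions still have to be blended (for example with overlap-add) when the outputs are stitched back together; that step is omitted here.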
41+
42+
For the single-threaded `demucs.cpp.main`, my suggestion is `OMP_NUM_THREADS=$num_physical_cores`. On my 5950X system with 16 cores, execution time for a 4-minute song:
43+
```
44+
real 10m23.201s
45+
user 29m42.190s
46+
sys 4m17.248s
47+
```
48+
49+
For the multi-threaded `demucs_mt.cpp.main`, using 4 `std::thread` workers with `OMP_NUM_THREADS=4` each (4x4 = 16 physical cores):
50+
```
51+
real 4m9.331s
52+
user 18m59.731s
53+
sys 3m28.465s
54+
```
55+
56+
The multi-threaded run is more than 2x faster (about 2.5x in wall-clock time) with 4 threads. This is inspired by the parallelism strategy used in <https://freemusicdemixer.com>.
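As a quick check, the wall-clock speedup follows directly from the two `real` times reported above, converted to seconds. This is only arithmetic on those figures, not a new measurement:

$$
S = \frac{T_{\text{single}}}{T_{\text{multi}}} = \frac{623.201\ \text{s}}{249.331\ \text{s}} \approx 2.50
$$

Summing `user` and `sys` gives the total CPU time, which drops from 2039.438 s to 1348.196 s (roughly 34% less), so in this run the multi-threaded program finished sooner while also spending less CPU time overall.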

.github/android-screenshot.png

276 KB

.github/google-play-badge.png

4.79 KB

README.md

Lines changed: 12 additions & 49 deletions
@@ -1,6 +1,16 @@
11
# demucs.cpp
22

3-
C++17 implementation of the [Demucs v4 hybrid transformer](https://github.com/facebookresearch/demucs), a PyTorch neural network for music demixing. Similar project to [umx.cpp](https://github.com/sevagh/umx.cpp). This code powers my site <https://freemusicdemixer.com>.
3+
C++17 library that implements the inference of the [Demucs v4 hybrid transformer model](https://github.com/facebookresearch/demucs), a PyTorch neural network for music demixing.
4+
5+
It depends only on the standard library and the header-only library [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page), making it suitable to compile and run on many platforms. It is designed for low-memory environments, trading away some of the speed of the PyTorch implementation.
6+
7+
Demucs.cpp powers my websites (<https://freemusicdemixer.com>, <https://pro.freemusicdemixer.com>) and now my new Android app [Music Demixer](https://play.google.com/store/apps/details?id=com.freemusicdemixer.pro) to bring Demucs to your pocket!
8+
9+
<a href="https://play.google.com/store/apps/details?id=com.freemusicdemixer.pro"><img src=".github/android-screenshot.png" width="128px" alt="music-demixer-android"/></a> <a href="https://play.google.com/store/apps/details?id=com.freemusicdemixer.pro"><img alt="google-play-badge" width="150px" src=".github/google-play-badge.png"/></a>
10+
11+
See my other project [umx.cpp](https://github.com/sevagh/umx.cpp) for a similar library for Open-Unmix.
12+
13+
### Library design
414

515
It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `htdemucs`, `htdemucs_6s`, and `htdemucs_ft` (4-source, 6-source, fine-tuned) to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference. There are also programs for multi-threaded Demucs inference using C++11's `std::thread`.
616

@@ -14,48 +24,7 @@ It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio f
1424
1. `demucs_mt.cpp.main`: run a single model, multi-threaded
1525
1. `demucs_ft_mt.cpp.main`: run all four fine-tuned models, multi-threaded
1626

17-
### Multi-core, OpenMP, BLAS, etc.
18-
19-
:warning: `demucs.cpp` library code in `./src` **should not use any threading (e.g. pthread or OpenMP) except through the BLAS interface.** This is because demucs.cpp is compiled to a single-threaded WebAssembly module in <https://freemusicdemixer.com>.
20-
21-
If you have OpenMP and OpenBLAS installed, OpenBLAS might automatically use all of the threads on your machine, which doesn't always run the fastest. Use the `OMP_NUM_THREADS` environment variable to limit this. On my 16c/32t machine, I found `OMP_NUM_THREADS=16` to be the fastest. This matches the [Eigen recommendation](https://eigen.tuxfamily.org/dox/TopicMultiThreading.html) to use the same number of threads as physical cores:
22-
>On most OS it is very important to limit the number of threads to the number of physical cores, otherwise significant slowdowns are expected, especially for operations involving dense matrices.
23-
24-
See the [BLAS benchmarks doc](./.github/BLAS_benchmarks.md) for more details.
25-
26-
### Multi-threading
27-
28-
There are two new programs, `demucs_mt.cpp.main` and `demucs_ft_mt.cpp.main` that use C++11 [std::threads](https://en.cppreference.com/w/cpp/thread/thread).
29-
30-
In the single-threaded programs:
31-
32-
* User supplies a waveform of length N seconds
33-
* Waveform is split into 7.8-second segments for Demucs inference
34-
* Segments are processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
35-
36-
In the multi-threaded programs:
37-
* User supplies a waveform of length N seconds and a `num_threads` argument
38-
* Waveform is split into `num_threads` sub-waveforms (of length M < N) to process in parallel with a 0.75-second overlap
39-
* We always need overlapping segments in audio applications to eliminate [boundary artifacts](https://freemusicdemixer.com/under-the-hood/2024/02/23/Demucs-segmentation#boundary-artifacts-and-the-overlap-add-method)
40-
* `num_threads` threads are launched to perform Demucs inference on the sub-waveforms in parallel
41-
* Within each thread, the sub-waveform is split into 7.8-second segments
42-
* Segments within a thread are still processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
43-
44-
For the single-threaded `demucs.cpp.main`, my suggestion is `OMP_NUM_THREADS=$num_physical_cores`. On my 5950X system with 16 cores, execution time for a 4-minute song:
45-
```
46-
real 10m23.201s
47-
user 29m42.190s
48-
sys 4m17.248s
49-
```
50-
51-
For the multi-threaded `demucs_mt.cpp.main`, using 4 `std::thread` and OMP threads = 4 (4x4 = 16 physical cores):
52-
```
53-
real 4m9.331s
54-
user 18m59.731s
55-
sys 3m28.465s
56-
```
57-
58-
More than 2x faster for 4 threads. This is inspired by the parallelism strategy used in <https://freemusicdemixer.com>.
27+
See the [PERFORMANCE doc](./.github/PERFORMANCE.md) for details on multi-threading, external BLAS libraries, etc.
5928

6029
## Instructions
6130

@@ -149,9 +118,3 @@ Encoder Status: 0
149118
```
150119

151120
For the 6-source model, additional targets 4 and 5 correspond to guitar and piano.
152-
153-
## Dev tips
154-
155-
* make lint
156-
* Valgrind memory error test: `valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose ./demucs.cpp.main ../ggml-demucs/ggml-model-htdemucs-f16.bin ../test/data/gspi_stereo.wav ./demucs-out-cpp/`
157-
* Callgrind + KCachegrind: `valgrind --tool=callgrind ./demucs.cpp.test --gtest_filter='*FreqDec*'`
