Commit 6de86de

Threaded inference (#10)
1 parent 5cf8cb7 commit 6de86de

17 files changed: 864 additions, 63 deletions

.clang-format

Lines changed: 1 addition & 1 deletion
@@ -3,4 +3,4 @@ IndentWidth: 4
 BreakBeforeBraces: Allman
 AllowShortIfStatementsOnASingleLine: false
 IndentCaseLabels: false
-ColumnLimit: 80
+ColumnLimit: 80

.github/SDR_scores.md

Lines changed: 50 additions & 0 deletions
@@ -59,3 +59,53 @@ drums ==> SDR: 10.463 SIR: 19.782 ISR: 17.144 SAR: 11.132
 bass ==> SDR: 4.584 SIR: 9.359 ISR: 9.068 SAR: 4.885
 other ==> SDR: 7.426 SIR: 12.793 ISR: 12.975 SAR: 7.830
 ```
+
+### Performance of multi-threaded inference
+
+Zeno - Signs, Demucs 4s multi-threaded using the same strategy used in <https://freemusicdemixer.com>.
+
+Optimal performance: `export OMP_NUM_THREADS=4` + 4 threads via cli args for a total of 16 physical cores on my 5950X.
+
+This should be identical in SDR but still worth testing since multi-threaded large waveform segmentation may still impact demixing quality:
+```
+vocals ==> SDR: 8.317 SIR: 18.089 ISR: 15.887 SAR: 8.391
+drums ==> SDR: 9.987 SIR: 18.579 ISR: 16.997 SAR: 10.755
+bass ==> SDR: 4.039 SIR: 12.531 ISR: 6.822 SAR: 3.090
+other ==> SDR: 7.405 SIR: 11.246 ISR: 14.186 SAR: 8.099
+```
+
+Multi-threaded fine-tuned:
+```
+```
+
+### Time measurements
+
+Regular, big threads = 1, OMP threads = 16:
+```
+real 10m23.201s
+user 29m42.190s
+sys 4m17.248s
+```
+
+Fine-tuned, big threads = 1, OMP threads = 16: probably 4x the above, since it's just tautologically 4 Demucs models.
+
+Mt, big threads = 4, OMP threads = 4 (4x4 = 16):
+```
+real 4m9.331s
+user 18m59.731s
+sys 3m28.465s
+```
+
+Ft Mt, big threads = 4, OMP threads = 4 (4x4 = 16):
+```
+real 16m30.252s
+user 74m27.250s
+sys 14m40.643s
+```
+
+Mt, big threads = 8, OMP threads = 16:
+```
+real 4m9.304s
+user 43m21.830s
+sys 10m15.712s
+```
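As a rough cross-check of the timings recorded in this file, the sketch below converts the `real` values printed by `time` into seconds and takes their ratio. The helper names `parse_time_seconds` and `speedup` are hypothetical and are not part of demucs.cpp. For the two 4-source runs (10m23.201s with one big thread, 4m9.331s with 4x4 threads) the ratio works out to roughly 2.5x.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: convert a `time`-style duration such as
// "10m23.201s" into seconds.
double parse_time_seconds(const std::string &s)
{
    const std::size_t m = s.find('m');
    assert(m != std::string::npos);
    const double minutes = std::stod(s.substr(0, m));
    // std::stod stops parsing at the trailing 's'
    const double seconds = std::stod(s.substr(m + 1));
    return minutes * 60.0 + seconds;
}

// Hypothetical helper: wall-clock speedup of `fast` relative to `baseline`.
double speedup(const std::string &baseline, const std::string &fast)
{
    return parse_time_seconds(baseline) / parse_time_seconds(fast);
}
```

Calling `speedup("10m23.201s", "4m9.331s")` compares the single-threaded and 4x4 runs; the same helper can be pointed at the `user` or `sys` rows to compare total CPU time instead of wall-clock time.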

CMakeLists.txt

Lines changed: 13 additions & 3 deletions
@@ -16,7 +16,7 @@ endif()
 set(CMAKE_CXX_FLAGS "-Wall -Wextra")
 set(CMAKE_CXX_FLAGS_DEBUG "-g -DEIGEN_FAST_MATH=0 -O0")

-set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -march=native -fno-unsafe-math-optimizations -fassociative-math -freciprocal-math -fno-signed-zeros")
+set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -march=native -fno-unsafe-math-optimizations -freciprocal-math -fno-signed-zeros")

 # define a macro NDEBUG for Eigen3 release builds
 set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -DNDEBUG")
@@ -91,14 +91,24 @@ add_executable(demucs_ft.cpp.main "cli-apps/demucs_ft.cpp")
 target_include_directories(demucs_ft.cpp.main PRIVATE vendor/libnyquist/include)
 target_link_libraries(demucs_ft.cpp.main demucs.cpp.lib libnyquist)

-file(GLOB SOURCES_TO_LINT "src/*.cpp" "src/*.hpp" "cli-apps/*.cpp")
+add_executable(demucs_mt.cpp.main "cli-apps/demucs_mt.cpp")
+target_include_directories(demucs_mt.cpp.main PRIVATE vendor/libnyquist/include)
+target_include_directories(demucs_mt.cpp.main PRIVATE cli-apps)
+target_link_libraries(demucs_mt.cpp.main demucs.cpp.lib libnyquist)
+
+add_executable(demucs_ft_mt.cpp.main "cli-apps/demucs_ft_mt.cpp")
+target_include_directories(demucs_ft_mt.cpp.main PRIVATE vendor/libnyquist/include)
+target_include_directories(demucs_ft_mt.cpp.main PRIVATE cli-apps)
+target_link_libraries(demucs_ft_mt.cpp.main demucs.cpp.lib libnyquist)
+
+file(GLOB SOURCES_TO_LINT "src/*.cpp" "src/*.hpp" "cli-apps/*.cpp" "cli-apps/*.hpp")

 # add target to run standard lints and formatters
 add_custom_target(lint
 COMMAND clang-format -i ${SOURCES_TO_LINT} --style=file
 # add clang-tidy command
 # add include dirs to clang-tidy
-COMMAND cppcheck --enable=all --suppress=missingIncludeSystem ${SOURCES_TO_LINT} --std=c++17
+COMMAND cppcheck -I"src/" -I"cli-apps/" --enable=all --suppress=missingIncludeSystem ${SOURCES_TO_LINT} --std=c++17
 COMMAND scan-build -o ${CMAKE_BINARY_DIR}/scan-build-report make -C ${CMAKE_BINARY_DIR}
 WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
 )

README.md

Lines changed: 39 additions & 3 deletions
@@ -2,15 +2,17 @@

 C++17 implementation of the [Demucs v4 hybrid transformer](https://github.com/facebookresearch/demucs), a PyTorch neural network for music demixing. Similar project to [umx.cpp](https://github.com/sevagh/umx.cpp). This code powers my site <https://freemusicdemixer.com>.

-It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `htdemucs`, `htdemucs_6s`, and `htdemucs_ft` (4-source, 6-source, fine-tuned) to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference.
+It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `htdemucs`, `htdemucs_6s`, and `htdemucs_ft` (4-source, 6-source, fine-tuned) to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference. There are also programs for multi-threaded Demucs inference using C++11's `std::thread`.

 **All Hybrid-Transformer weights** (4-source, 6-source, fine-tuned) are supported. See the [Convert weights](#convert-weights) section below. Demixing quality is nearly identical to PyTorch as shown in the [SDR scores doc](./.github/SDR_scores.md).

 ### Directory structure

-`src` contains the library for Demucs inference, and `cli-apps` contains two driver programs, which compile to:
+`src` contains the library for Demucs inference, and `cli-apps` contains four driver programs, which compile to:
 1. `demucs.cpp.main`: run a single model (4s, 6s, or a single fine-tuned model)
-2. `demucs_ft.cpp.main`: run all 4 fine-tuned models for `htdemucs_ft` inference, same as the BagOfModels idea of PyTorch Demucs
+1. `demucs_ft.cpp.main`: run all four fine-tuned models for `htdemucs_ft` inference, same as the BagOfModels idea of PyTorch Demucs
+1. `demucs_mt.cpp.main`: run a single model, multi-threaded
+1. `demucs_ft_mt.cpp.main`: run all four fine-tuned models, multi-threaded

 ### Multi-core, OpenMP, BLAS, etc.

@@ -21,6 +23,40 @@ If you have OpenMP and OpenBLAS installed, OpenBLAS might automatically use all

 See the [BLAS benchmarks doc](./.github/BLAS_benchmarks.md) for more details.

+### Multi-threading
+
+There are two new programs, `demucs_mt.cpp.main` and `demucs_ft_mt.cpp.main` that use C++11 [std::threads](https://en.cppreference.com/w/cpp/thread/thread).
+
+In the single-threaded programs:
+
+* User supplies a waveform of length N seconds
+* Waveform is split into 7.8-second segments for Demucs inference
+* Segments are processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
+
+In the multi-threaded programs:
+* User supplies a waveform of length N seconds and a `num_threads` argument
+* Waveform is split into `num_threads` sub-waveforms (of length M < N) to process in parallel with a 0.75-second overlap
+* We always need overlapping segments in audio applications to eliminate [boundary artifacts](https://freemusicdemixer.com/under-the-hood/2024/02/23/Demucs-segmentation#boundary-artifacts-and-the-overlap-add-method)
+* `num_threads` threads are launched to perform Demucs inference on the sub-waveforms in parallel
+* Within each thread, the sub-waveform is split into 7.8-second segments
+* Segments within a thread are still processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
+
+For the single-threaded `demucs.cpp.main`, my suggestion is `OMP_NUM_THREADS=$num_physical_cores`. On my 5950X system with 16 cores, execution time for a 4-minute song:
+```
+real 10m23.201s
+user 29m42.190s
+sys 4m17.248s
+```
+
+For the multi-threaded `demucs_mt.cpp.main`, using 4 `std::thread` and OMP threads = 4 (4x4 = 16 physical cores):
+```
+real 4m9.331s
+user 18m59.731s
+sys 3m28.465s
+```
+
+More than 2x faster for 4 threads. This is inspired by the parallelism strategy used in <https://freemusicdemixer.com>.
+
 ## Instructions

 ### Build C++ code
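The multi-threading strategy described in the README hunk above (split the waveform into `num_threads` overlapping sub-waveforms, run one `std::thread` per sub-waveform, then recombine) can be sketched roughly as follows. This is a minimal illustration and not the code in `cli-apps/demucs_mt.cpp`: the names `ChunkRange`, `split_with_overlap`, `ProcessFn` and `process_parallel` are hypothetical, `process` stands in for Demucs inference, the overlap is assumed to be applied on both sides of each chunk, and stitching simply keeps each chunk's own core region, whereas the real program may blend the overlapping regions. Assuming 44.1 kHz audio, the 0.75-second overlap would correspond to 33075 samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper type: one sub-waveform, expressed as sample indices.
struct ChunkRange
{
    std::size_t core_start; // first sample this chunk is responsible for
    std::size_t core_end;   // one past the last sample it is responsible for
    std::size_t start;      // core_start moved back by the overlap
    std::size_t end;        // core_end moved forward by the overlap
};

// Split [0, n_samples) into num_threads contiguous cores, then pad each
// core with `overlap` samples on both sides, clamped to the waveform.
std::vector<ChunkRange> split_with_overlap(std::size_t n_samples,
                                           std::size_t num_threads,
                                           std::size_t overlap)
{
    assert(num_threads > 0);
    std::vector<ChunkRange> ranges;
    const std::size_t core_len = (n_samples + num_threads - 1) / num_threads;
    for (std::size_t i = 0; i < num_threads; ++i)
    {
        const std::size_t core_start = std::min(i * core_len, n_samples);
        const std::size_t core_end =
            std::min(core_start + core_len, n_samples);
        if (core_start == core_end)
        {
            break; // fewer samples than threads
        }
        ChunkRange r;
        r.core_start = core_start;
        r.core_end = core_end;
        r.start = core_start > overlap ? core_start - overlap : 0;
        r.end = std::min(core_end + overlap, n_samples);
        ranges.push_back(r);
    }
    return ranges;
}

// Stand-in for inference: must return as many samples as it receives
// and must be safe to call from several threads at once.
using ProcessFn =
    std::function<std::vector<float>(const std::vector<float> &)>;

// Run `process` on every padded chunk in its own std::thread, then copy
// each chunk's core region into the output (overlaps are discarded).
std::vector<float> process_parallel(const std::vector<float> &waveform,
                                    std::size_t num_threads,
                                    std::size_t overlap,
                                    const ProcessFn &process)
{
    const auto ranges =
        split_with_overlap(waveform.size(), num_threads, overlap);
    std::vector<std::vector<float>> results(ranges.size());
    std::vector<std::thread> workers;

    for (std::size_t i = 0; i < ranges.size(); ++i)
    {
        workers.emplace_back(
            [&, i]()
            {
                const std::vector<float> chunk(
                    waveform.begin() + ranges[i].start,
                    waveform.begin() + ranges[i].end);
                results[i] = process(chunk); // each thread owns results[i]
            });
    }
    for (auto &w : workers)
    {
        w.join();
    }

    std::vector<float> out(waveform.size());
    for (std::size_t i = 0; i < ranges.size(); ++i)
    {
        assert(results[i].size() == ranges[i].end - ranges[i].start);
        const std::size_t offset = ranges[i].core_start - ranges[i].start;
        const std::size_t len = ranges[i].core_end - ranges[i].core_start;
        std::copy(results[i].begin() + offset,
                  results[i].begin() + offset + len,
                  out.begin() + ranges[i].core_start);
    }
    return out;
}
```

With an identity `process` the output equals the input, which is a quick way to check the index arithmetic before plugging in real inference. On Linux this typically needs to be linked with `-pthread`. Each worker would still split its sub-waveform into 7.8-second segments internally, which the sketch leaves to `process`.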

cli-apps/demucs_ft.cpp

Lines changed: 4 additions & 1 deletion
@@ -133,7 +133,6 @@ int main(int argc, const char **argv)

 // iterate over all files in model_dir
 // and load the model
-std::string model_file;
 for (const auto &entry : std::filesystem::directory_iterator(model_dir))
 {
 bool ret = false;
@@ -167,6 +166,10 @@ int main(int argc, const char **argv)
 std::cout << "Loading ft model " << entry.path().string()
 << " for vocals" << std::endl;
 }
+else
+{
+continue;
+}

 // debug some members of model
 std::cout << "demucs_model_load returned " << (ret ? "true" : "false")
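The new `else { continue; }` branch makes the loader skip directory entries that do not correspond to one of the four fine-tuned models, instead of falling through to the debug output with nothing loaded. A minimal sketch of that pattern is shown below. The matching rule (a naive substring test on the filename) and the names `ft_target_for` and `files_to_load` are assumptions made for illustration; the conditions actually used in `cli-apps/demucs_ft.cpp` are not visible in this hunk.

```cpp
#include <cassert>
#include <initializer_list>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: map a weights filename to one of the four
// fine-tuned targets, or std::nullopt when the file is unrelated.
// The substring test is an assumption, not the project's actual rule.
std::optional<std::string> ft_target_for(const std::string &filename)
{
    for (const char *target : {"drums", "bass", "other", "vocals"})
    {
        if (filename.find(target) != std::string::npos)
        {
            return std::string(target);
        }
    }
    return std::nullopt;
}

// Keep only the files that map to a target; everything else is skipped.
std::vector<std::string> files_to_load(const std::vector<std::string> &names)
{
    std::vector<std::string> kept;
    for (const auto &name : names)
    {
        if (!ft_target_for(name))
        {
            continue; // same role as the new `else { continue; }` branch
        }
        kept.push_back(name);
    }
    return kept;
}
```

For example, given the hypothetical names `ft_drums.bin`, `README.txt` and `ft_bass.bin`, only the two `.bin` files are kept.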
