sevagh
diff --git a/.clang-format b/.clang-format
Lines changed: 1 addition & 1 deletion
diff --git a/.github/SDR_scores.md b/.github/SDR_scores.md
Lines changed: 50 additions & 0 deletions
diff --git a/CMakeLists.txt b/CMakeLists.txt
Lines changed: 13 additions & 3 deletions
diff --git a/README.md b/README.md
Lines changed: 39 additions & 3 deletions
diff --git a/cli-apps/demucs_ft.cpp b/cli-apps/demucs_ft.cpp
Lines changed: 4 additions & 1 deletion
@@ -3,4 +3,4 @@ IndentWidth: 4
 BreakBeforeBraces: Allman
 AllowShortIfStatementsOnASingleLine: false
 IndentCaseLabels: false
-ColumnLimit: 80
+ColumnLimit: 80
@@ -59,3 +59,53 @@ drums           ==> SDR:  10.463  SIR:  19.782  ISR:  17.144  SAR:  11.132
 bass            ==> SDR:   4.584  SIR:   9.359  ISR:   9.068  SAR:   4.885
 other           ==> SDR:   7.426  SIR:  12.793  ISR:  12.975  SAR:   7.830
 ```
+
+### Performance of multi-threaded inference
+
+Zeno - Signs, Demucs 4s multi-threaded using the same strategy used in <https://freemusicdemixer.com>.
+
+Optimal performance: `export OMP_NUM_THREADS=4` plus 4 threads via CLI args, using all 16 physical cores (4x4) on my 5950X.
+
+This should be identical in SDR, but it is still worth testing, since splitting a large waveform across threads may affect demixing quality:
+```
+vocals          ==> SDR:   8.317  SIR:  18.089  ISR:  15.887  SAR:   8.391
+drums           ==> SDR:   9.987  SIR:  18.579  ISR:  16.997  SAR:  10.755
+bass            ==> SDR:   4.039  SIR:  12.531  ISR:   6.822  SAR:   3.090
+other           ==> SDR:   7.405  SIR:  11.246  ISR:  14.186  SAR:   8.099
+```
+
+Multi-threaded fine-tuned:
+```
+```
+
+### Time measurements
+
+Regular, big threads = 1, OMP threads = 16:
+```
+real    10m23.201s
+user    29m42.190s
+sys     4m17.248s
+```
+
+Fine-tuned, big threads = 1, OMP threads = 16: probably about 4x the above, since it simply runs 4 Demucs models instead of 1.
+
+Multi-threaded (Mt), big threads = 4, OMP threads = 4 (4x4 = 16):
+```
+real    4m9.331s
+user    18m59.731s
+sys     3m28.465s
+```
+
+Fine-tuned multi-threaded (Ft Mt), big threads = 4, OMP threads = 4 (4x4 = 16):
+```
+real    16m30.252s
+user    74m27.250s
+sys     14m40.643s
+```
+
+Mt, big threads = 8, OMP threads = 16:
+```
+real    4m9.304s
+user    43m21.830s
+sys     10m15.712s
+```
@@ -16,7 +16,7 @@ endif()
 set(CMAKE_CXX_FLAGS "-Wall -Wextra")
 set(CMAKE_CXX_FLAGS_DEBUG "-g -DEIGEN_FAST_MATH=0 -O0")
 
-set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -march=native -fno-unsafe-math-optimizations -fassociative-math -freciprocal-math -fno-signed-zeros")
+set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -march=native -fno-unsafe-math-optimizations -freciprocal-math -fno-signed-zeros")
 
 # define a macro NDEBUG for Eigen3 release builds
 set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -DNDEBUG")
@@ -91,14 +91,24 @@ add_executable(demucs_ft.cpp.main "cli-apps/demucs_ft.cpp")
 target_include_directories(demucs_ft.cpp.main PRIVATE vendor/libnyquist/include)
 target_link_libraries(demucs_ft.cpp.main demucs.cpp.lib libnyquist)
 
-file(GLOB SOURCES_TO_LINT "src/*.cpp" "src/*.hpp" "cli-apps/*.cpp")
+add_executable(demucs_mt.cpp.main "cli-apps/demucs_mt.cpp")
+target_include_directories(demucs_mt.cpp.main PRIVATE vendor/libnyquist/include)
+target_include_directories(demucs_mt.cpp.main PRIVATE cli-apps)
+target_link_libraries(demucs_mt.cpp.main demucs.cpp.lib libnyquist)
+
+add_executable(demucs_ft_mt.cpp.main "cli-apps/demucs_ft_mt.cpp")
+target_include_directories(demucs_ft_mt.cpp.main PRIVATE vendor/libnyquist/include)
+target_include_directories(demucs_ft_mt.cpp.main PRIVATE cli-apps)
+target_link_libraries(demucs_ft_mt.cpp.main demucs.cpp.lib libnyquist)
+
+file(GLOB SOURCES_TO_LINT "src/*.cpp" "src/*.hpp" "cli-apps/*.cpp" "cli-apps/*.hpp")
 
 # add target to run standard lints and formatters
 add_custom_target(lint
     COMMAND clang-format -i ${SOURCES_TO_LINT} --style=file
     # add clang-tidy command
     # add include dirs to clang-tidy
-    COMMAND cppcheck --enable=all --suppress=missingIncludeSystem ${SOURCES_TO_LINT} --std=c++17
+    COMMAND cppcheck -I"src/" -I"cli-apps/" --enable=all --suppress=missingIncludeSystem ${SOURCES_TO_LINT} --std=c++17
     COMMAND scan-build -o ${CMAKE_BINARY_DIR}/scan-build-report make -C ${CMAKE_BINARY_DIR}
     WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
 )
 
@@ -2,15 +2,17 @@
 
 C++17 implementation of the [Demucs v4 hybrid transformer](https://github.com/facebookresearch/demucs), a PyTorch neural network for music demixing. Similar project to [umx.cpp](https://github.com/sevagh/umx.cpp). This code powers my site <https://freemusicdemixer.com>.
 
-It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `htdemucs`, `htdemucs_6s`, and `htdemucs_ft` (4-source, 6-source, fine-tuned) to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference.
+It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `htdemucs`, `htdemucs_6s`, and `htdemucs_ft` (4-source, 6-source, fine-tuned) to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference. There are also programs for multi-threaded Demucs inference using C++11's `std::thread`.
 
 **All Hybrid-Transformer weights** (4-source, 6-source, fine-tuned) are supported. See the [Convert weights](#convert-weights) section below. Demixing quality is nearly identical to PyTorch as shown in the [SDR scores doc](./.github/SDR_scores.md).
 
 ### Directory structure
 
-`src` contains the library for Demucs inference, and `cli-apps` contains two driver programs, which compile to:
+`src` contains the library for Demucs inference, and `cli-apps` contains four driver programs, which compile to:
 1. `demucs.cpp.main`: run a single model (4s, 6s, or a single fine-tuned model)
-2. `demucs_ft.cpp.main`: run all 4 fine-tuned models for `htdemucs_ft` inference, same as the BagOfModels idea of PyTorch Demucs
+1. `demucs_ft.cpp.main`: run all four fine-tuned models for `htdemucs_ft` inference, same as the BagOfModels idea of PyTorch Demucs
+1. `demucs_mt.cpp.main`: run a single model, multi-threaded
+1. `demucs_ft_mt.cpp.main`: run all four fine-tuned models, multi-threaded
 
 ### Multi-core, OpenMP, BLAS, etc.
 
@@ -21,6 +23,40 @@ If you have OpenMP and OpenBLAS installed, OpenBLAS might automatically use all
 
 See the [BLAS benchmarks doc](./.github/BLAS_benchmarks.md) for more details.
 
+### Multi-threading
+
+There are two new programs, `demucs_mt.cpp.main` and `demucs_ft_mt.cpp.main`, that use C++11 [std::thread](https://en.cppreference.com/w/cpp/thread/thread).
+
+In the single-threaded programs:
+
+* User supplies a waveform of length N seconds
+* Waveform is split into 7.8-second segments for Demucs inference
+* Segments are processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
+
+In the multi-threaded programs:
+* User supplies a waveform of length N seconds and a `num_threads` argument
+* Waveform is split into `num_threads` sub-waveforms (of length M < N), with a 0.75-second overlap, to be processed in parallel
+    * We always need overlapping segments in audio applications to eliminate [boundary artifacts](https://freemusicdemixer.com/under-the-hood/2024/02/23/Demucs-segmentation#boundary-artifacts-and-the-overlap-add-method)
+* `num_threads` threads are launched to perform Demucs inference on the sub-waveforms in parallel
+* Within each thread, the sub-waveform is split into 7.8-second segments
+* Segments within a thread are still processed sequentially, where each segment inference can use >1 core with `OMP_NUM_THREADS`
+
+For the single-threaded `demucs.cpp.main`, my suggestion is `OMP_NUM_THREADS=$num_physical_cores`. On my 5950X system with 16 cores, execution time for a 4-minute song:
+```
+real    10m23.201s
+user    29m42.190s
+sys     4m17.248s
+```
+
+For the multi-threaded `demucs_mt.cpp.main`, using 4 `std::thread`s and `OMP_NUM_THREADS=4` (4x4 = 16 physical cores):
+```
+real    4m9.331s
+user    18m59.731s
+sys     3m28.465s
+```
+
+That is about 2.5x faster in wall-clock time (10m23s vs. 4m9s) using 4 threads. The approach is inspired by the parallelism strategy used in <https://freemusicdemixer.com>.
+
 ## Instructions
 
 ### Build C++ code
 
@@ -133,7 +133,6 @@ int main(int argc, const char **argv)
 
     // iterate over all files in model_dir
     // and load the model
-    std::string model_file;
     for (const auto &entry : std::filesystem::directory_iterator(model_dir))
     {
         bool ret = false;
@@ -167,6 +166,10 @@ int main(int argc, const char **argv)
             std::cout << "Loading ft model " << entry.path().string()
                       << " for vocals" << std::endl;
         }
+        else
+        {
+            continue;
+        }
 
         // debug some members of model
         std::cout << "demucs_model_load returned " << (ret ? "true" : "false")