The parallel C++ implementation, which uses thread pools, intermittently computes an incorrect gradient. In some benchmarks, parallel execution is also slower than the serial version.
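The report does not identify a cause. One common source of intermittently wrong results in code of this kind is a data race on a shared gradient buffer, so the sketch below shows a race-free accumulation pattern as a starting point for comparison, not as a diagnosis. Everything in it is an assumption made for illustration: the `Dataset` layout, the least-squares objective, the function names, and the use of plain `std::thread` workers in place of the actual thread pool. Each worker sums its slice of samples into a private buffer, and the buffers are combined in a fixed order after all workers have joined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical data layout (not from the original code): n samples of
// dimension d stored row-major in x, with one target per sample in y.
struct Dataset {
    std::vector<double> x;
    std::vector<double> y;
    std::size_t n = 0;
    std::size_t d = 0;
};

// Adds the least-squares gradient contribution of samples [begin, end)
// into grad: for each sample, (w . x_i - y_i) * x_i.
void accumulate_gradient(const Dataset& data, const std::vector<double>& w,
                         std::size_t begin, std::size_t end,
                         std::vector<double>& grad) {
    for (std::size_t i = begin; i < end; ++i) {
        const double* row = data.x.data() + i * data.d;
        double residual = -data.y[i];
        for (std::size_t j = 0; j < data.d; ++j) residual += w[j] * row[j];
        for (std::size_t j = 0; j < data.d; ++j) grad[j] += residual * row[j];
    }
}

// Serial reference used to check the parallel result.
std::vector<double> gradient_serial(const Dataset& data,
                                    const std::vector<double>& w) {
    std::vector<double> grad(data.d, 0.0);
    accumulate_gradient(data, w, 0, data.n, grad);
    return grad;
}

// Parallel version: no two threads ever write to the same buffer.
std::vector<double> gradient_parallel(const Dataset& data,
                                      const std::vector<double>& w,
                                      std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<std::vector<double>> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.n + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.n, t * chunk);
        const std::size_t end = std::min(data.n, begin + chunk);
        workers.emplace_back([&data, &w, &partial, begin, end, t] {
            // Accumulate into a thread-local buffer, then publish it once.
            std::vector<double> local(data.d, 0.0);
            accumulate_gradient(data, w, begin, end, local);
            partial[t] = std::move(local);
        });
    }
    for (std::thread& worker : workers) worker.join();

    // Reduce in thread-index order so the summation order is the same
    // on every run.
    std::vector<double> grad(data.d, 0.0);
    for (const std::vector<double>& p : partial) {
        for (std::size_t j = 0; j < data.d; ++j) grad[j] += p[j];
    }
    return grad;
}
```

Comparing `gradient_parallel(data, w, k)` against `gradient_serial(data, w)` on the same input is a simple way to check for a race. With general floating-point data the two can differ by rounding, because the summation is grouped differently, so such a comparison should use a tolerance. This sketch also starts fresh threads on every call, and for small inputs that startup cost can exceed the work being split, which is one possible reason a parallel run would benchmark slower than a serial one.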