Fix ParILUT threshold select for larger matrices with GPU backends #1877

nbeams · 2025-07-02T20:09:45Z

This PR addresses an issue that can occur with the threshold_select kernel of ParILUT, leading to non-deterministic behavior and different output than the CPU backends. It only happens when the matrix is large enough to end up inside the while (bucket.size > kernel::basecase_size) loop, which is why it isn't triggered by any of the unit tests. However, if I adjust the test_select calls in test/factorization/par_ilut_kernels.cpp to use a bigger matrix, I get

[----------] Global test environment tear-down
[==========] 256 tests from 16 test suites ran. (156180 ms total)
[  PASSED  ] 240 tests.
[  FAILED  ] 16 tests, listed below:
[  FAILED  ] ParIlut/<gko::half, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<gko::half,int>
[  FAILED  ] ParIlut/<gko::half, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<gko::half,long>
[  FAILED  ] ParIlut/<gko::bfloat16, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<gko::bfloat16,int>
[  FAILED  ] ParIlut/<gko::bfloat16, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<gko::bfloat16,long>
[  FAILED  ] ParIlut/<double, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<double,int>
[  FAILED  ] ParIlut/<double, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<double,long>
[  FAILED  ] ParIlut/<float, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<float,int>
[  FAILED  ] ParIlut/<float, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<float,long>
[  FAILED  ] ParIlut/<std::complex<gko::half>, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<std::complex<gko::half>,int>
[  FAILED  ] ParIlut/<std::complex<gko::half>, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<std::complex<gko::half>,long>
[  FAILED  ] ParIlut/<std::complex<gko::bfloat16>, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<std::complex<gko::bfloat16>,int>
[  FAILED  ] ParIlut/<std::complex<gko::bfloat16>, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<std::complex<gko::bfloat16>,long>
[  FAILED  ] ParIlut/<std::complex<double>, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<std::complex<double>,int>
[  FAILED  ] ParIlut/<std::complex<double>, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<std::complex<double>,long>
[  FAILED  ] ParIlut/<std::complex<float>, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<std::complex<float>,int>
[  FAILED  ] ParIlut/<std::complex<float>, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<std::complex<float>,long>

when testing for CUDA.

After applying the changes in this branch:

[----------] Global test environment tear-down
[==========] 256 tests from 16 test suites ran. (156397 ms total)
[  PASSED  ] 256 tests.

I haven't tested the dpcpp version, but this part of the code is the same, so I updated it as well.

Note, this doesn't change the default behavior of ParILUT; it will only matter if the user turns off approximate_select.

MarcelKoch

Could you also change the tests so that we have the original issue? The tests you posted didn't show a large difference regarding their runtime.

nbeams · 2025-07-03T14:14:50Z

Since the test is using gko::test::generate_random_lower_triangular_matrix, with std::uniform_int_distribution to determine the number of non-zeros per row, I believe that means we can't guarantee we will always get the same number of non-zeros when running the test on different systems, right? That's what I want -- a way to guarantee the final nnz is greater than a certain value. What would be the preferred way to handle that? Should I just make the bounds of the uniform_int_distribution the same so I can know exactly how many non-zeros my triangular matrix will have? And then only use this for the threshold select tests? Or is there a better way to do that within the Ginkgo testing framework?

MarcelKoch · 2025-07-16T11:24:05Z

I think what you describe, i.e. increasing the lower bound for the generator, is just fine. You could just have two different distributions, if you want to use less nnz for the other tests.

…and dpcpp

nbeams · 2025-07-16T23:08:44Z

Figuring out what sort of combination of options would produce a matrix to trigger the bug (without resorting to a super large matrix) ended up being more challenging than I expected. I based the range of values on the nine point stencil example (which I originally modified to be my tester for finding/testing this bug) and ended up winging it a bit from there. But here is the output after 579e996 (testing on ICL guyot):

[----------] Global test environment tear-down
[==========] 256 tests from 16 test suites ran. (50639 ms total)
[  PASSED  ] 248 tests.
[  FAILED  ] 8 tests, listed below:
[  FAILED  ] ParIlut/<gko::half, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<gko::half,int>
[  FAILED  ] ParIlut/<gko::half, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<gko::half,long>
[  FAILED  ] ParIlut/<gko::bfloat16, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<gko::bfloat16,int>
[  FAILED  ] ParIlut/<gko::bfloat16, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<gko::bfloat16,long>
[  FAILED  ] ParIlut/<double, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<double,int>
[  FAILED  ] ParIlut/<double, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<double,long>
[  FAILED  ] ParIlut/<float, int>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<float,int>
[  FAILED  ] ParIlut/<float, long>.KernelThresholdSelectIsEquivalentToRef, where TypeParam = std::tuple<float,long>

 8 FAILED TESTS

And after be8d2bb,

[----------] Global test environment tear-down
[==========] 256 tests from 16 test suites ran. (50456 ms total)
[  PASSED  ] 256 tests.

I also realized that I never added myself to the contributors list. Should I do that here? Or, looking at the list, perhaps it is time for a general update for emails/affiliations that have changed -- which could be a separate PR just for that?

MarcelKoch · 2025-07-17T07:21:08Z

Yes, please add yourself to the contributors list. The rest should be updated in a separate PR.

nbeams added mod:cuda This is related to the CUDA module. 1:ST:ready-for-review This PR is ready for review mod:hip This is related to the HIP module. mod:dpcpp This is related to the DPC++ module. is:bugfix This fixes a bug labels Jul 2, 2025

ginkgo-bot added the type:factorization This is related to the Factorizations label Jul 2, 2025

MarcelKoch assigned nbeams Jul 3, 2025

MarcelKoch reviewed Jul 3, 2025

View reviewed changes

nbeams added 2 commits July 16, 2025 22:43

Adjust threshold select unit test to catch the bug

579e996

Use the new bucket's idx in call to sampleselect_filter for cuda/hip …

be8d2bb

…and dpcpp

nbeams force-pushed the fix-threshold-select-gpu branch from 847b4dc to be8d2bb Compare July 16, 2025 22:44

MarcelKoch approved these changes Jul 17, 2025

View reviewed changes

Add Natalie to contributors

c4efadb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix ParILUT threshold select for larger matrices with GPU backends #1877

Fix ParILUT threshold select for larger matrices with GPU backends #1877

Uh oh!

nbeams commented Jul 2, 2025

Uh oh!

MarcelKoch left a comment

Uh oh!

nbeams commented Jul 3, 2025

Uh oh!

MarcelKoch commented Jul 16, 2025

Uh oh!

nbeams commented Jul 16, 2025 •

edited

Loading

Uh oh!

MarcelKoch commented Jul 17, 2025

Uh oh!

Uh oh!

Fix ParILUT threshold select for larger matrices with GPU backends #1877

Are you sure you want to change the base?

Fix ParILUT threshold select for larger matrices with GPU backends #1877

Uh oh!

Conversation

nbeams commented Jul 2, 2025

Uh oh!

MarcelKoch left a comment

Choose a reason for hiding this comment

Uh oh!

nbeams commented Jul 3, 2025

Uh oh!

MarcelKoch commented Jul 16, 2025

Uh oh!

nbeams commented Jul 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MarcelKoch commented Jul 17, 2025

Uh oh!

Uh oh!

nbeams commented Jul 16, 2025 •

edited

Loading