
Failure when using more than 1 GPU in STRUMPACK MPI #126

@jinghu4

Hi, Dr. Ghysels,

I have encountered some issues when using STRUMPACK's multi-GPU support to solve a sparse linear system. I built STRUMPACK successfully with SLATE and MAGMA support.

1. When I run the STRUMPACK test suite with `make test`, the sparse_mpi and reuse_structure_mpi tests both fail:
```
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 0.178864 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.05367e-08
#   - replacing of small pivots is not enabled
CUDA assertion failed: invalid resource handle ~/STRUMPACK-v8.0.0/STRUMPACK-8.0.0/src/dense/CUDAWrapper.cu 114
[gpu01:2817703] *** Process received signal ***
```

However, the test passes when I run it on one GPU:

```
OMP_NUM_THREADS=1 mpirun -n 1 test_structure_reuse_mpi pde900.mtx
```

2. Random failures when solving a sparse system with STRUMPACK on multiple GPUs. For example, with 2 GPUs:

```
mpirun -n 2 --mca pml ucx myApplication.exe
```

a) Sometimes it passes:

```
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
# DenseMPI factorization complete, GPU=1, P=2, T=10: 0.170223 seconds, 0.00550864 GFLOPS, 0.0323613 GFLOP/s,  ds=203, du=0
```

(Why is GPU=1 here? Does it mean that only one GPU is used, even though two processes are run on the GPUs I requested?)

b) Sometimes it fails with the following error:

```
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 23.5596 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.05367e-08
#   - replacing of small pivots is not enabled
cuSOLVER assertion failed: 6 ~/STRUMPACK-v8.0.0/STRUMPACK-8.0.0/src/dense/CUDAWrapper.cpp 614
CUSOLVER_STATUS_EXECUTION_FAILED
```
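One way to check whether the two ranks actually end up on different GPUs (the GPU=1 question in a) is to pin each rank to its own device before the application starts and log the mapping. The script below is a minimal diagnostic sketch, not a confirmed fix. It assumes Open MPI (suggested by the `--mca pml ucx` option), which exports `OMPI_COMM_WORLD_LOCAL_RANK` to every process, and it assumes each node has at least as many GPUs as ranks. The script name `bind_gpu.sh` is hypothetical, and the sketch makes no assumption about how STRUMPACK selects devices internally.

```shell
#!/bin/sh
# bind_gpu.sh (hypothetical name): restrict each MPI rank to a single GPU.
# Assumption: launched by Open MPI's mpirun, which exports
# OMPI_COMM_WORLD_LOCAL_RANK (this process's index among the ranks on its node).
# Assumption: each node has at least as many GPUs as ranks, numbered from 0.

# Fall back to 0 when not launched under Open MPI.
local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

# Make only one physical GPU visible; CUDA renumbers it as device 0 in-process.
CUDA_VISIBLE_DEVICES=$local_rank
export CUDA_VISIBLE_DEVICES

# Report the mapping on stderr so it does not mix with the solver's stdout.
echo "local rank $local_rank -> CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" >&2

# Replace this shell with the actual program and its arguments, if given.
if [ "$#" -gt 0 ]; then
  exec "$@"
fi
```

After `chmod +x bind_gpu.sh`, it would be used as `mpirun -n 2 --mca pml ucx ./bind_gpu.sh myApplication.exe`. Each rank then prints which physical GPU it was given, which separates a device-assignment problem from a solver problem.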

Do you know what could be causing these issues, and how I should resolve them?

Best,
-Jing
