
GxB_Context is essential: telling each GrB* method/operation which/how many OpenMP threads and GPU(s) to use #74

@DrTimothyAldenDavis


We're making progress on the CUDA kernels for SuiteSparse:GraphBLAS.

The change to the C API will be very slight. See the new "cuda" branch on LAGraph, which is now synced with the latest LAGraph dev branch. It has a few changes to LAGr_TriangleCount, but those are just for testing.

A new mode:

GrB_init (GxB_NONBLOCKING_GPU) ;
GrB_init (GxB_BLOCKING_GPU) ;

For now, when the GPU is in use, SuiteSparse:GraphBLAS will ignore the malloc/calloc/realloc/free pointers given to it by GxB_init. Instead, it will always use rmm_wrap_malloc, rmm_wrap_calloc, etc, in the SuiteSparse/GraphBLAS/rmm_wrap folder. Those are C-callable wrappers for the Rapids Memory Manager.
LAGr_Init will need to be given the rmm_wrap_* methods, however.

That's it, so far. But I need more.

We need more control over when the GPU is used, and which GPU is used. In particular, I have 4 GPUs on my system. If a user application spawns 4 user threads, each could use its own GPU. (For now, our CUDA kernels exploit only a single GPU).

So how do I tell GrB_mxm to use, say, "cuda device 3"? I can't use the descriptor, since not all methods have a descriptor: GrB_wait, GrB_Matrix_build, GrB_Matrix_dup, GrB_Matrix_nvals, and so on, must all be told which GPU to use. GrB_Matrix_nvals needs to know this because it may need to do GrB_wait, which may do a lot of work.

For GrB_wait, GrB_Matrix_dup, and GrB_Matrix_nvals, we could try to keep track of which GPU goes with a particular matrix. We may want to do that anyway, but it's awkward in general, since it doesn't extend well to GrB_mxm.

Enter the GxB_Context object, which I think solves this problem. Ideally this object should be passed to ALL GraphBLAS calls, for example:

GrB_mxm (context, C, M, accum, semiring, A, B, descriptor)

Something like that would be ideal, but adding a new parameter to each and every GrB* and GxB* call would be very disruptive. My solution is to add a GxB_Context object, and to place it in the user's thread-local-storage (threadprivate), just like what we did when GrB_error and GrB_wait had no input parameters in v1.0 of the C API.

// constructs a new context, also placing it into threadprivate storage
GxB_Context_new (&context) ; 

// To make all subsequent calls by this user thread use cuda device 3:
GxB_Context_set (context, GxB_GPU_DEVICE, 3) ; 

// To make all subsequent calls by this user thread use 4 openmp threads
GxB_Context_set (context, GxB_NTHREADS, 4) ;

// To free the user's threadprivate context object, either:
GxB_Context_free (&context) ;
GrB_free (&context) ;
// Freeing a context also sets the user's threadprivate context to NULL

And so on. I would still allow GxB_Global_Option_set to control the global number of OpenMP threads. So the precedence, for a particular call to a GrB* or GxB* method or operation, would be:

(1) if the GrB_Descriptor for a particular call has a non-default setting for the # of OpenMP threads, or for which (or how many) GPUs to use, then those settings are used. This is not applicable for GrB_wait, GrB_dup, GrB_build, etc, which sadly don't have a descriptor. If the GrB_Descriptor is present and has non-default settings, then the context (2) and global settings (3) are ignored.

(2) if the threadprivate context object exists (not NULL) and has non-default settings for the # of OpenMP threads, which or how many GPUs to use, etc, then those settings are used, and the global settings (3) are ignored.

(3) otherwise, use the global settings, which apply to all calls to GrB* and GxB* from all user threads. This defaults to omp_get_max_threads for OpenMP, and (perhaps) "cuda device zero" or "no cuda device will be used" as a default.

This GxB_Context would make a small change to the API, and it would not break backward compatibility with the v2.0 C API.

I may also want to add hints to tell a matrix or vector where to live, as in:

GxB_set (A, GxB_GPU_DEVICE, 3) ;

that would give GraphBLAS a hint that the GrB_Matrix A would like to live on GPU device 3. It would be just a hint and I could ignore it if I like. I'm not sure about GrB_mxm, where all 4 matrices might live on different GPUs. Perhaps GxB_set (A, GxB_GPU_DEVICE, 3) would just tell me to tell the Rapids memory manager where the data for this matrix should migrate to. I would not have to do anything else, just let RMM handle the rest.

I haven't implemented this GxB_Context object yet but I'm going to start work on it soon. It is absolutely essential to let GraphBLAS use the GPU, and it will greatly enhance the parallel-library composability of GraphBLAS when using OpenMP.

In the future, for a v3.0 C API, we could add the context object as the first (or last, as you like) parameter to ALL GrB functions, even the seemingly trivial ones like GrB_Type_new that don't seem like they need it. It would be odd to have it in GrB_Matrix_nvals but not in other GrB methods.

See also #48 which is closely related to this issue (another way to solve it). I think issue #48 is not the best way to solve this problem, however. The descriptor should be specific to an individual call to GrB_*. Pre-defined descriptors are read-only and handy to use. Trying to fit the context into the descriptor makes this a little awkward.

In a future distributed-memory API, the context could contain things like an MPI communicator, but I haven't thought through how that would work.
